Hi Anton,
jackson is the JSON extension, just as we have one for XML. Agree that it should be
documented.
Agree about join-library.
sketching is a statistics extension providing ready-to-use stats
CombineFns.
Regards
JB
On 03/10/2018 20:25, Anton Kedin wrote:
Hi dev@,
*TL;DR:* `sdks/java/extensions` is hard to discover, navigate and
understand.
*Current State:*
I was looking at `sdks/java/extensions`[1] and realized that I don't
know what half of those things are. Only `join library` and `sorter`
seem to be documented and discoverable on Beam website, under SDKs
section [2].
Here's the list of all extensions with my questions/comments:
- /google-cloud-platform-core/. What is this? Is this used in GCP IOs?
If so, is `extensions` the right place for it? If it is, then why is it
a `-core` extension? It feels like it's a utility package, not an extension;
- /jackson/. I can guess what it is but we should document it somewhere;
- /join-library/. It is documented, but I think we should add more
documentation to explain how it works, maybe some caveats, and link
to/from the `CoGBK` section of the doc;
- /protobuf/. I can probably guess what it is. Is 'extensions' the
right place for it though? We use this library in IOs
(`PubsubIO.readProtos()`), should we move it to IO then? Same as with
the GCP extension, it feels like a utility library, not an extension;
- /sketching/. No idea what to expect from this without reading the code;
- /sorter/. Documented on the website;
- /sql/. This looks familiar :) It is documented but not linked from
the extensions section, it's unclear whether it's the whole SQL or just
some related components;
[1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
[2]: https://beam.apache.org/documentation/sdks/java-extensions/
*Questions:*
- should we minimally document (at least describe) all extensions and
add at least short readme.md's with the links to the Beam website?
- is it right to depend on `extensions` in other components
like IOs?
- would it make sense to move some things out of 'extensions'? E.g. IO
components to IO or utility package, SQL into new DSLs package;
*Opinion:*
Maybe I am misunderstanding the intent and meaning of 'extensions', but
from my perspective:
- I think that extensions should be more or less isolated from the
Beam SDK itself, so that if you delete or modify them, no Beam-internal
changes will be required (changes to something that's not an extension).
And my feeling is that they should provide value by themselves to users
other than SDK authors. They are called 'extensions', not 'critical
components' or 'sdk utilities';
- I don't think that IOs should depend on 'extensions'. Otherwise the
question is, is it ok for other components, like runners, to do the
same? Or even core?
- I think there are a few distinguishable classes of things in
'extensions' right now:
- collections of `PTransforms` with some business logic (Sorter,
Join, Sketch);
- collections of `PTransforms` with a focus on parsing (Jackson, Protobuf);
- DSLs: the SQL DSL is more than just a few `PTransforms`; it can be
used almost as a standalone SDK. Things like Euphoria will probably end
up in the same class;
- utility libraries shared by some parts of the SDK, where it is unclear
whether they are valuable by themselves to external users (Protobuf, GCP core);
To me it makes sense for the business logic and parsing libraries to stay
in extensions, but probably under different subdirectories. I think it
would make sense to split the others out of extensions into separate parts of
the SDK.
- I think we should add readme.md's with short descriptions and links
to Beam website;
Thoughts, comments?