Hi dev@,

*TL;DR:* `sdks/java/extensions` is hard to discover, navigate and
understand.

*Current State:*

I was looking at `sdks/java/extensions`[1] and realized that I don't know
what half of those things are. Only `join library` and `sorter` seem to be
documented and discoverable on the Beam website, under the SDKs section [2].

Here's the list of all extensions with my questions/comments:
 - *google-cloud-platform-core*. What is this? Is this used in GCP IOs? If
so, is `extensions` the right place for it? If it is, then why is it a
`-core` extension? It feels like it's a utility package, not an extension;
 - *jackson*. I can guess what it is but we should document it somewhere;
 - *join-library*. It is documented, but I think we should add more
documentation to explain how it works, maybe some caveats, and link to/from
the `CoGBK` section of the doc;
 - *protobuf*. I can probably guess what it is. Is 'extensions' the right
place for it though? We use this library in IOs (`PubsubIO.readProtos()`),
so should we move it to IO then? As with the GCP extension, it feels like a
utility library, not an extension;
 - *sketching*. No idea what to expect from this without reading the code;
 - *sorter*. Documented on the website;
 - *sql*. This looks familiar :) It is documented but not linked from the
extensions section, and it's unclear whether it's the whole of SQL or just
some related components;

[1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
[2]: https://beam.apache.org/documentation/sdks/java-extensions/

*Questions:*

 - should we minimally document (at least describe) all extensions and add
at least short readme.md files with links to the Beam website?
 - is it right to depend on `extensions` in other components like IOs?
 - would it make sense to move some things out of 'extensions'? E.g. IO
components to IO or a utility package, SQL into a new DSLs package;

*Opinion:*

Maybe I am misunderstanding the intent and meaning of 'extensions', but
from my perspective:

 - I think that extensions should be more or less isolated from the Beam
SDK itself, so that if you delete or modify them, no Beam-internal changes
will be required (changes to something that's not an extension). And my
feeling is that they should provide value by themselves to users other than
SDK authors. They are called 'extensions', not 'critical components' or
'sdk utilities';

 - I don't think that IOs should depend on 'extensions'. Otherwise the
question is, is it ok for other components, like runners, to do the same?
Or even core?

 - I think there are a few distinguishable classes of things in
'extensions' right now:
     - collections of `PTransforms` with some business logic (Sorter, Join,
Sketch);
     - collections of `PTransforms` with a focus on parsing (Jackson,
Protobuf);
     - DSLs. The SQL DSL is more than just a few `PTransforms`; it can be
used almost as a standalone SDK. Things like Euphoria will probably end up
in the same class;
     - utility libraries shared by some parts of the SDK, where it is
unclear whether they are valuable by themselves to external users
(Protobuf, GCP core);
   To me, it makes sense for the business logic and parsing libraries to
stay in extensions, but probably under different subdirectories. I think it
would make sense to split the others out of extensions into separate parts
of the SDK.

 - I think we should add readme.md files with short descriptions and links
to the Beam website;

Thoughts, comments?

