A counterargument to the "in one box" point.

I would like to point out that "having things in one box" is not a reason
to have the code residing in the same module/project.

What the user sees and how the code is structured are two very different
things in my opinion. Beam can certainly have modules developed at
different speeds and packaged "in one box" before the release. The Spring
Framework is a good example of that practice.

I would also like to show a very simple, current example where the Beam
user experience is lacking and unpredictable. In other words, an example
where integration between Beam components is non-existent even though
everything is currently "in one box". Consider this code:

var options = PipelineOptionsFactory.fromArgs(args).create();
var p = Pipeline.create(options);
// Explicitly register SerializableCoder for FooAvroRecord.
p.getCoderRegistry().registerCoderForClass(
        FooAvroRecord.class, SerializableCoder.of(FooAvroRecord.class));
var file = App.class.getClassLoader().getResource("avro1.avro").toURI().getPath();
var read = p.apply(AvroIO
        .read(FooAvroRecord.class)
        .from(file));
// Print the coder that the resulting PCollection actually uses.
System.out.println("Using coder:" + read.getCoder());

Can you guess which coder this simple pipeline will print? If you guessed
SerializableCoder, you'd be wrong. It prints "Using
coder:org.apache.beam.sdk.coders.AvroCoder@8b0f130", even though the user
explicitly registered the coder they want used.

Going by the argument that there is better integration because "everything
is in one box", there shouldn't be this disconnect between AvroIO and the
CoderRegistry, but here we are.
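
To make the disconnect concrete, here is a toy model in plain Java. It is
not Beam code, and every class and method name in it is hypothetical. It
only illustrates the behavior I observed: as far as I can tell, the read
transform decides its output coder on its own and never asks the registry,
whereas a user would expect an explicit registration to win over the
default.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a coder registry: maps a class to the name of its coder.
// Hypothetical; this is not Beam's CoderRegistry.
class ToyCoderRegistry {
    private final Map<Class<?>, String> coders = new HashMap<>();

    void register(Class<?> type, String coderName) {
        coders.put(type, coderName);
    }

    // Returns the registered coder, or the fallback if nothing was registered.
    String coderFor(Class<?> type, String fallback) {
        return coders.getOrDefault(type, fallback);
    }
}

// Toy stand-in for an IO transform that chooses the coder of its output.
class ToyAvroRead {
    // Models the observed behavior: the registry argument is ignored.
    static String hardCodedCoder(ToyCoderRegistry registry, Class<?> type) {
        return "AvroCoder";
    }

    // Models the expected behavior: an explicit registration overrides the default.
    static String registryAwareCoder(ToyCoderRegistry registry, Class<?> type) {
        return registry.coderFor(type, "AvroCoder");
    }
}

// Placeholder record type for the demo.
class FooRecord {}

class CoderLookupDemo {
    public static void main(String[] args) {
        ToyCoderRegistry registry = new ToyCoderRegistry();
        registry.register(FooRecord.class, "SerializableCoder");
        System.out.println("observed: " + ToyAvroRead.hardCodedCoder(registry, FooRecord.class));
        System.out.println("expected: " + ToyAvroRead.registryAwareCoder(registry, FooRecord.class));
    }
}
```

The second method is the kind of integration I would expect from
components that ship together: fall back to the Avro coder only when the
user has not registered anything for the class.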

I can provide many more examples of user-experience issues like this
one.

Even more frustrating, not only is everything in one box, but it's
mostly a **closed** box. A simple example: I want to extend the *Utils
classes (AvroUtils, POJOUtils, etc.) so that the methods that return a
Beam Schema or SchemaCoder use the NanosInstant logical type for all
properties of type java.time.Instant, because I don't use
org.joda.time.Instant anywhere in my code. It would be nice to override a
given method, inject an implementation that Beam internals will use, or at
least have some configuration-based solution to achieve this. Yet, to my
knowledge, that simply is not possible right now, so things like the code
below are broken and very hard to work around.

var options = PipelineOptionsFactory.fromArgs(args).create();
var p = Pipeline.create(options);
var file = App.class.getClassLoader().getResource("avro1.avro").toURI().getPath();
// Same read as before, but asking AvroIO to infer a Beam schema.
var read = p.apply(AvroIO
        .read(FooAvroRecord.class)
        .withBeamSchemas(true)
        .from(file));
System.out.println("Using coder:" + read.getCoder());

This will crash and burn with the following:

Using coder:SchemaCoder<Schema: Fields:
Field{name=id, description=, type=STRING NOT NULL, options={{}}}
Field{name=someDate, description=, type=DATETIME NOT NULL, options={{}}}
Encoding positions:
{someDate=1, id=0}
Options:{{}}UUID: 2d857f41-7d02-43d6-9b8b-5e7411985aef  UUID:
2d857f41-7d02-43d6-9b8b-5e7411985aef delegateCoder:
org.apache.beam.sdk.coders.Coder$ByteBuddy$wgn7lZ1A@17ae7628
Exception in thread "main" java.lang.ClassCastException: class
java.time.Instant cannot be cast to class org.joda.time.Instant
(java.time.Instant is in module java.base of loader 'bootstrap';
org.joda.time.Instant is in unnamed module of loader 'app')
at org.apache.beam.sdk.coders.InstantCoder.encode(InstantCoder.java:32)
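
The cast failure aside, a millisecond DATETIME field is not an equivalent
home for a java.time.Instant in the first place. The plain Java sketch
below (no Beam involved, class and method names are mine) shows what a
round trip through epoch milliseconds, the precision a Joda-Time Instant
carries, does to a nanosecond timestamp. The seconds-plus-nanos variant is
my assumption of the shape a nanosecond logical type such as NanosInstant
would store.

```java
import java.time.Instant;

class InstantPrecisionDemo {
    // Round trip through epoch milliseconds: sub-millisecond digits are dropped.
    static Instant viaEpochMillis(Instant original) {
        return Instant.ofEpochMilli(original.toEpochMilli());
    }

    // Round trip through (seconds, nanos): assumed shape of a nanosecond
    // logical type; nothing is lost.
    static Instant viaSecondsAndNanos(Instant original) {
        long seconds = original.getEpochSecond();
        int nanos = original.getNano();
        return Instant.ofEpochSecond(seconds, nanos);
    }

    public static void main(String[] args) {
        Instant original = Instant.ofEpochSecond(1_671_000_000L, 123_456_789);
        System.out.println("original:       " + original);
        System.out.println("via millis:     " + viaEpochMillis(original));
        System.out.println("via secs+nanos: " + viaSecondsAndNanos(original));
    }
}
```

This is why I want the schema inference to pick a nanosecond logical type
for java.time.Instant properties rather than DATETIME.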

It's already a long email, but I would like to conclude by saying that, in
my opinion, the Beam Java SDK project should prioritize long-term
architecture, code quality, and extensibility, so that it can serve both
the use cases the Beam community anticipated and those it didn't. And if a
group of Beam Java SDK users absolutely needs backwards compatibility as a
top priority, it should be up to them to package a modularized Beam Java
SDK to fit their specific needs.

On Wed., Dec. 14, 2022, 12:43 Ahmet Altay via dev, <dev@beam.apache.org>
wrote:

> I agree with Sachin. Keeping components that users will have to bring
> together anyway leads to a better user experience. Counter example to that
> is GCP libraries in my opinion. It was a frequent struggle for users to
> find a working set of libraries until there was a BOM. And even after the
> BOM it is still somewhat of a struggle for users and the developers of
> those various libraries need to take on some of the toil of testing those
> various libraries together anyway.
>
> re: Take it with a grain of salt since I'm not even a committer - All
> inputs are welcome here. I do not think my comments should carry more
> weight just because I am a committer.
>
> On Wed, Dec 14, 2022 at 9:36 AM Sachin Agarwal via dev <
> dev@beam.apache.org> wrote:
>
>> I strongly believe that we should continue to have Beam optimize for the
>> user - and while having separate components would allow those of us who are
>> contributors and committers move faster, the downsides of not having
>> everything "in one box" for a new user where the components are all
>> relatively guaranteed to work together at that version level are very high.
>>
>> Beam having everything included is absolutely a competitive advantage for
>> Beam and I would not want to lose that.
>>
>> On Wed, Dec 14, 2022 at 9:31 AM Byron Ellis via dev <dev@beam.apache.org>
>> wrote:
>>
>>> Take it with a grain of salt since I'm not even a committer, but is
>>> perhaps the reorganization of Beam into smaller components the real work of
>>> a 3.0 effort? Splitting of Beam into smaller more independently managed
>>> components would be a pretty huge breaking change from a dependency
>>> management perspective which would potentially be largely separate from any
>>> code changes.
>>>
>>> Best,
>>> B
>>>
>>> On Wed, Dec 14, 2022 at 9:23 AM Alexey Romanenko <
>>> aromanenko....@gmail.com> wrote:
>>>
>>>> On 12 Dec 2022, at 22:23, Robert Bradshaw via dev <dev@beam.apache.org>
>>>> wrote:
>>>>
>>>>
>>>> Saving up all the breaking changes until a major release definitely
>>>> has its downsides (look at Python 3). The migration path is often as
>>>> important (if not more so) than the final destination.
>>>>
>>>>
>>>> Actually, it proves that the major releases *should not* be delayed
>>>> for a long period of time and *should* be issued more often to reduce
>>>> the number of breaking changes (that, of course, likely may happen). That
>>>> will help users to do much more smooth and less risky upgrades, and
>>>> developers to not keep burden forever. Beam 2.0.0 was released back in May
>>>> 2017 and we've almost never talked about Beam 3.0 and what are the criteria
>>>> for it. I understand that it’s a completely different discussion but seems
>>>> that this time has come =)
>>>>
>>>> As for this particular change, I would question how the benefit (it's
>>>> unclear what the exact benefit is--better internal organization?)
>>>> exceeds the pain of making every user refactor their code. I think a
>>>> stronger case can be made for things like the Avro dependency that
>>>> cause real pain.
>>>>
>>>>
>>>> Agree. I think that if it doesn’t bring any pain with additional
>>>> external dependencies and this code is used in almost every other SDK
>>>> module, then there are no reasons for such breaking changes. On the other
>>>> hand, Avro case, that you mentioned above, is a good example why sometimes
>>>> it would be better to keep such code outside of “core”.
>>>>
>>>> As for the pipeline update feature, we've long discussed having
>>>> "pick-your-implementation" transforms that specify alternative,
>>>> equivalent implementations. Upgrades can choose the old one whereas
>>>> new pipelines can get the latest and greatest. It won't solve all
>>>> issues, and requires keeping old codepaths around, but could be an
>>>> important step forward.
>>>>
>>>> On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles <k...@apache.org>
>>>> wrote:
>>>>
>>>>
>>>> I agree with Moritz. To answer a few specifics in my own words:
>>>>
>>>> - It is a perfectly sensible refactor, but as a counterpoint without
>>>> file-based IO the SDK isn't functional so it is also a reasonable design
>>>> point to have this included. There are other things in the core SDK that
>>>> are far less "core" and could be moved out with greater benefit. The main
>>>> goal for any separation of modules would be lighter weight transitive
>>>> dependencies, IMO.
>>>>
>>>> - No, Beam has not made any deliberate breaking changes of this nature.
>>>> Hence we are still on major version 2. We have made some bugfixes for data
>>>> loss risks that could be called "breaking changes" but since the feature
>>>> was unsafe to use in the first place we did not bump the major version.
>>>>
>>>> - It is sometimes possible to do such a refactor and have the
>>>> deprecated location proxy to the new location. In this case that seems hard
>>>> to achieve.
>>>>
>>>> - It is not actually necessary to maintain both locations, as we can
>>>> declare the old location will be unmaintained (but left alone) and all new
>>>> development goes to the new location. That isn't a great choice for users
>>>> who may simply upgrade their SDK version and not notice that their old code
>>>> is now pointing at a version that will not receive e.g. security updates.
>>>>
>>>> - I like the style where if/when we transition from Beam 2 to Beam 3 we
>>>> should have the exact functionality of Beam 3 available as an opt-in flag
>>>> first. So if a user passes --beam-3 they get exactly what will be the
>>>> default functionality when we bump the major version. It really is a
>>>> problem to do a whole bunch of stuff feverishly before a major version
>>>> bump. The other style that I think works well is the linux kernel style
>>>> where major versions alternate between stable and unstable (in other words,
>>>> returning to the 0.x style with every alternating version).
>>>>
>>>> - I do think Beam suffers from fear and inability to do significant
>>>> code gardening. I don't think backwards compatibility in the code sense is
>>>> the biggest blocker. I think the "pipeline update" feature is perhaps the
>>>> thing most holding Beam back from making radical rapid forward progress.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Dec 12, 2022 at 2:25 AM Moritz Mack <mm...@talend.com> wrote:
>>>>
>>>>
>>>> Hi Damon,
>>>>
>>>>
>>>>
>>>> I fear the current release / versioning strategy of Beam doesn’t lend
>>>> itself well for such breaking changes. Alexey and I have spent quite some
>>>> time discussing how to proceed with the problematic Avro dependency in core
>>>> (and respectively AvroIO, of course).
>>>>
>>>> Such changes essentially always require duplicating code to continue
>>>> supporting a deprecated legacy code path to not break users’ code. But this
>>>> comes at a very high price. Until the deprecated code path can be finally
>>>> removed again, it must be maintained in two places.
>>>>
>>>> Unfortunately, the removal of deprecated code is rather problematic
>>>> without a major version release as it would break semantic versioning and
>>>> people’s expectations. With that deprecations bear the inherent risk to
>>>> unintentionally deplete quality rather than improving it.
>>>>
>>>> I’d therefore recommend against such efforts unless there’s very strong
>>>> reasons to do so.
>>>>
>>>>
>>>>
>>>> Best, Moritz
>>>>
>>>>
>>>>
>>>> On 07.12.22, 18:05, "Damon Douglas via dev" <dev@beam.apache.org>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> Hello Everyone,
>>>>
>>>>
>>>>
>>>> If you identify yourself on the Beam learning journey, even if this is
>>>> your first day, please see yourself as a welcome participant in this
>>>> conversation and consider reviewing the bottom portion of this email for
>>>> guidance.
>>>>
>>>>
>>>>
>>>> The Short Version (For those with Java Beam SDK knowledge):
>>>>
>>>>
>>>>
>>>> Should we migrate FileIO / TextIO and related classes from
>>>> :sdks:java:core to :sdks:java:io:file?  If so, should we target such a
>>>> migration to a future Beam version with repeated announcements?  Does the
>>>> Beam repository have any example of a similar change in the past?  What
>>>> learnings from said past change could be potentially applied to this one?
>>>>
>>>>
>>>>
>>>> The Long Version (For those on the learning path):
>>>>
>>>>
>>>>
>>>> This email is more about our repository organization rather than Beam.
>>>> The proposal is to move two highly used classes (and anything related) in
>>>> our Java SDK called FileIO [1] and TextIO [2].  The Beam GitHub repository
>>>> uses a software called gradle [3], to automate routine code tasks such as
>>>> build and test.  Gradle projects, such as Beam, organize code in what are
>>>> called modules [4].  The three main ingredients that make a module are 1) a
>>>> unique directory path, 2) a file called build.gradle (or build.gradle.kts)
>>>> in this directory, 3) referencing the gradle module in a settings.gradle
>>>> (or settings.gradle.kts) file at the root of the repository.
>>>>
>>>>
>>>>
>>>> The gradle documentation discusses why such organization might matter
>>>> and how to achieve this with large projects [5].  Essentially, modules
>>>> allow us to have mini-projects inside our large project and focus related
>>>> automations to this one focused portion of our larger repository.  In Beam,
>>>> we have the module :sdks:java:core [6] with all things related to the core
>>>> of Beam, whereas we have separate modules related to reading from and
>>>> writing to various resources within :sdks:java:io [7].
>>>>
>>>>
>>>>
>>>> The proposal suggests moving the aforementioned file reading and
>>>> writing classes, FileIO and TextIO, and anything related, to its own
>>>> :sdks:java:io:file module.  This would correspond to a new
>>>> sdks/java/io/file directory and moving these classes into
>>>> sdks/java/io/file/main/java/org/apache/beam/sdk/io/file.
>>>>
>>>>
>>>>
>>>> Definitions / References:
>>>>
>>>>
>>>>
>>>> 1. FileIO - general-purpose transforms for working with files:
>>>> listing files (matching), reading and writing.  See -
>>>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileIO.html
>>>>
>>>>
>>>>
>>>> 2. TextIO - Similar to FileIO but focused on text files.  See
>>>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/TextIO.html
>>>>
>>>>
>>>>
>>>> 3. Gradle - a build automation tool used by the Apache Beam repository
>>>> to automate code-related tasks.  See
>>>> https://docs.gradle.org/current/userguide/what_is_gradle.html
>>>>
>>>>
>>>>
>>>> 4. Gradle Module - a subsection of your larger repository.  See
>>>> https://docs.gradle.org/current/userguide/dependency_management_terminology.html#sub:terminology_module
>>>>
>>>>
>>>>
>>>> 5. Structuring Large Projects with Gradle -
>>>> https://docs.gradle.org/current/userguide/structuring_software_products.html
>>>>
>>>>
>>>>
>>>> 6. sdks:java:core - Corresponds to the sdks/java/core repository
>>>> directory. See
>>>> https://github.com/apache/beam/tree/master/sdks/java/core
>>>>
>>>>
>>>>
>>>> 7. sdks:java:io - Corresponds to the sdks/java/io repository
>>>> directory.  See https://github.com/apache/beam/tree/master/sdks/java/io
>>>>
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>>
>>>> Damon
