Re: Dependency management for multiple IOs

Reuven Lax Mon, 18 Feb 2019 11:33:35 -0800

I don't think this is a SQL-specific problem. Beam (especially when using a
variety of different IOs in a single pipeline) aggregates many dependencies
into one binary, which sometimes creates this sort of pain. When users have
organizational reasons to pin specific dependencies, then things get even
worse. I still don't know if there is a perfect solution to all of this.


Reuven

On Fri, Feb 15, 2019 at 7:42 PM Kenneth Knowles <[email protected]> wrote:

> I'm not totally convinced Beam's dep versions are the issue here. A user
> may have an organizational requirement of a particular version of, say,
> Kafka and Hive. So when they depend on Beam they probably pin those
> versions of Kafka and Hive which they have determined work together, and
> they hope that the Beam IOs work together.
>
> I see this as a choice between two scenarios for users:
>
> 1. SQL <------- KafkaTable (@AutoService) ------> KafkaIO ---provided-----> Kafka
> 2. SQL (includes KafkaTable) ----optional----> KafkaIO -----provided-----> Kafka
>
> For users of 1, they depend on Beam Java, Beam SQL, SQL Kafka Table, and
> pin a version of Kafka.
> For users of 2, they depend on Beam Java, Beam SQL, KafkaIO, and pin a
> version of Kafka.
>
> To be honest, it is really hard to see which is preferable. I think number
> 1 has fewer funky dependency edges and more simple "compile + runtime"
> dependencies.
>
> Kenn
>
> On Fri, Feb 15, 2019 at 6:06 PM Chamikara Jayalath <[email protected]>
> wrote:
>
>> I think the underlying problem is two modules of Beam transitively
>> depending on conflicting dependencies (a.k.a. the diamond dependency
>> problem)?
>>
>> I think the general solution for this is twofold (at least the way we
>> have formulated it in https://beam.apache.org/contribute/dependencies/)
>>
>> (1) Keep Beam dependencies up to date as much as possible, hoping that
>> transitive dependencies stay compatible (we rely on semantic versioning
>> here to avoid problems from differences in minor/patch versions, though
>> that might not hold in practice for some dependencies).
>> (2) For modules with outdated dependencies that we cannot upgrade for
>> some reason, we'll vendor those modules.
>>
>> Not sure if your specific problem needs something more.
>>
>> Thanks,
>> Cham
>>
>> On Fri, Feb 15, 2019 at 4:48 PM Anton Kedin <[email protected]> wrote:
>>
>>> Hi dev@,
>>>
>>> I have a problem: I don't know a good way to approach the dependency
>>> management between Beam SQL and Beam IOs, and I want to collect thoughts
>>> about it.
>>>
>>> Beam SQL depends on specific IOs so that users can query them. The IOs
>>> need their dependencies to work. Sometimes the IOs also leak their
>>> transitive dependencies (e.g. HCatRecord leaked from HCatalogIO). So if in
>>> SQL we want to build abstractions on top of these IOs we risk having to
>>> bundle the whole IOs or the leaked dependencies. Overall we can probably
>>> avoid it by making the IOs `provided` dependencies, and by refactoring the
>>> code that leaks. In this case things can be made to build, simple tests
>>> will run, and we won't need to bundle the IOs within SQL.
>>>
>>> But as soon as there's a need to actually work with multiple IOs at the
>>> same time the conflicts appear. For example, for testing of Hive/HCatalog
>>> IOs in SQL we need to create an embedded Hive Metastore instance. It is a
>>> very Hive-specific thing that requires its own dependencies that have to be
>>> loaded during testing as part of SQL project. And some other IOs (e.g.
>>> KafkaIO) can bring similar but conflicting dependencies which means that we
>>> cannot easily work with or test both IOs at the same time within SQL. I
>>> think it will become insane as the number of IOs supported in SQL grows.
>>>
>>> So the question is how to avoid conflicts between IOs within SQL?
>>>
>>> One approach is to create separate packages for each of the SQL-specific
>>> IO wrappers, e.g. `beam-sdks-java-extensions-sql-hcatalog`,
>>> `beam-sdks-java-extensions-sql-kafka`, etc. These projects will
>>> compile-depend on Beam SQL and on the specific IO.
>>> Beam SQL will load these either from user-specified configuration or
>>> something like @AutoService at runtime. This way Beam SQL doesn't know
>>> about the details of the IOs and their dependencies, and they can be easily
>>> tested in isolation without conflicting with each other. This should also
>>> be relatively simple to manage if things change; the build logic should be
>>> straightforward and easy to update. On the negative side, each of the
>>> projects will require its own separate build logic, it will not be easy to
>>> test multiple IOs together within SQL, and users will have to manage the
>>> conflicting dependencies by themselves.
>>>
>>> Another approach is to keep things roughly as they are but create
>>> separate configurations within the main `build.gradle` in SQL project,
>>> where configurations will correspond to separate IOs or use cases (e.g.
>>> testing of Hive-related IOs). The benefit is that everything related to SQL
>>> IOs stays roughly in one place (including build logic) and can be built and
>>> tested together when possible. The negative side is that it will probably
>>> involve some Groovy magic and classpath manipulation within Gradle tasks to
>>> make the configurations work, plus it may be brittle if we change our
>>> top-level Beam build logic. And this approach also doesn't make it easier
>>> for the users to manage the conflicts.
>>>
>>> Longer term we could probably also reduce the abstraction thickness on
>>> top of the IOs, so that Beam SQL can work directly with IOs. For this to
>>> work the supported IOs will need to expose things like `readRows()` and
>>> get/set the schema on the PCollection. This is probably aligned with the
>>> Schema work that's happening at the moment but I don't know whether it
>>> makes sense to focus on this right now. This does not solve the dependency
>>> problem either, but I think it will at least be the same problem that
>>> users already have when they see conflicts using multiple IOs in
>>> Beam pipelines.
>>>
>>> Thoughts, ideas? Did anyone ever face a problem like this, or am I
>>> completely misunderstanding something in the Beam build logic?
>>>
>>> Regards,
>>> Anton
>>>
>>
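
The first approach in the original message (per-IO wrapper modules that the
core SQL module discovers at runtime) might look roughly like the sketch
below. This is only an illustration under stated assumptions, not Beam's
actual code: `SqlTableProvider`, `KafkaTableProvider`, and
`SqlTableProviderRegistry` are hypothetical names, and the provider body is
a stand-in for a real wrapper around KafkaIO. In a real build each provider
would sit in its own module and be registered for `java.util.ServiceLoader`,
for example via `@AutoService` or a hand-written `META-INF/services` file.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Hypothetical SPI owned by the core SQL module. It mentions no IO classes,
// so the SQL module needs no Kafka or Hive dependencies to compile.
interface SqlTableProvider {
    String tableType();            // e.g. "kafka" or "hcatalog"
    String describe(String name);  // stand-in for building a real table
}

// Hypothetical wrapper that would live in its own module (something like
// the sql-kafka package from the thread) and compile-depend on SQL + KafkaIO.
final class KafkaTableProvider implements SqlTableProvider {
    @Override public String tableType() { return "kafka"; }
    @Override public String describe(String name) { return "kafka table " + name; }
}

// Core-side registry: it only sees whatever providers are handed to it.
final class SqlTableProviderRegistry {
    private final Map<String, SqlTableProvider> byType = new LinkedHashMap<>();

    SqlTableProviderRegistry(Iterable<SqlTableProvider> providers) {
        for (SqlTableProvider p : providers) {
            // Reject two providers claiming the same table type.
            if (byType.putIfAbsent(p.tableType(), p) != null) {
                throw new IllegalStateException(
                    "Duplicate provider for table type: " + p.tableType());
            }
        }
    }

    // Discovers providers registered on the classpath; the result is empty
    // if no wrapper module is present.
    static SqlTableProviderRegistry fromClasspath() {
        return new SqlTableProviderRegistry(ServiceLoader.load(SqlTableProvider.class));
    }

    SqlTableProvider forType(String type) {
        SqlTableProvider p = byType.get(type);
        if (p == null) {
            throw new IllegalArgumentException(
                "No provider on classpath for table type: " + type);
        }
        return p;
    }
}
```

With this shape, a user who only queries Kafka adds just the Kafka wrapper
module and pins a Kafka version. A lookup for a table type whose wrapper is
absent fails with a clear error rather than pulling in that IO's
dependencies. Conflicts between two wrappers that are both on the classpath
are still left to the user, as the original message notes.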
