Re: Dependency management for multiple IOs

Kenneth Knowles Mon, 18 Feb 2019 19:27:36 -0800

Yea. Beam SQL basically just has a clone of the Beam IO problem. So the
question is whether to mirror the module structure or to put all the
SQL-specific adapters in one module (or a few) with lots of optional
dependencies, and probably a complex build.gradle to support running a
variety of integration tests with different stuff on the classpath.


Kenn

On Mon, Feb 18, 2019 at 11:33 AM Reuven Lax <[email protected]> wrote:

> I don't think this is a SQL-specific problem. Beam (especially when using
> a variety of different IOs in a single pipeline) aggregates many
> dependencies into one binary, which sometimes creates this sort of pain.
> When users have organizational reasons to pin specific dependencies, then
> things get even worse. I still don't know if there is a perfect solution to
> all of this.
>
> Reuven
>
> On Fri, Feb 15, 2019 at 7:42 PM Kenneth Knowles <[email protected]> wrote:
>
>> I'm not totally convinced Beam's dep versions are the issue here. A user
>> may have an organizational requirement of a particular version of, say,
>> Kafka and Hive. So when they depend on Beam they probably pin those
>> versions of Kafka and Hive which they have determined work together, and
>> they hope that the Beam IOs work together.
>>
>> I see this as a choice between two scenarios for users:
>>
>> 1. SQL <------- KafkaTable (@AutoService) ------> KafkaIO ---provided-----> Kafka
>> 2. SQL (includes KafkaTable) ----optional----> KafkaIO -----provided-----> Kafka
>>
>> Users of 1 depend on Beam Java, Beam SQL, and SQL Kafka Table, and pin a
>> version of Kafka.
>> Users of 2 depend on Beam Java, Beam SQL, and KafkaIO, and pin a
>> version of Kafka.
>>
>> To be honest, it is really hard to see which is preferable. I think number
>> 1 has fewer funky dependency edges and simpler "compile + runtime"
>> dependencies.
>>
>> Kenn
>>
>> On Fri, Feb 15, 2019 at 6:06 PM Chamikara Jayalath <[email protected]>
>> wrote:
>>
>>> I think the underlying problem is two modules of Beam transitively
>>> depending on conflicting dependencies (a.k.a. the diamond dependency
>>> problem)?
>>>
>>> I think the general solution for this is twofold (at least the way we
>>> have formulated it in https://beam.apache.org/contribute/dependencies/).
>>>
>>> (1) Keep Beam dependencies up to date as much as possible, hoping that
>>> transitive dependencies stay compatible (we rely on semantic versioning
>>> here to avoid problems from differences in minor/patch versions, though
>>> that might not hold in practice for some dependencies).
>>> (2) For modules with outdated dependencies that we cannot upgrade due to
>>> some reason, we'll vendor those modules.
>>>
>>> Not sure if your specific problem needs something more.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Fri, Feb 15, 2019 at 4:48 PM Anton Kedin <[email protected]> wrote:
>>>
>>>> Hi dev@,
>>>>
>>>> I have a problem: I don't know a good way to approach dependency
>>>> management between Beam SQL and Beam IOs, and I want to collect thoughts
>>>> about it.
>>>>
>>>> Beam SQL depends on specific IOs so that users can query them. The IOs
>>>> need their dependencies to work. Sometimes the IOs also leak their
>>>> transitive dependencies (e.g. HCatRecord leaked from HCatalogIO). So if in
>>>> SQL we want to build abstractions on top of these IOs we risk having to
>>>> bundle the whole IOs or the leaked dependencies. Overall we can probably
>>>> avoid it by making the IOs `provided` dependencies, and by refactoring the
>>>> code that leaks. In this case things can be made to build, simple tests
>>>> will run, and we won't need to bundle the IOs within SQL.
>>>>
>>>> But as soon as there's a need to actually work with multiple IOs at the
>>>> same time the conflicts appear. For example, for testing of Hive/HCatalog
>>>> IOs in SQL we need to create an embedded Hive Metastore instance. It is a
>>>> very Hive-specific thing that requires its own dependencies that have to be
>>>> loaded during testing as part of SQL project. And some other IOs (e.g.
>>>> KafkaIO) can bring similar but conflicting dependencies which means that we
>>>> cannot easily work with or test both IOs at the same time within SQL. I
>>>> think it will become insane as the number of IOs supported in SQL grows.
>>>>
>>>> So the question is how to avoid conflicts between IOs within SQL?
>>>>
>>>> One approach is to create separate packages for each of the
>>>> SQL-specific IO wrappers, e.g. `beam-sdks-java-extensions-sql-hcatalog`,
>>>> `beam-sdks-java-extensions-sql-kafka`, etc. These projects will
>>>> compile-depend on Beam SQL and on the specific IO.
>>>> Beam SQL will load these either from user-specified configuration or
>>>> something like @AutoService at runtime. This way Beam SQL doesn't know
>>>> about the details of the IOs and their dependencies, and they can be easily
>>>> tested in isolation without conflicting with each other. This should also
>>>> be relatively simple to manage if things change; the build logic should be
>>>> straightforward and easy to update. On the negative side, each of the
>>>> projects will require its own separate build logic, it will not be easy to
>>>> test multiple IOs together within SQL, and users will have to manage the
>>>> conflicting dependencies by themselves.
>>>>
>>>> Another approach is to keep things roughly as they are but create
>>>> separate configurations within the main `build.gradle` in SQL project,
>>>> where configurations will correspond to separate IOs or use cases (e.g.
>>>> testing of Hive-related IOs). The benefit is that everything related to SQL
>>>> IOs stays roughly in one place (including build logic) and can be built and
>>>> tested together when possible. The negative side is that it will probably
>>>> involve some Groovy magic and classpath manipulation within Gradle tasks to
>>>> make the configurations work, plus it may be brittle if we change our
>>>> top-level Beam build logic. And this approach also doesn't make it easier
>>>> for the users to manage the conflicts.
>>>>
>>>> Longer term we could probably also reduce the abstraction thickness on
>>>> top of the IOs, so that Beam SQL can work directly with IOs. For this to
>>>> work the supported IOs will need to expose things like `readRows()` and
>>>> get/set the schema on the PCollection. This is probably aligned with the
>>>> Schema work that's happening at the moment but I don't know whether it
>>>> makes sense to focus on this right now. The problem of the dependencies is
>>>> not solved here either, but I think it will be at least the same problem
>>>> that users already have if they see conflicts when using multiple IOs with
>>>> Beam pipelines.
>>>>
>>>> Thoughts, ideas? Did anyone ever face a problem like this, or am I
>>>> completely misunderstanding something in the Beam build logic?
>>>>
>>>> Regards,
>>>> Anton
>>>>
>>>
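
For illustration only, the first approach in Anton's original message (one wrapper module per IO, discovered at runtime via something like @AutoService) could look roughly like the sketch below. It uses the standard java.util.ServiceLoader, which is what reads the META-INF/services entries that @AutoService generates. `AdapterRegistry` and `SqlTableAdapter` are hypothetical names made up for this sketch, not Beam SQL's actual classes, and the real wiring in Beam may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.ServiceLoader;

/** Hypothetical registry living in the core SQL module; it has no IO dependencies. */
final class AdapterRegistry {

  /** Hypothetical SPI that each per-IO wrapper module would implement. */
  interface SqlTableAdapter {
    /** Table type this adapter handles, e.g. "kafka" or "hcatalog". */
    String tableType();
  }

  private final Map<String, SqlTableAdapter> adapters = new LinkedHashMap<>();

  /** Builds a registry from whatever adapter modules happen to be on the classpath. */
  static AdapterRegistry fromClasspath() {
    AdapterRegistry registry = new AdapterRegistry();
    // ServiceLoader finds implementations listed under META-INF/services;
    // with no wrapper modules on the classpath this loop simply does nothing.
    for (SqlTableAdapter adapter : ServiceLoader.load(SqlTableAdapter.class)) {
      registry.register(adapter);
    }
    return registry;
  }

  /** Registers an adapter, rejecting two adapters that claim the same table type. */
  void register(SqlTableAdapter adapter) {
    SqlTableAdapter previous = adapters.putIfAbsent(adapter.tableType(), adapter);
    if (previous != null) {
      throw new IllegalStateException("Duplicate adapter for type: " + adapter.tableType());
    }
  }

  /** Looks up the adapter for a table type, if its wrapper module is present. */
  Optional<SqlTableAdapter> find(String tableType) {
    return Optional.ofNullable(adapters.get(tableType));
  }

  public static void main(String[] args) {
    AdapterRegistry registry = AdapterRegistry.fromClasspath();
    // Stands in for an adapter that a separate Kafka wrapper module would provide.
    registry.register(() -> "kafka");
    System.out.println("kafka adapter present: " + registry.find("kafka").isPresent());
    System.out.println("hcatalog adapter present: " + registry.find("hcatalog").isPresent());
  }
}
```

With this shape the core module compiles without any IO on its classpath, and each wrapper module can be tested with only its own IO's dependencies. Users who combine several wrappers still have to resolve conflicting transitive dependencies themselves, as noted in the thread.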
