Yeah. Beam SQL basically has a clone of the Beam IO problem. So the question is whether to mirror the IO module structure, or to put all the SQL-specific adapters in one module (or a few) with lots of optional dependencies, and probably a complex build.gradle to support running a variety of integration tests with different things on the classpath.
Kenn

On Mon, Feb 18, 2019 at 11:33 AM Reuven Lax wrote:

> I don't think this is a SQL-specific problem. Beam (especially when using
> a variety of different IOs in a single pipeline) aggregates many
> dependencies into one binary, which sometimes creates this sort of pain.
> When users have organizational reasons to pin specific dependencies, then
> things get even worse. I still don't know if there is a perfect solution to
> all of this.
>
> Reuven
>
> On Fri, Feb 15, 2019 at 7:42 PM Kenneth Knowles wrote:
>
>> I'm not totally convinced Beam's dep versions are the issue here. A user
>> may have an organizational requirement of a particular version of, say,
>> Kafka and Hive. So when they depend on Beam they probably pin those
>> versions of Kafka and Hive which they have determined work together, and
>> they hope that the Beam IOs work together.
>>
>> I see this as a choice between two scenarios for users:
>>
>> 1. SQL <------- KafkaTable (@AutoService) ------> KafkaIO ---provided-----> Kafka
>> 2. SQL (includes KafkaTable) ----optional----> KafkaIO -----provided-----> Kafka
>>
>> For users of 1, they depend on Beam Java, Beam SQL, SQL Kafka Table, and
>> pin a version of Kafka.
>> For users of 2, they depend on Beam Java, Beam SQL, KafkaIO, and pin a
>> version of Kafka.
>>
>> To be honest it is really hard to see which is preferable. I think number
>> 1 has fewer funky dependency edges and more simple "compile + runtime"
>> dependencies.
>>
>> Kenn
>>
>> On Fri, Feb 15, 2019 at 6:06 PM Chamikara Jayalath wrote:
>>
>>> I think the underlying problem is two modules of Beam transitively
>>> depending on conflicting dependencies (a.k.a. the diamond dependency
>>> problem)?
>>>
>>> I think the general solution for this is twofold (at least the way we
>>> have formulated it in https://beam.apache.org/contribute/dependencies/):
>>>
>>> (1) Keep Beam dependencies as much as possible hoping that transitive
>>> dependencies stay compatible (we rely on semantic versioning here to not
>>> cause problems for differences in minor/patch versions. Might not be the
>>> case in practice for some dependencies).
>>> (2) For modules with outdated dependencies that we cannot upgrade due to
>>> some reason, we'll vendor those modules.
>>>
>>> Not sure if your specific problem needs something more.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Fri, Feb 15, 2019 at 4:48 PM Anton Kedin wrote:
>>>
>>>> Hi dev@,
>>>>
>>>> I have a problem: I don't know a good way to approach the dependency
>>>> management between Beam SQL and Beam IOs, and want to collect thoughts
>>>> about it.
>>>>
>>>> Beam SQL depends on specific IOs so that users can query them. The IOs
>>>> need their dependencies to work. Sometimes the IOs also leak their
>>>> transitive dependencies (e.g. HCatRecord leaked from HCatalogIO). So if in
>>>> SQL we want to build abstractions on top of these IOs we risk having to
>>>> bundle the whole IOs or the leaked dependencies. Overall we can probably
>>>> avoid it by making the IOs `provided` dependencies, and by refactoring the
>>>> code that leaks. In this case things can be made to build, simple tests
>>>> will run, and we won't need to bundle the IOs within SQL.
>>>>
>>>> But as soon as there's a need to actually work with multiple IOs at the
>>>> same time the conflicts appear. For example, for testing of Hive/HCatalog
>>>> IOs in SQL we need to create an embedded Hive Metastore instance. It is a
>>>> very Hive-specific thing that requires its own dependencies that have to be
>>>> loaded during testing as part of the SQL project. And some other IOs (e.g.
>>>> KafkaIO) can bring similar but conflicting dependencies, which means that we
>>>> cannot easily work with or test both IOs at the same time within SQL. I
>>>> think it will become insane as the number of IOs supported in SQL grows.
>>>>
>>>> So the question is how to avoid conflicts between IOs within SQL?
>>>>
>>>> One approach is to create separate packages for each of the
>>>> SQL-specific IO wrappers, e.g. `beam-sdks-java-extensions-sql-hcatalog`,
>>>> `beam-sdks-java-extensions-sql-kafka`, etc. These projects will
>>>> compile-depend on Beam SQL and on the specific IO. Beam SQL will load
>>>> these either from user-specified configuration or something like
>>>> @AutoService at runtime. This way Beam SQL doesn't know about the details
>>>> of the IOs and their dependencies, and they can be easily tested in
>>>> isolation without conflicting with each other. This should also be
>>>> relatively simple to manage if things change; the build logic should be
>>>> straightforward and easy to update. On the negative side, each of the
>>>> projects will require its own separate build logic, it will not be easy to
>>>> test multiple IOs together within SQL, and users will have to manage the
>>>> conflicting dependencies by themselves.
>>>>
>>>> Another approach is to keep things roughly as they are but create
>>>> separate configurations within the main `build.gradle` in the SQL project,
>>>> where configurations will correspond to separate IOs or use cases (e.g.
>>>> testing of Hive-related IOs). The benefit is that everything related to SQL
>>>> IOs stays roughly in one place (including build logic) and can be built and
>>>> tested together when possible. The negative side is that it will probably
>>>> involve some Groovy magic and classpath manipulation within Gradle tasks to
>>>> make the configurations work, plus it may be brittle if we change our
>>>> top-level Beam build logic. And this approach also doesn't make it easier
>>>> for the users to manage the conflicts.
>>>>
>>>> Longer term we could probably also reduce the abstraction thickness on
>>>> top of the IOs, so that Beam SQL can work directly with IOs. For this to
>>>> work the supported IOs will need to expose things like `readRows()` and
>>>> get/set the schema on the PCollection. This is probably aligned with the
>>>> Schema work that's happening at the moment but I don't know whether it
>>>> makes sense to focus on this right now. The problem of the dependencies is
>>>> not solved here either, but I think it will be at least the same problem
>>>> that users already have if they see conflicts when using multiple IOs with
>>>> Beam pipelines.
>>>>
>>>> Thoughts, ideas? Did anyone ever face a problem like this or am I
>>>> completely misunderstanding something in Beam build logic?
>>>>
>>>> Regards,
>>>> Anton
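
For readers of this thread, here is a rough sketch of the runtime-discovery idea behind the first approach (one adapter module per IO, found via something like @AutoService). It is illustrative only: `SqlTableAdapter` and `AdapterRegistry` are hypothetical names invented for this sketch, not Beam SQL's actual interfaces. The only real API used is `java.util.ServiceLoader`, which reads the `META-INF/services` entries that `@AutoService` generates at compile time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;
import java.util.Set;

/** Hypothetical adapter contract. Each per-IO module would ship one implementation. */
interface SqlTableAdapter {
    /** Table type this adapter handles, e.g. "kafka" or "hcatalog". */
    String tableType();
}

/** Hypothetical registry living in SQL core. It has no compile-time dependency on any IO. */
final class AdapterRegistry {
    private final Map<String, SqlTableAdapter> byType = new HashMap<>();

    AdapterRegistry(Iterable<SqlTableAdapter> adapters) {
        for (SqlTableAdapter adapter : adapters) {
            // Reject two adapters claiming the same table type instead of silently picking one.
            SqlTableAdapter previous = byType.putIfAbsent(adapter.tableType(), adapter);
            if (previous != null) {
                throw new IllegalStateException(
                    "Duplicate adapter for table type: " + adapter.tableType());
            }
        }
    }

    /** Discovers whatever adapter modules the user put on the classpath. */
    static AdapterRegistry fromClasspath() {
        return new AdapterRegistry(ServiceLoader.load(SqlTableAdapter.class));
    }

    SqlTableAdapter lookup(String tableType) {
        SqlTableAdapter adapter = byType.get(tableType);
        if (adapter == null) {
            throw new IllegalArgumentException(
                "No adapter on the classpath for table type: " + tableType);
        }
        return adapter;
    }

    Set<String> types() {
        return byType.keySet();
    }
}
```

Under this assumption, a user who only queries Kafka would add only the Kafka adapter module, and SQL core would never see Hive's dependencies. It does not help a user who needs both adapters in one pipeline; that case still hits the conflicts described in the thread.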
