I don't think this is a SQL-specific problem. Beam (especially when using a variety of different IOs in a single pipeline) aggregates many dependencies into one binary, which sometimes creates this sort of pain. When users have organizational reasons to pin specific dependencies, then things get even worse. I still don't know if there is a perfect solution to all of this.
Reuven

On Fri, Feb 15, 2019 at 7:42 PM Kenneth Knowles <[email protected]> wrote:

> I'm not totally convinced Beam's dep versions are the issue here. A user may have an organizational requirement of a particular version of, say, Kafka and Hive. So when they depend on Beam they probably pin those versions of Kafka and Hive which they have determined work together, and they hope that the Beam IOs work together.
>
> I see this as a choice between two scenarios for users:
>
> 1. SQL <------- KafkaTable (@AutoService) ------> KafkaIO ---provided-----> Kafka
> 2. SQL (includes KafkaTable) ----optional----> KafkaIO -----provided-----> Kafka
>
> For users of 1, they depend on Beam Java, Beam SQL, SQL Kafka Table, and pin a version of Kafka.
> For users of 2, they depend on Beam Java, Beam SQL, KafkaIO, and pin a version of Kafka.
>
> To be honest it is really hard to see which is preferable. I think number 1 has fewer funky dependency edges, more simple "compile + runtime" dependencies.
>
> Kenn
>
> On Fri, Feb 15, 2019 at 6:06 PM Chamikara Jayalath <[email protected]> wrote:
>
>> I think the underlying problem is two modules of Beam transitively depending on conflicting dependencies (a.k.a. the diamond dependency problem)?
>>
>> I think the general solution for this is twofold (at least the way we have formulated it in https://beam.apache.org/contribute/dependencies/):
>>
>> (1) Keep Beam dependencies as much as possible hoping that transitive dependencies stay compatible (we rely on semantic versioning here to not cause problems for differences in minor/patch versions. Might not be the case in practice for some dependencies).
>> (2) For modules with outdated dependencies that we cannot upgrade due to some reason, we'll vendor those modules.
>>
>> Not sure if your specific problem needs something more.
>>
>> Thanks,
>> Cham
>>
>> On Fri, Feb 15, 2019 at 4:48 PM Anton Kedin <[email protected]> wrote:
>>
>>> Hi dev@,
>>>
>>> I have a problem: I don't know a good way to approach the dependency management between Beam SQL and Beam IOs, and want to collect thoughts about it.
>>>
>>> Beam SQL depends on specific IOs so that users can query them. The IOs need their dependencies to work. Sometimes the IOs also leak their transitive dependencies (e.g. HCatRecord leaked from HCatalogIO). So if in SQL we want to build abstractions on top of these IOs we risk having to bundle the whole IOs or the leaked dependencies. Overall we can probably avoid it by making the IOs `provided` dependencies, and by refactoring the code that leaks. In this case things can be made to build, simple tests will run, and we won't need to bundle the IOs within SQL.
>>>
>>> But as soon as there's a need to actually work with multiple IOs at the same time the conflicts appear. For example, for testing of Hive/HCatalog IOs in SQL we need to create an embedded Hive Metastore instance. It is a very Hive-specific thing that requires its own dependencies that have to be loaded during testing as part of the SQL project. And some other IOs (e.g. KafkaIO) can bring similar but conflicting dependencies, which means that we cannot easily work with or test both IOs at the same time within SQL. I think it will become insane as the number of IOs supported in SQL grows.
>>>
>>> So the question is how to avoid conflicts between IOs within SQL?
>>>
>>> One approach is to create separate packages for each of the SQL-specific IO wrappers, e.g. `beam-sdks-java-extensions-sql-hcatalog`, `beam-sdks-java-extensions-sql-kafka`, etc. These projects will compile-depend on Beam SQL and on a specific IO. Beam SQL will load these either from user-specified configuration or something like @AutoService at runtime.
>>> This way Beam SQL doesn't know about the details of the IOs and their dependencies, and they can be easily tested in isolation without conflicting with each other. This should also be relatively simple to manage if things change; the build logic should be straightforward and easy to update. On the negative side, each of the projects will require its own separate build logic, it will not be easy to test multiple IOs together within SQL, and users will have to manage the conflicting dependencies by themselves.
>>>
>>> Another approach is to keep things roughly as they are but create separate configurations within the main `build.gradle` in the SQL project, where configurations will correspond to separate IOs or use cases (e.g. testing of Hive-related IOs). The benefit is that everything related to SQL IOs stays roughly in one place (including build logic) and can be built and tested together when possible. The negative side is that it will probably involve some Groovy magic and classpath manipulation within Gradle tasks to make the configurations work, plus it may be brittle if we change our top-level Beam build logic. And this approach also doesn't make it easier for the users to manage the conflicts.
>>>
>>> Longer term we could probably also reduce the abstraction thickness on top of the IOs, so that Beam SQL can work directly with IOs. For this to work the supported IOs will need to expose things like `readRows()` and get/set the schema on the PCollection. This is probably aligned with the Schema work that's happening at the moment but I don't know whether it makes sense to focus on this right now. The problem of the dependencies is not solved here either, but I think it will be at least the same problem as the users already have if they see conflicts when using multiple IOs with Beam pipelines.
>>>
>>> Thoughts, ideas?
>>> Did anyone ever face a problem like this or am I completely misunderstanding something in Beam build logic?
>>>
>>> Regards,
>>> Anton
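
For reference, the runtime-loading idea in Anton's first approach (per-IO wrapper modules that Beam SQL discovers via something like @AutoService) can be sketched with plain `java.util.ServiceLoader`, which reads the `META-INF/services` entries that AutoService generates. This is only an illustration under assumptions: `SqlTableProvider`, `tableType()`, and `TableProviderRegistry` are hypothetical names invented here, not Beam SQL's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.ServiceLoader;
import java.util.TreeMap;

/** Hypothetical registry living in the core SQL module; it has no compile-time dependency on any IO. */
final class TableProviderRegistry {

  /**
   * Hypothetical service interface (an assumption, not a Beam type). Each per-IO wrapper
   * module would implement it and register the implementation in META-INF/services,
   * for example by annotating the class with @AutoService.
   */
  public interface SqlTableProvider {
    /** Table type this provider handles, e.g. "kafka" or "hcatalog". */
    String tableType();
  }

  // Sorted map so that the list of discovered types is deterministic.
  private final Map<String, SqlTableProvider> providers = new TreeMap<>();

  /** Builds a registry from any source of providers; fails fast on duplicate table types. */
  TableProviderRegistry(Iterable<? extends SqlTableProvider> discovered) {
    for (SqlTableProvider provider : discovered) {
      SqlTableProvider previous = providers.putIfAbsent(provider.tableType(), provider);
      if (previous != null) {
        throw new IllegalStateException(
            "Duplicate provider for table type: " + provider.tableType());
      }
    }
  }

  /** Discovers whichever wrapper modules happen to be on the runtime classpath. */
  static TableProviderRegistry fromClasspath() {
    return new TableProviderRegistry(ServiceLoader.load(SqlTableProvider.class));
  }

  boolean supports(String tableType) {
    return providers.containsKey(tableType);
  }

  List<String> tableTypes() {
    return new ArrayList<>(providers.keySet());
  }

  public static void main(String[] args) {
    // The result depends entirely on which wrapper jars the user put on the classpath.
    System.out.println("Discovered table types: " + fromClasspath().tableTypes());
  }
}
```

With this shape the core module sees only the interface, so each wrapper can be built and tested against its own IO's dependency set. It does not resolve conflicts when a user puts two wrappers with incompatible transitive dependencies on the same classpath, which is the trade-off already noted in the thread.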
