Hi dev@,

I don't know a good way to approach dependency management between Beam
SQL and Beam IOs, and I want to collect thoughts about it.

Beam SQL depends on specific IOs so that users can query them. The IOs need
their dependencies to work. Sometimes the IOs also leak their transitive
dependencies (e.g. HCatRecord leaked from HCatalogIO). So if in SQL we want
to build abstractions on top of these IOs we risk having to bundle the
whole IOs or the leaked dependencies. Overall we can probably avoid it by
making the IOs `provided` dependencies, and by refactoring the code that
leaks. In this case things can be made to build, simple tests will run, and
we won't need to bundle the IOs within SQL.

But as soon as there's a need to actually work with multiple IOs at the
same time the conflicts appear. For example, for testing of Hive/HCatalog
IOs in SQL we need to create an embedded Hive Metastore instance. It is a
very Hive-specific thing that requires its own dependencies that have to be
loaded during testing as part of the SQL project. Some other IOs (e.g.
KafkaIO) can bring similar but conflicting dependencies, which means that we
cannot easily work with or test both IOs at the same time within SQL. I
think it will become unmanageable as the number of IOs supported in SQL grows.

So the question is how to avoid conflicts between IOs within SQL?

One approach is to create separate packages for each of the SQL-specific IO
wrappers, e.g. `beam-sdks-java-extensions-sql-hcatalog`,
`beam-sdks-java-extensions-sql-kafka`,
etc. These projects will compile-depend on Beam SQL and on the specific IO.
Beam SQL will load these either from user-specified configuration or
something like @AutoService at runtime. This way Beam SQL doesn't know
about the details of the IOs and their dependencies, and they can be easily
tested in isolation without conflicting with each other. This should also
be relatively simple to manage if things change, since the build logic should
be straightforward and easy to update. On the negative side, each of the
projects will require its own separate build logic, it will not be easy to
test multiple IOs together within SQL, and users will have to manage the
conflicting dependencies by themselves.
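
To make the runtime loading part more concrete, here is a minimal sketch of
how the discovery could look with plain `java.util.ServiceLoader` (which is
what @AutoService generates registrations for). `TableProviderRegistry`,
`TableProvider` and `tableType()` are hypothetical names made up for this
illustration, not existing Beam SQL APIs:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;
import java.util.Set;

/** Hypothetical registry that lives in the core SQL module. */
public class TableProviderRegistry {

  /** Hypothetical SPI; each per-IO module would ship one implementation. */
  public interface TableProvider {
    /** Table type this provider handles, e.g. "kafka" or "hcatalog". */
    String tableType();
  }

  private final Map<String, TableProvider> providers = new HashMap<>();

  /** Indexes providers by table type and rejects duplicate registrations. */
  public TableProviderRegistry(Iterable<TableProvider> discovered) {
    for (TableProvider provider : discovered) {
      TableProvider previous = providers.putIfAbsent(provider.tableType(), provider);
      if (previous != null) {
        throw new IllegalStateException(
            "Duplicate provider for table type: " + provider.tableType());
      }
    }
  }

  /** Discovers whatever provider implementations are on the classpath. */
  public static TableProviderRegistry fromClasspath() {
    return new TableProviderRegistry(ServiceLoader.load(TableProvider.class));
  }

  /** Returns the provider for the given table type or fails loudly. */
  public TableProvider get(String tableType) {
    TableProvider provider = providers.get(tableType);
    if (provider == null) {
      throw new IllegalArgumentException(
          "No provider on the classpath for table type: " + tableType);
    }
    return provider;
  }

  /** Table types for which a provider was found. */
  public Set<String> supportedTypes() {
    return providers.keySet();
  }
}
```

With something like this the core SQL module compiles against the interface
only. An IO and its dependencies end up on the classpath only when the user
adds the corresponding wrapper artifact.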

Another approach is to keep things roughly as they are but create separate
configurations within the main `build.gradle` in the SQL project, where
configurations will correspond to separate IOs or use cases (e.g. testing
of Hive-related IOs). The benefit is that everything related to SQL IOs
stays roughly in one place (including build logic) and can be built and
tested together when possible. The negative side is that it will probably
involve some groovy magic and classpath manipulation within Gradle tasks to
make the configurations work, plus it may be brittle if we change our
top-level Beam build logic. And this approach also doesn't make it easier
for the users to manage the conflicts.

Longer term we could probably also reduce the abstraction thickness on top
of the IOs, so that Beam SQL can work directly with IOs. For this to work
the supported IOs will need to expose things like `readRows()` and get/set
the schema on the PCollection. This is probably aligned with the Schema
work that's happening at the moment but I don't know whether it makes sense
to focus on this right now. This does not solve the dependency problem
either, but I think it would at least be the same problem that users
already have when they see conflicts using multiple IOs in Beam
pipelines.

Thoughts, ideas? Did anyone ever face a problem like this or am I
completely misunderstanding something in the Beam build logic?

Regards,
Anton
