Hello, everyone!

While working on HUDI-3549 <https://issues.apache.org/jira/browse/HUDI-3549>,
we were surprised to discover that Hudi actually bundles the "spark-avro"
dependency *by default*.
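
For reference, a quick way to confirm what a bundle actually ships (the jar
name below is illustrative and depends on your build; adjust the grep
pattern if the classes get relocated during shading):

  # List the spark-avro classes packed into the bundle jar
  jar tf hudi-spark-bundle_2.12-0.11.0-SNAPSHOT.jar | grep 'org/apache/spark/sql/avro'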

This is problematic because "spark-avro" is tightly coupled with some of
the other Spark components that make up Spark's core distribution (i.e.
components packaged in Spark itself rather than shipped as external
packages); one example is "spark-sql".

With regard to HUDI-3549
<https://issues.apache.org/jira/browse/HUDI-3549> itself, the problem there
unfolded as follows:

   1. We built "hudi-spark-bundle", which got "spark-avro" 3.2.1 bundled
   along with it (see the dependency check after this list for one way to
   verify that)
   2. @Sivabalan tried to use this Hudi bundle w/ Spark 3.2.0
   3. It failed because "spark-avro" 3.2.1 is *not compatible* w/
   "spark-sql" 3.2.0 (due to https://github.com/apache/spark/pull/34978,
   which fixed a typo and renamed internal API methods of DataSourceUtils)
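
For anyone reproducing the diagnosis, here's a sketch of how to check which
spark-avro version the bundle resolves (the module path below is an
assumption based on the current repo layout):

  # Show the spark-avro version pulled in by the bundle module
  mvn dependency:tree -pl packaging/hudi-spark-bundle \
    -Dincludes=org.apache.spark:spark-avro_2.12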


To avoid such problems going forward, our proposal is to:

   1. *Unbundle* "spark-avro" from Hudi bundles by default (practically,
   this means that Hudi users would now need to specify spark-avro via the
   `--packages` flag, since it's not part of Spark's core distribution; see
   the example command after this list)
   2. (Optional) If the community still sees value in bundling (and
   shading) "spark-avro" in some cases, we can add a Maven profile that
   allows doing that *ad hoc* (a sketch of such a profile is shown below
   as well).
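
For item 1, usage after unbundling would look roughly like this (versions
are illustrative; the spark-avro version has to match the Spark version of
the runtime):

  # spark-avro is now supplied explicitly, pinned to the runtime Spark version
  spark-shell \
    --packages org.apache.spark:spark-avro_2.12:3.2.0 \
    --jars hudi-spark-bundle_2.12-<version>.jar

For item 2, here's a minimal sketch of such an opt-in profile (the profile
id and property names are hypothetical; the actual shading configuration
would live in the bundle's maven-shade-plugin setup):

  <!-- Hypothetical opt-in profile: activating it switches spark-avro to
       compile scope so the shade plugin includes it in the bundle -->
  <profile>
    <id>bundle-spark-avro</id>
    <dependencies>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>compile</scope>
      </dependency>
    </dependencies>
  </profile>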

We've put up PR #4955 <https://github.com/apache/hudi/pull/4955> with the
proposed changes.

Looking forward to your feedback.
