Thanks for raising the discussion.  I agree that, from a usability
standpoint on the user side, we should keep the same expectations in this
release: "--packages" for Spark, and reliance on the bundled spark-avro for
the utilities bundle.
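
For concreteness, a minimal sketch of what that expectation looks like
today (the artifact coordinates and versions below are illustrative and may
need adjusting for the actual release):

  # Spark quickstart path: spark-avro is supplied explicitly via --packages
  spark-shell \
    --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.2.1 \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

  # Utilities bundle path: spark-avro comes pre-bundled, nothing extra to pass
  spark-submit \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
    hudi-utilities-bundle_2.12-0.10.1.jar <usual deltastreamer args>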

Given that there are Spark API changes between 3.2.0 and 3.2.1, should we
also add Spark profiles for patch versions besides the latest, e.g., 3.2.0?
If users have Spark 3.2.0 in their environment, they would have to upgrade
both Hudi and Spark in order to pick up the new Hudi release.  Do we know if
this is a major use case?

Best,
- Ethan

On Tue, Mar 8, 2022 at 6:15 PM Vinoth Chandar <vin...@apache.org> wrote:

> Thanks Alexey.
>
> This has actually been the case for a while now, I think. From what I can
> see, our quickstart for Spark still suggests passing spark-avro in via
> --packages, while the examples for the utilities bundle rely on the fact
> that it is pre-bundled.
>
> I do acknowledge that with recent Spark 3.x versions, breakages have become
> much more frequent, amplifying this pain. However, to prevent jobs from
> failing upon upgrade (i.e. forcing everyone to redeploy streaming + batch
> jobs with the --packages flag), I would prefer that we keep the same
> bundling behavior, with the following simplifications.
>
> 1. We have three Spark profiles now - spark2, spark3.1.x, and spark3
> (3.2.1). We continue to bundle spark-avro and support the latest Spark
> minor version (example build commands below).
> 2. We retain this behavior and make the docs clearer about how users can
> "optionally" unbundle and deploy for other versions.
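>
> For reference, a rough sketch of selecting the bundle to build per Spark
> line (the exact profile ids, and whether they are activated via -P or a -D
> property, should be double-checked against the current poms):
>
>   # build against Spark 2.4.x
>   mvn clean package -DskipTests -Pspark2
>   # build against Spark 3.1.x
>   mvn clean package -DskipTests -Pspark3.1.x
>   # build against the latest Spark 3.2.x line (the spark3 profile)
>   mvn clean package -DskipTests -Pspark3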
>
> Given the other large features going out and turned on by default in this
> release, I am not sure it's a good idea to introduce a breaking change
> like this.
>
> Thanks
> Vinoth
>
> On Tue, Mar 8, 2022 at 1:32 PM Alexey Kudinkin <ale...@onehouse.ai> wrote:
>
> > Hello, everyone!
> >
> > While working on HUDI-3549 <
> > https://issues.apache.org/jira/browse/HUDI-3549>,
> > we were surprised to discover that Hudi actually bundles the "spark-avro"
> > dependency *by default*.
> >
> > This is problematic b/c "spark-avro" is tightly coupled with some of the
> > other Spark components making up Spark's core distribution (i.e.
> > components packaged in Spark itself rather than as external packages;
> > one example of that is "spark-sql").
> >
> > In regard to HUDI-3549
> > <https://issues.apache.org/jira/browse/HUDI-3549> itself,
> > the problem there unfolded as follows:
> >
> >    1. We built "hudi-spark-bundle", which got "spark-avro" 3.2.1 bundled
> >    along with it
> >    2. @Sivabalan tried to use this Hudi bundle w/ Spark 3.2.0
> >    3. It failed b/c "spark-avro" 3.2.1 is *not compatible* w/ "spark-sql"
> >    3.2.0 (b/c of https://github.com/apache/spark/pull/34978, which fixed a
> >    typo and renamed internal API methods in DataSourceUtils)
> >
> >
> > To avoid these problems going forward, our proposal is to
> >
> >    1. *Unbundle* "spark-avro" from Hudi bundles by default (practically
> >    this means that Hudi users would now need to specify spark-avro via the
> >    `--packages` flag, since it's not part of Spark's core distribution)
> >    2. (Optional) If the community still sees value in bundling (and
> >    shading) "spark-avro" in some cases, we can add a Maven profile that
> >    allows doing that *ad hoc* (a rough sketch of this follows below).
> >
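> > As a rough sketch of what (2) might look like on the command line (the
> > profile name here is purely hypothetical and would be settled in the PR):
> >
> >   # default build: spark-avro is not bundled; users pass it via --packages
> >   mvn clean package -DskipTests
> >   # opt-in build that bundles (and shades) spark-avro into the Hudi bundles
> >   mvn clean package -DskipTests -Pbundle-spark-avro
> >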
> > We've put up PR #4955 <https://github.com/apache/hudi/pull/4955> with the
> > proposed changes.
> >
> > Looking forward to your feedback.
> >
>
