I also second that we should keep the same behavior for existing users. But
I would like some clarity on the path towards unbundling spark-avro. Or are
we always going to publish only bundled artifacts (hudi spark bundle with
spark-avro) in Maven, and ask devs who want the unbundled version to build
Hudi on their own? I don't think many would ever go that route; most will
stick to the officially released artifacts. So, if we plan to eventually
deprecate/stop bundling spark-avro, maybe we need to think this through.
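
To make the unbundled route concrete: the user would have to pass a
spark-avro build that matches their exact Spark patch version via
--packages. Below is a minimal sketch of deriving that coordinate. The
helper names are hypothetical, and the Scala binary version default of 2.12
is an assumption to adjust for your environment.

```python
def spark_avro_coordinate(spark_version: str,
                          scala_binary_version: str = "2.12") -> str:
    """Build the Maven coordinate of the spark-avro artifact for a Spark version.

    Hypothetical helper; the Scala default (2.12) is an assumption.
    """
    parts = spark_version.split(".")
    # Require major.minor.patch: 3.2.0 vs 3.2.1 is exactly the distinction
    # that matters in this thread.
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(
            f"expected a full major.minor.patch version, got {spark_version!r}"
        )
    return f"org.apache.spark:spark-avro_{scala_binary_version}:{spark_version}"


def packages_flag(spark_version: str,
                  scala_binary_version: str = "2.12") -> str:
    """Render the --packages argument for the matching spark-avro artifact."""
    return "--packages " + spark_avro_coordinate(spark_version,
                                                 scala_binary_version)


if __name__ == "__main__":
    print(packages_flag("3.2.0"))
```

The point of keying on the full patch version is that a Spark 3.2.0
deployment would get spark-avro 3.2.0 rather than whatever version a
pre-built bundle happened to include.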


On Tue, 8 Mar 2022 at 19:20, Y Ethan Guo <ethan.guoyi...@gmail.com> wrote:

> Thanks for raising the discussion.  I agree that, from a usability
> standpoint on the user side, we should keep the same expectations
> regarding "--packages" for Spark and reliance on the bundled spark-avro
> for the utilities bundle in this release.
>
> Given that there are Spark API changes between 3.2.0 and 3.2.1, do we also
> add Spark profiles for patch versions other than the latest, e.g. 3.2.0?
> Otherwise, a user with Spark 3.2.0 in their environment has to upgrade
> both Hudi and Spark if they want to move to the new Hudi release.  Do we
> know if this is a major use case?
>
> Best,
> - Ethan
>
> On Tue, Mar 8, 2022 at 6:15 PM Vinoth Chandar <vin...@apache.org> wrote:
>
> > Thanks Alexey.
> >
> > This has actually been the case for a while now, I think. From what I can
> > see, our quickstart for Spark still suggests passing spark-avro in via
> > --packages, but the utilities bundle examples rely on the fact that it is
> > pre-bundled.
> >
> > I do acknowledge that with recent Spark 3.x versions, breakages have
> > become much more frequent, amplifying this pain. However, to prevent jobs
> > from failing upon upgrade (i.e., forcing everyone to redeploy streaming +
> > batch jobs with the --packages flag), I would prefer that we keep the
> > same bundling behavior, with the following simplifications.
> >
> > 1. We have three spark profiles now - spark2, spark3.1.x, and spark3
> > (3.2.1). We continue to bundle spark-avro and support the latest spark
> > minor version
> > 2. We retain and make the docs clearer about how users can "optionally"
> > unbundle and deploy for other versions.
> >
> > Given the other large features going out turned on by default in this
> > release, I'm not sure it's a good idea to introduce a breaking change like
> > this.
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Mar 8, 2022 at 1:32 PM Alexey Kudinkin <ale...@onehouse.ai> wrote:
> >
> > > Hello, everyone!
> > >
> > > While working on HUDI-3549
> > > <https://issues.apache.org/jira/browse/HUDI-3549>, we were surprised to
> > > discover that Hudi actually bundles the "spark-avro" dependency *by
> > > default*.
> > >
> > > This is problematic b/c "spark-avro" is tightly coupled with some of
> > > the other Spark components that make up its core distribution (i.e.,
> > > components packaged in Spark itself rather than as external packages;
> > > one example is "spark-sql").
> > >
> > > With regard to HUDI-3549
> > > <https://issues.apache.org/jira/browse/HUDI-3549> itself,
> > > the problem unfolded as follows:
> > >
> > >    1. We built "hudi-spark-bundle", which got "spark-avro" 3.2.1 bundled
> > >    along with it
> > >    2. @Sivabalan tried to use this Hudi bundle w/ Spark 3.2.0
> > >    3. It failed b/c "spark-avro" 3.2.1 is *not compatible* w/ "spark-sql"
> > >    3.2.0 (b/c of https://github.com/apache/spark/pull/34978, which fixed
> > >    a typo and renamed internal API methods in DataSourceUtils)
> > >
> > >
> > > To avoid such problems going forward, our proposal is to:
> > >
> > >    1. *Unbundle* "spark-avro" from Hudi bundles by default (practically
> > >    this means that Hudi users would now need to specify spark-avro via
> > >    the `--packages` flag, since it's not part of Spark's core
> > >    distribution)
> > >    2. (Optional) If the community still sees value in bundling (and
> > >    shading) "spark-avro" in some cases, we can add a Maven profile that
> > >    would allow doing that *ad hoc*.
> > >
> > > We've put up PR#4955 <https://github.com/apache/hudi/pull/4955> with
> > > the proposed changes.
> > >
> > > Looking forward to your feedback.
> > >
> >
>


-- 
Regards,
-Sivabalan
