First of all: no apologies, no feeling bad. We are all having fun here. :)
I think we are all on the same page on the tradeoffs here. Let's see if we can
decide one way or the other. Bundling spark-avro gives a better user
experience: one less package to remember to add. But even with the valid
points raised by Udit and hmatu, I am still worried about pieces of spark-avro
that may not be compatible with the Spark version on the cluster. Can someone
analyze how tightly spark-avro is coupled with the rest of Spark? For example,
what if Spark 3.x uses a different Avro version than Spark 2.4.4, and the
bundled spark-avro:2.4.4 does not work with that Avro version when
hudi-spark-bundle is used on a Spark 3.x cluster?

If someone can provide data points on the above (a spark-shell snippet for
gathering them is sketched at the bottom of this mail, below the quoted
thread), and we can convince ourselves that we can bundle a different
spark-avro version (even spark-avro:3.x on a Spark 2.x cluster), then I am
happy to reverse my position. Otherwise, if we might face a barrage of support
issues with NoClassDefFoundError/NoSuchMethodError etc., it's not worth it
IMO.

TBH, longer term I am looking into whether we can eliminate the Row -> Avro
conversion that we need spark-avro for in the first place (also sketched
below), but let's ignore that for the purposes of this discussion.

Thanks
Vinoth

On Wed, Feb 5, 2020 at 10:54 PM hmatu <[email protected]> wrote:

> Thanks for raising this! +1 to @Udit Mehrotra's point.
>
> It's right to recommend that users actually build their own Hudi jars with
> the Spark version they use. That avoids compatibility issues between users'
> local jars and the pre-built Hudi Spark version (2.4.4).
>
> Or can we remove "org.apache.spark:spark-avro_2.11:2.4.4"? The user's local
> env will already contain that external dependency if they use Avro.
>
> If not, running Hudi (release-0.5.1) is more complex for me; when using
> Delta Lake, it is simpler:
> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
>
>
> ------------------ Original ------------------
> From: "lamberken" <[email protected]>;
> Date: Thu, Feb 6, 2020 07:42 AM
> To: "dev" <[email protected]>;
> Subject: Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin
>
>
> Dear team,
>
> About this topic, there are some previous discussions in PR [1]. It's
> better to read them carefully before chiming in, thanks.
>
> Current state:
> Lamber-Ken: +1
> Udit Mehrotra: +1
> Bhavani Sudha: -1
> Vinoth Chandar: -1
>
> Thanks,
> Lamber-Ken
>
>
> At 2020-02-06 06:10:52, "lamberken" <[email protected]> wrote:
> >
> > Dear team,
> >
> > With the 0.5.1 version released, users need to add
> > `org.apache.spark:spark-avro_2.11:2.4.4` when starting the Hudi shell,
> > like below:
> > /---------------------------------------------------------------------/
> > spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
> >   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
> >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> > /---------------------------------------------------------------------/
> >
> > From the spark-avro guide [1], we know that the spark-avro module is
> > external; it is not included in spark-2.4.4-bin-hadoop2.7.tgz [2].
> > So it may be better to relocate the spark-avro dependency using the
> > maven-shade-plugin. If so, users can start Hudi the way the 0.5.0
> > version does:
> > /---------------------------------------------------------------------/
> > spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
> >   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
> >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> > /---------------------------------------------------------------------/
> >
> > I created a PR to fix this [3]; we may need to have more discussion about
> > it, any suggestion is welcome, thanks very much :)
> >
> > Current state:
> > @bhasudha : +1
> > @vinoth : -1
> >
> > [1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
> > [2] http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> > [3] https://github.com/apache/incubator-hudi/pull/1290
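
For the "data points" question above, here is a rough spark-shell sketch
(untested; output labels are illustrative) that reports which Avro version a
given Spark distribution actually puts on the classpath and which jar the
spark-avro classes are loaded from. Running it once on a Spark 2.4.4 cluster
and once on a Spark 3.x cluster would show whether the Avro versions diverge:

/---------------------------------------------------------------------------/
// Paste into spark-shell on the target cluster (with the hudi bundle and/or
// spark-avro on the classpath, e.g. via --packages).

// 1. Avro version shipped with this Spark distribution. The manifest entry
//    can be null for some builds, so also print the jar location.
val avroClass = classOf[org.apache.avro.Schema]
println("avro version: " + avroClass.getPackage.getImplementationVersion)
println("avro jar:     " + avroClass.getProtectionDomain.getCodeSource.getLocation)

// 2. Where the spark-avro classes come from (the hudi bundle, --packages,
//    or a spark-avro shipped with the distribution, if any).
val sparkAvroClass = Class.forName("org.apache.spark.sql.avro.SchemaConverters")
println("spark-avro jar: " + sparkAvroClass.getProtectionDomain.getCodeSource.getLocation)
/---------------------------------------------------------------------------/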

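And on the longer-term point: the spark-avro dependency is pulled in mainly
for the Row -> Avro conversion mentioned above. The snippet below is a minimal
sketch of that coupling (not Hudi's actual code), assuming the spark-avro
2.4.x SchemaConverters API; a mismatch between the Avro version spark-avro was
built against and the Avro jars on the cluster would typically surface right
here as NoSuchMethodError/NoClassDefFoundError:

/---------------------------------------------------------------------------/
// Run in spark-shell with spark-avro on the classpath.
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

// A Catalyst schema, as it would come from a DataFrame being written.
val structType: StructType = StructType(Seq(
  StructField("key", StringType, nullable = false),
  StructField("ts", LongType, nullable = true)))

// spark-avro builds the Avro schema using whatever org.apache.avro classes
// are on the classpath at runtime; this is the cross-version coupling.
val avroSchema: Schema = SchemaConverters.toAvroType(
  structType, nullable = false, recordName = "record", nameSpace = "hoodie.example")

println(avroSchema.toString(true))
/---------------------------------------------------------------------------/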