Just kicking this thread again, to make forward progress :)

On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar <[email protected]> wrote:
> First of all.. no apologies, no feeling bad. We are all having fun here.. :)
>
> I think we are all on the same page on the tradeoffs here.. let's see if
> we can decide one way or the other.
>
> Bundling spark-avro gives a better user experience: one less package to
> remember adding. But even with the valid points raised by Udit and hmatu,
> I am worried that specific things in spark-avro may not be compatible with
> the Spark version in use. Can someone analyze how coupled spark-avro is
> with the rest of Spark? For example, what if Spark 3.x uses a different
> Avro version than Spark 2.4.4, and when hudi-spark-bundle is used on a
> Spark 3.x cluster, the bundled spark-avro:2.4.4 won't work with that Avro
> version?
>
> If someone can provide data points on the above, and we can convince
> ourselves that we can bundle a different spark-avro version (even
> spark-avro:3.x on a Spark 2.x cluster), then I am happy to reverse my
> position. Otherwise, if we might face a barrage of support issues with
> NoClassDefFoundError / NoSuchMethodError etc., it's not worth it IMO.
>
> TBH, longer term I am looking into whether we can eliminate the Row ->
> Avro conversion that we need spark-avro for. But let's ignore that for
> the purposes of this discussion.
>
> Thanks
> Vinoth
>
> On Wed, Feb 5, 2020 at 10:54 PM hmatu <[email protected]> wrote:
>
>> Thanks for raising this! +1 to @Udit Mehrotra's point.
>>
>> It's right to recommend that users build their own Hudi jars with the
>> Spark version they use. That avoids compatibility issues between users'
>> local jars and the Spark version Hudi was pre-built against (2.4.4).
>>
>> Or can we remove "org.apache.spark:spark-avro_2.11:2.4.4"? Users' local
>> environments will already contain that external dependency if they use
>> Avro.
>>
>> If not, running Hudi (release-0.5.1) is more complex for me. When using
>> Delta Lake, it is simpler:
>> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
>>
>> ------------------ Original ------------------
>> From: "lamberken" <[email protected]>
>> Date: Thu, Feb 6, 2020 07:42 AM
>> To: "dev" <[email protected]>
>> Subject: Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin
>>
>> Dear team,
>>
>> About this topic, there is some previous discussion in PR [1]. It's
>> better to read it carefully before chiming in, thanks.
>>
>> Current state:
>> Lamber-Ken: +1
>> Udit Mehrotra: +1
>> Bhavani Sudha: -1
>> Vinoth Chandar: -1
>>
>> Thanks,
>> Lamber-Ken
>>
>> At 2020-02-06 06:10:52, "lamberken" <[email protected]> wrote:
>> >
>> > Dear team,
>> >
>> > With the 0.5.1 version released, users need to add
>> > `org.apache.spark:spark-avro_2.11:2.4.4` when starting Hudi, like
>> > below:
>> >
>> > /--------------------------------------------------------------/
>> > spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>> >   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
>> >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>> > /--------------------------------------------------------------/
>> >
>> > From the spark-avro guide [1], we know that the spark-avro module is
>> > external; it does not ship in spark-2.4.4-bin-hadoop2.7.tgz [2].
>> > So maybe it's better to relocate the spark-avro dependency using
>> > maven-shade-plugin. If so, users can start Hudi the way the 0.5.0
>> > release does:
>> >
>> > /--------------------------------------------------------------/
>> > spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>> >   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
>> >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>> > /--------------------------------------------------------------/
>> >
>> > I created a PR to fix this [3]; we may need more discussion about it.
>> > Any suggestions are welcome, thanks very much :)
>> >
>> > Current state:
>> > @bhasudha: +1
>> > @vinoth: -1
>> >
>> > [1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
>> > [2] http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
>> > [3] https://github.com/apache/incubator-hudi/pull/1290
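For context on what "relocate spark-avro by maven-shade-plugin" means in
practice, here is a minimal sketch of bundling spark-avro into
hudi-spark-bundle with a class relocation. This illustrates the technique
under discussion, not the exact change in PR 1290; the shaded package
prefix org.apache.hudi.shaded is an assumption chosen for the example.

/--------------------------------------------------------------/
<!-- Sketch only: bundle spark-avro into the hudi-spark-bundle jar and
     relocate its classes. The shaded prefix below is illustrative. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <artifactSet>
          <includes>
            <!-- pull the spark-avro classes into the bundle jar -->
            <include>org.apache.spark:spark-avro_2.11</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <!-- rename the bundled classes so they cannot clash with a
                 spark-avro jar already on the cluster's classpath -->
            <pattern>org.apache.spark.sql.avro</pattern>
            <shadedPattern>org.apache.hudi.shaded.org.apache.spark.sql.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
/--------------------------------------------------------------/

Note that relocation only prevents classpath clashes with a spark-avro jar
already present on the cluster; it does not protect against the bundled
spark-avro calling Spark or Avro APIs that changed across versions, which
is the NoClassDefFoundError / NoSuchMethodError risk raised above.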
