Hi @Vinoth, sorry for the delay; I took some time to make sure the following analysis is correct.
In the Hudi project, the spark-avro module is only used for converting between Spark's struct type and Avro schema, via exactly two methods: `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`. Both live in the `org.apache.spark.sql.avro.SchemaConverters` class.

Analysis:
1. The `SchemaConverters` class is identical in spark master[1] and branch-3.0[2].
2. From the import statements in `SchemaConverters`, we can see it does not depend on any other class in the spark-avro module. I also tried moving it into the Hudi project under a different package, and it compiled fine.

Using the Hudi jar with the spark-avro module shaded:
1. spark-2.4.4-bin-hadoop2.7: everything is OK (create, upsert).
2. spark-3.0.0-preview2-bin-hadoop2.7: everything is OK (create, upsert).

So shading spark-avro is safe and gives a better user experience, and we won't need to shade it once the spark-avro module is no longer external in the Spark project.

Thanks,
Lamber-Ken

[1] https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
[2] https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala

At 2020-02-14 10:30:35, "Vinoth Chandar" <[email protected]> wrote:
>Just kicking this thread again, to make forward progress :)
>
>On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar <[email protected]> wrote:
>
>> First of all.. No apologies, no feeling bad. We are all having fun here..
>> :)
>>
>> I think we are all on the same page on the tradeoffs here.. let's see if
>> we can decide one way or other.
>>
>> Bundling spark-avro has better user experience, one less package to
>> remember adding. But even with the valid points raised by udit and hmatu, I
>> was just worried about specific things in spark-avro that may not be
>> compatible with the spark version.. Can someone analyze how coupled
>> spark-avro is with rest of spark..
>> For e.g, what if the spark 3.x uses a
>> different avro version than spark 2.4.4 and when hudi-spark-bundle is used
>> in a spark 3.x cluster, the spark-avro:2.4.4 won't work with that avro
>> version?
>>
>> If someone can provide data points on the above and if we can convince
>> ourselves that we can bundle a different spark-avro version (even
>> spark-avro:3.x on spark 2.x cluster), then I am happy to reverse my
>> position. Otherwise, if we might face a barrage of support issues with
>> NoClassDefFound /NoSuchMethodError etc, its not worth it IMO ..
>>
>> TBH longer term, I am looking into if we can eliminate need for Row ->
>> Avro conversion that we need spark-avro for. But lets ignore that for
>> purposes of this discussion.
>>
>> Thanks
>> Vinoth
>>
>> On Wed, Feb 5, 2020 at 10:54 PM hmatu <[email protected]> wrote:
>>
>>> Thanks for raising this! +1 to @Udit Mehrotra's point.
>>>
>>> It's right that recommend users to actually build their own hudi jars,
>>> with the spark version they use. It avoid the compatibility issues
>>> between user's local jars and pre-built hudi spark version(2.4.4).
>>>
>>> Or can remove "org.apache.spark:spark-avro_2.11:2.4.4"? Because user
>>> local env will contains that external dependency if they use avro.
>>>
>>> If not, to run hudi(release-0.5.1) is more complex for me, when using
>>> Delta Lake, it's more simpler:
>>> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
>>>
>>> ------------------ Original ------------------
>>> From: "lamberken"<[email protected]>;
>>> Date: Thu, Feb 6, 2020 07:42 AM
>>> To: "dev"<[email protected]>;
>>>
>>> Subject: Re:[DISCUSS] Relocate spark-avro dependency by
>>> maven-shade-plugin
>>>
>>> Dear team,
>>>
>>> About this topic, there are some previous discussions in PR[1]. It's
>>> better to visit it carefully before chiming in, thanks.
>>>
>>> Current State:
>>> Lamber-Ken: +1
>>> Udit Mehrotra: +1
>>> Bhavani Sudha: -1
>>> Vinoth Chandar: -1
>>>
>>> Thanks,
>>> Lamber-Ken
>>>
>>> At 2020-02-06 06:10:52, "lamberken" <[email protected]> wrote:
>>> >
>>> >Dear team,
>>> >
>>> >With the 0.5.1 version released, user need to add
>>> `org.apache.spark:spark-avro_2.11:2.4.4` when starting hudi command, like
>>> bellow
>>> >/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
>>> >spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>>> >  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
>>> >  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>>> >/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
>>> >
>>> >From spark-avro-guide[1], we know that the spark-avro module is
>>> external, it is not exists in spark-2.4.4-bin-hadoop2.7.tgz.
>>> >So may it's better to relocate spark-avro dependency by using
>>> maven-shade-plugin. If so, user will starting hudi like 0.5.0 version does.
>>> >/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
>>> >spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>>> >  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
>>> >  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>>> >/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
>>> >
>>> >I created a pr to fix this[3], we may need have more discussion about
>>> this, any suggestion is welcome, thanks very much :)
>>> >Current state:
>>> >@bhasudha : +1
>>> >@vinoth : -1
>>> >
>>> >[1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
>>> >[2] http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
>>> >[3] https://github.com/apache/incubator-hudi/pull/1290
>>> >
>>
>>
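For reference, the two spark-avro entry points the analysis above hinges on can be exercised roughly like this. This is only a sketch, written against the `SchemaConverters` method signatures as they appear in Spark 2.4's spark-avro module; the field names are made up for illustration:

```scala
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Spark struct type -> Avro schema (what Hudi uses when writing records)
val structType = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))
val avroSchema = SchemaConverters.toAvroType(structType, nullable = false)

// Avro schema -> Spark SQL type (the round trip back)
val sqlType = SchemaConverters.toSqlType(avroSchema).dataType
```

Since neither call touches anything else in spark-avro, relocating just this class (or shading the whole module) should not pull in further Spark internals.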

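The relocation being proposed would look roughly like the fragment below in the hudi-spark-bundle pom. This is a sketch of standard maven-shade-plugin relocation syntax, not a copy of what PR #1290 actually does; the shaded prefix `org.apache.hudi.org.apache.spark.sql.avro` is an assumed naming convention:

```xml
<!-- Sketch only: relocate spark-avro classes into the Hudi bundle so users
     need not pass org.apache.spark:spark-avro on the command line.
     The shadedPattern prefix here is illustrative. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>org.apache.spark.sql.avro</pattern>
        <shadedPattern>org.apache.hudi.org.apache.spark.sql.avro</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

With a relocation like this, the bundled copy of `SchemaConverters` is rewritten to a Hudi-owned package at build time, so it cannot clash with whatever spark-avro version (if any) is on the user's cluster classpath.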