First of all: no apologies, no feeling bad. We are all having fun here. :)
I think we are all on the same page on the tradeoffs here. Let's see if we can
decide one way or the other. Bundling spark-avro gives a better user
experience: one less package to remember to add. But even with the valid
points raised by Udit and hmatu, I am still worried about pieces of spark-avro
that may not be compatible with the Spark version on the cluster. Can someone
analyze how tightly spark-avro is coupled with the rest of Spark? For example,
what if Spark 3.x uses a different Avro version than Spark 2.4.4, and the
bundled spark-avro:2.4.4 does not work with that Avro version when
hudi-spark-bundle is used on a Spark 3.x cluster?

If someone can provide data points on the above (a spark-shell snippet for
gathering them is sketched at the bottom of this mail, below the quoted
thread), and we can convince ourselves that we can bundle a different
spark-avro version (even spark-avro:3.x on a Spark 2.x cluster), then I am
happy to reverse my position. Otherwise, if we might face a barrage of support
issues with NoClassDefFoundError/NoSuchMethodError etc., it's not worth it
IMO.

TBH, longer term I am looking into whether we can eliminate the Row -> Avro
conversion that we need spark-avro for in the first place (also sketched
below), but let's ignore that for the purposes of this discussion.

Thanks
Vinoth

On Wed, Feb 5, 2020 at 10:54 PM hmatu <[email protected]> wrote:

> Thanks for raising this! +1 to @Udit Mehrotra's point.
>
> It's right to recommend that users actually build their own Hudi jars with
> the Spark version they use. That avoids compatibility issues between users'
> local jars and the pre-built Hudi Spark version (2.4.4).
>
> Or can we remove "org.apache.spark:spark-avro_2.11:2.4.4"? The user's local
> env will already contain that external dependency if they use Avro.
>
> If not, running Hudi (release-0.5.1) is more complex for me; when using
> Delta Lake, it is simpler:
> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
>
>
> ------------------ Original ------------------
> From: "lamberken" <[email protected]>;
> Date: Thu, Feb 6, 2020 07:42 AM
> To: "dev" <[email protected]>;
> Subject: Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin
>
>
> Dear team,
>
> About this topic, there are some previous discussions in PR [1]. It's
> better to read them carefully before chiming in, thanks.
>
> Current state:
> Lamber-Ken: +1
> Udit Mehrotra: +1
> Bhavani Sudha: -1
> Vinoth Chandar: -1
>
> Thanks,
> Lamber-Ken
>
>
> At 2020-02-06 06:10:52, "lamberken" <[email protected]> wrote:
> >
> > Dear team,
> >
> > With the 0.5.1 version released, users need to add
> > `org.apache.spark:spark-avro_2.11:2.4.4` when starting the Hudi shell,
> > like below:
> > /---------------------------------------------------------------------/
> > spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
> >   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
> >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> > /---------------------------------------------------------------------/
> >
> > From the spark-avro guide [1], we know that the spark-avro module is
> > external; it is not included in spark-2.4.4-bin-hadoop2.7.tgz [2].
> > So it may be better to relocate the spark-avro dependency using the
> > maven-shade-plugin. If so, users can start Hudi the way the 0.5.0
> > version does:
> > /---------------------------------------------------------------------/
> > spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
> >   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
> >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> > /---------------------------------------------------------------------/
> >
> > I created a PR to fix this [3]; we may need to have more discussion about
> > it, any suggestion is welcome, thanks very much :)
> >
> > Current state:
> > @bhasudha : +1
> > @vinoth : -1
> >
> > [1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
> > [2] http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> > [3] https://github.com/apache/incubator-hudi/pull/1290
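
For the "data points" question above, here is a rough spark-shell sketch
(untested; output labels are illustrative) that reports which Avro version a
given Spark distribution actually puts on the classpath and which jar the
spark-avro classes are loaded from. Running it once on a Spark 2.4.4 cluster
and once on a Spark 3.x cluster would show whether the Avro versions diverge:

/---------------------------------------------------------------------------/
// Paste into spark-shell on the target cluster (with the hudi bundle and/or
// spark-avro on the classpath, e.g. via --packages).

// 1. Avro version shipped with this Spark distribution. The manifest entry
//    can be null for some builds, so also print the jar location.
val avroClass = classOf[org.apache.avro.Schema]
println("avro version: " + avroClass.getPackage.getImplementationVersion)
println("avro jar:     " + avroClass.getProtectionDomain.getCodeSource.getLocation)

// 2. Where the spark-avro classes come from (the hudi bundle, --packages,
//    or a spark-avro shipped with the distribution, if any).
val sparkAvroClass = Class.forName("org.apache.spark.sql.avro.SchemaConverters")
println("spark-avro jar: " + sparkAvroClass.getProtectionDomain.getCodeSource.getLocation)
/---------------------------------------------------------------------------/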

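And on the longer-term point: the spark-avro dependency is pulled in mainly
for the Row -> Avro conversion mentioned above. The snippet below is a minimal
sketch of that coupling (not Hudi's actual code), assuming the spark-avro
2.4.x SchemaConverters API; a mismatch between the Avro version spark-avro was
built against and the Avro jars on the cluster would typically surface right
here as NoSuchMethodError/NoClassDefFoundError:

/---------------------------------------------------------------------------/
// Run in spark-shell with spark-avro on the classpath.
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

// A Catalyst schema, as it would come from a DataFrame being written.
val structType: StructType = StructType(Seq(
  StructField("key", StringType, nullable = false),
  StructField("ts", LongType, nullable = true)))

// spark-avro builds the Avro schema using whatever org.apache.avro classes
// are on the classpath at runtime; this is the cross-version coupling.
val avroSchema: Schema = SchemaConverters.toAvroType(
  structType, nullable = false, recordName = "record", nameSpace = "hoodie.example")

println(avroSchema.toString(true))
/---------------------------------------------------------------------------/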