Thanks for raising this! +1 to @Udit Mehrotra's point.
It's right to recommend that users build their own Hudi jars against the Spark version they use. That avoids compatibility issues between a user's local jars and the Spark version Hudi is pre-built with (2.4.4). Alternatively, could we remove "org.apache.spark:spark-avro_2.11:2.4.4" from the instructions? A user's local environment will already contain that external dependency if they use Avro. If not, running Hudi (release-0.5.1) is more complex for me; by comparison, Delta Lake is simpler: just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0". (A rough sketch of the build-your-own-bundle approach is appended below the quoted mail.)

------------------ Original ------------------
From: "lamberken" <[email protected]>
Date: Thu, Feb 6, 2020 07:42 AM
To: "dev" <[email protected]>
Subject: Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

Dear team,

About this topic, there are some previous discussions in PR [1]. It's better to read them carefully before chiming in, thanks.

Current state:
Lamber-Ken: +1
Udit Mehrotra: +1
Bhavani Sudha: -1
Vinoth Chandar: -1

Thanks,
Lamber-Ken

At 2020-02-06 06:10:52, "lamberken" <[email protected]> wrote:
>
>Dear team,
>
>With the 0.5.1 version released, users need to add `org.apache.spark:spark-avro_2.11:2.4.4` when starting a Hudi session, like below:
>/-----------------------------------------------------------------------------/
>spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
>  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>/-----------------------------------------------------------------------------/
>
>From the spark-avro guide [1], we know that the spark-avro module is external; it is not included in spark-2.4.4-bin-hadoop2.7.tgz [2].
>So it may be better to relocate the spark-avro dependency using maven-shade-plugin. If so, users would start Hudi the same way the 0.5.0 version does:
>/-----------------------------------------------------------------------------/
>spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
>  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>/-----------------------------------------------------------------------------/
>
>I created a PR to fix this [3]. We may need more discussion about it; any suggestion is welcome, thanks very much :)
>Current state:
>@bhasudha: +1
>@vinoth: -1
>
>[1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
>[2] http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
>[3] https://github.com/apache/incubator-hudi/pull/1290
>
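
For reference, here is a rough sketch of the build-your-own-bundle approach mentioned above. The tag name, the -Dspark.version property, and the jar path are assumptions on my side; please verify them against the 0.5.1 source tree and its root pom.xml before relying on them.
/-----------------------------------------------------------------------------/
# Clone the source and build the spark bundle against the Spark version you run.
# Assumptions: the release tag name and the -Dspark.version property; check the
# repository and root pom.xml for the exact values.
git clone https://github.com/apache/incubator-hudi.git
cd incubator-hudi
git checkout release-0.5.1-incubating
mvn clean package -DskipTests -DskipITs -Dspark.version=2.4.4

# Start spark-shell with the locally built bundle instead of --packages.
# The exact jar path/name may differ depending on the build.
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --jars packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.5.1-incubating.jar \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
/-----------------------------------------------------------------------------/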

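And a quick way to check whether a given hudi-spark-bundle jar actually ships (or relocates) the spark-avro classes, which is what the shading discussion hinges on; the jar path is the same assumption as above.
/-----------------------------------------------------------------------------/
# List the bundle contents and look for spark-avro classes
# (package org.apache.spark.sql.avro). If the classes were relocated by
# maven-shade-plugin they would show up under a shaded prefix; an empty result
# means the bundle does not ship spark-avro and it must be supplied separately,
# e.g. via --packages org.apache.spark:spark-avro_2.11:2.4.4.
jar tf packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.5.1-incubating.jar \
  | grep -i 'spark/sql/avro'
/-----------------------------------------------------------------------------/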