Just kicking this thread again, to make forward progress :)

On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar <[email protected]> wrote:
> First of all.. no apologies, no feeling bad. We are all having fun here.. :)
>
> I think we are all on the same page on the tradeoffs here.. let's see if
> we can decide one way or the other.
>
> Bundling spark-avro gives a better user experience: one less package to
> remember adding. But even with the valid points raised by Udit and hmatu,
> I am worried that specific things in spark-avro may not be compatible with
> the Spark version in use. Can someone analyze how coupled spark-avro is
> with the rest of Spark? For example, what if Spark 3.x uses a different
> Avro version than Spark 2.4.4, and when hudi-spark-bundle is used on a
> Spark 3.x cluster, the bundled spark-avro:2.4.4 won't work with that Avro
> version?
>
> If someone can provide data points on the above, and we can convince
> ourselves that we can bundle a different spark-avro version (even
> spark-avro:3.x on a Spark 2.x cluster), then I am happy to reverse my
> position. Otherwise, if we might face a barrage of support issues with
> NoClassDefFoundError / NoSuchMethodError etc., it's not worth it IMO.
>
> TBH, longer term I am looking into whether we can eliminate the Row ->
> Avro conversion that we need spark-avro for. But let's ignore that for
> the purposes of this discussion.
>
> Thanks
> Vinoth
>
> On Wed, Feb 5, 2020 at 10:54 PM hmatu <[email protected]> wrote:
>
>> Thanks for raising this! +1 to @Udit Mehrotra's point.
>>
>> It's right to recommend that users build their own Hudi jars with the
>> Spark version they use. That avoids compatibility issues between users'
>> local jars and the Spark version Hudi was pre-built against (2.4.4).
>>
>> Or can we remove "org.apache.spark:spark-avro_2.11:2.4.4"? Users' local
>> environments will already contain that external dependency if they use
>> Avro.
>>
>> If not, running Hudi (release-0.5.1) is more complex for me. When using
>> Delta Lake, it is simpler:
>> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
>>
>> ------------------ Original ------------------
>> From: "lamberken" <[email protected]>
>> Date: Thu, Feb 6, 2020 07:42 AM
>> To: "dev" <[email protected]>
>> Subject: Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin
>>
>> Dear team,
>>
>> About this topic, there is some previous discussion in PR [1]. It's
>> better to read it carefully before chiming in, thanks.
>>
>> Current state:
>> Lamber-Ken: +1
>> Udit Mehrotra: +1
>> Bhavani Sudha: -1
>> Vinoth Chandar: -1
>>
>> Thanks,
>> Lamber-Ken
>>
>> At 2020-02-06 06:10:52, "lamberken" <[email protected]> wrote:
>> >
>> > Dear team,
>> >
>> > With the 0.5.1 version released, users need to add
>> > `org.apache.spark:spark-avro_2.11:2.4.4` when starting Hudi, like
>> > below:
>> >
>> > /--------------------------------------------------------------/
>> > spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>> >   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
>> >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>> > /--------------------------------------------------------------/
>> >
>> > From the spark-avro guide [1], we know that the spark-avro module is
>> > external; it does not ship in spark-2.4.4-bin-hadoop2.7.tgz [2].
>> > So maybe it's better to relocate the spark-avro dependency using
>> > maven-shade-plugin. If so, users can start Hudi the way the 0.5.0
>> > release does:
>> >
>> > /--------------------------------------------------------------/
>> > spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>> >   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
>> >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>> > /--------------------------------------------------------------/
>> >
>> > I created a PR to fix this [3]; we may need more discussion about it.
>> > Any suggestions are welcome, thanks very much :)
>> >
>> > Current state:
>> > @bhasudha: +1
>> > @vinoth: -1
>> >
>> > [1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
>> > [2] http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
>> > [3] https://github.com/apache/incubator-hudi/pull/1290
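For context on what "relocate spark-avro by maven-shade-plugin" means in
practice, here is a minimal sketch of bundling spark-avro into
hudi-spark-bundle with a class relocation. This illustrates the technique
under discussion, not the exact change in PR 1290; the shaded package
prefix org.apache.hudi.shaded is an assumption chosen for the example.

/--------------------------------------------------------------/
<!-- Sketch only: bundle spark-avro into the hudi-spark-bundle jar and
     relocate its classes. The shaded prefix below is illustrative. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <artifactSet>
          <includes>
            <!-- pull the spark-avro classes into the bundle jar -->
            <include>org.apache.spark:spark-avro_2.11</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <!-- rename the bundled classes so they cannot clash with a
                 spark-avro jar already on the cluster's classpath -->
            <pattern>org.apache.spark.sql.avro</pattern>
            <shadedPattern>org.apache.hudi.shaded.org.apache.spark.sql.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
/--------------------------------------------------------------/

Note that relocation only prevents classpath clashes with a spark-avro jar
already present on the cluster; it does not protect against the bundled
spark-avro calling Spark or Avro APIs that changed across versions, which
is the NoClassDefFoundError / NoSuchMethodError risk raised above.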
