Hi @Vinoth, sorry for the delay; I took some time to make sure the following analysis is correct.
In the Hudi project, the spark-avro module is only used for converting between Spark's struct type and Avro schema, via exactly two methods: `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`. Both live in the `org.apache.spark.sql.avro.SchemaConverters` class.

Analysis:
1. The `SchemaConverters` class is identical in spark master[1] and branch-3.0[2].
2. From the import statements in `SchemaConverters`, we can see it does not depend on any other class in the spark-avro module. I also tried moving it into the Hudi project under a different package, and it compiled fine.

Using the Hudi jar with the spark-avro module shaded:
1. spark-2.4.4-bin-hadoop2.7: everything is OK (create, upsert).
2. spark-3.0.0-preview2-bin-hadoop2.7: everything is OK (create, upsert).

So shading spark-avro is safe and gives a better user experience, and we won't need to shade it once the spark-avro module is no longer external in the Spark project.

Thanks,
Lamber-Ken

[1] https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
[2] https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala

At 2020-02-14 10:30:35, "Vinoth Chandar" <[email protected]> wrote:
>Just kicking this thread again, to make forward progress :)
>
>On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar <[email protected]> wrote:
>
>> First of all.. No apologies, no feeling bad. We are all having fun here..
>> :)
>>
>> I think we are all on the same page on the tradeoffs here.. let's see if
>> we can decide one way or other.
>>
>> Bundling spark-avro has better user experience, one less package to
>> remember adding. But even with the valid points raised by udit and hmatu, I
>> was just worried about specific things in spark-avro that may not be
>> compatible with the spark version.. Can someone analyze how coupled
>> spark-avro is with rest of spark..
>> For e.g, what if the spark 3.x uses a
>> different avro version than spark 2.4.4 and when hudi-spark-bundle is used
>> in a spark 3.x cluster, the spark-avro:2.4.4 won't work with that avro
>> version?
>>
>> If someone can provide data points on the above and if we can convince
>> ourselves that we can bundle a different spark-avro version (even
>> spark-avro:3.x on spark 2.x cluster), then I am happy to reverse my
>> position. Otherwise, if we might face a barrage of support issues with
>> NoClassDefFound /NoSuchMethodError etc, its not worth it IMO ..
>>
>> TBH longer term, I am looking into if we can eliminate need for Row ->
>> Avro conversion that we need spark-avro for. But lets ignore that for
>> purposes of this discussion.
>>
>> Thanks
>> Vinoth
>>
>> On Wed, Feb 5, 2020 at 10:54 PM hmatu <[email protected]> wrote:
>>
>>> Thanks for raising this! +1 to @Udit Mehrotra's point.
>>>
>>> It's right that recommend users to actually build their own hudi jars,
>>> with the spark version they use. It avoid the compatibility issues
>>> between user's local jars and pre-built hudi spark version(2.4.4).
>>>
>>> Or can remove "org.apache.spark:spark-avro_2.11:2.4.4"? Because user
>>> local env will contains that external dependency if they use avro.
>>>
>>> If not, to run hudi(release-0.5.1) is more complex for me, when using
>>> Delta Lake, it's more simpler:
>>> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
>>>
>>> ------------------ Original ------------------
>>> From: "lamberken"<[email protected]>;
>>> Date: Thu, Feb 6, 2020 07:42 AM
>>> To: "dev"<[email protected]>;
>>>
>>> Subject: Re:[DISCUSS] Relocate spark-avro dependency by
>>> maven-shade-plugin
>>>
>>> Dear team,
>>>
>>> About this topic, there are some previous discussions in PR[1]. It's
>>> better to visit it carefully before chiming in, thanks.
>>>
>>> Current State:
>>> Lamber-Ken: +1
>>> Udit Mehrotra: +1
>>> Bhavani Sudha: -1
>>> Vinoth Chandar: -1
>>>
>>> Thanks,
>>> Lamber-Ken
>>>
>>> At 2020-02-06 06:10:52, "lamberken" <[email protected]> wrote:
>>> >
>>> >Dear team,
>>> >
>>> >With the 0.5.1 version released, user need to add
>>> `org.apache.spark:spark-avro_2.11:2.4.4` when starting hudi command, like
>>> bellow
>>> >/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
>>> >spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>>> >  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
>>> >  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>>> >/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
>>> >
>>> >From spark-avro-guide[1], we know that the spark-avro module is
>>> external, it is not exists in spark-2.4.4-bin-hadoop2.7.tgz.
>>> >So may it's better to relocate spark-avro dependency by using
>>> maven-shade-plugin. If so, user will starting hudi like 0.5.0 version does.
>>> >/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
>>> >spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>>> >  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
>>> >  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>>> >/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
>>> >
>>> >I created a pr to fix this[3], we may need have more discussion about
>>> this, any suggestion is welcome, thanks very much :)
>>> >Current state:
>>> >@bhasudha : +1
>>> >@vinoth : -1
>>> >
>>> >[1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
>>> >[2] http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
>>> >[3] https://github.com/apache/incubator-hudi/pull/1290
>>> >
>>
>>
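For reference, the two spark-avro entry points the analysis above hinges on can be exercised roughly like this. This is only a sketch, written against the `SchemaConverters` method signatures as they appear in Spark 2.4's spark-avro module; the field names are made up for illustration:

```scala
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Spark struct type -> Avro schema (what Hudi uses when writing records)
val structType = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))
val avroSchema = SchemaConverters.toAvroType(structType, nullable = false)

// Avro schema -> Spark SQL type (the round trip back)
val sqlType = SchemaConverters.toSqlType(avroSchema).dataType
```

Since neither call touches anything else in spark-avro, relocating just this class (or shading the whole module) should not pull in further Spark internals.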

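The relocation being proposed would look roughly like the fragment below in the hudi-spark-bundle pom. This is a sketch of standard maven-shade-plugin relocation syntax, not a copy of what PR #1290 actually does; the shaded prefix `org.apache.hudi.org.apache.spark.sql.avro` is an assumed naming convention:

```xml
<!-- Sketch only: relocate spark-avro classes into the Hudi bundle so users
     need not pass org.apache.spark:spark-avro on the command line.
     The shadedPattern prefix here is illustrative. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>org.apache.spark.sql.avro</pattern>
        <shadedPattern>org.apache.hudi.org.apache.spark.sql.avro</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

With a relocation like this, the bundled copy of `SchemaConverters` is rewritten to a Hudi-owned package at build time, so it cannot clash with whatever spark-avro version (if any) is on the user's cluster classpath.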