If there are no more comments/objections, we could rework the PR based on the discussion here.
Points made by Udit are also pretty valid. Thanks for the constructive conversation. :)

On Wed, Feb 19, 2020 at 3:12 PM lamberken <[email protected]> wrote:

> @Vinoth, glad to see your reply.
>
> >> SchemaConverters does import things like types
> I checked the git history of the package "org.apache.spark.sql.types"; it hasn't changed in a year, which suggests that Spark does not change types often.
>
> >> let's have a flag in maven to skip
> Good suggestion: bundle it by default, like we bundle com.databricks:spark-avro_2.11. But how to use maven-shade-plugin with such a flag needs some study.
>
> Also, looking forward to others' thoughts.
>
> Thanks,
> Lamber-Ken
>
> At 2020-02-20 03:50:12, "Vinoth Chandar" <[email protected]> wrote:
> >Apologies for the delayed response.
> >
> >I think SchemaConverters does import things like types, and those will be tied to the Spark version. If there are new types, for example, our bundled spark-avro may not recognize them.
> >
> >import org.apache.spark.sql.catalyst.util.RandomUUIDGenerator
> >import org.apache.spark.sql.types._
> >import org.apache.spark.sql.types.Decimal.{maxPrecisionForBytes, minBytesForPrecision}
> >
> >I also verified that we are bundling avro in the spark-bundle. So on that part we are in the clear.
> >
> >Here is what I suggest: let's try bundling, in the hope that it works, i.e. Spark does not change types etc. often and spark-avro interplays well. But let's have a flag in maven to skip this bundling if need be. We should doc this clearly in the build instructions in the README?
> >
> >What do others think?
> >
> >On Sat, Feb 15, 2020 at 10:54 PM lamberken <[email protected]> wrote:
> >
> >> Hi @Vinoth, sorry for the delay; I wanted to make sure the following analysis is correct.
> >>
> >> In the hudi project, the spark-avro module is only used for converting between Spark's struct type and Avro schema, via just two methods, `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`; both live in the `org.apache.spark.sql.avro.SchemaConverters` class.
> >>
> >> Analysis:
> >> 1. The `SchemaConverters` class is the same in spark master[1] and branch-3.0[2].
> >> 2. From the import statements in `SchemaConverters`, we can see that it doesn't depend on any other class in the spark-avro module. Also, I tried moving it into the hudi project under a different package, and it compiles fine.
> >>
> >> Using the hudi jar with the shaded spark-avro module:
> >> 1. spark-2.4.4-bin-hadoop2.7: everything is ok (create, upsert)
> >> 2. spark-3.0.0-preview2-bin-hadoop2.7: everything is ok (create, upsert)
> >>
> >> So shading spark-avro is safe and gives a better user experience, and we won't need to shade it once the spark-avro module is no longer external in the spark project.
> >>
> >> Thanks,
> >> Lamber-Ken
> >>
> >> [1] https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
> >> [2] https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
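For reference, a minimal sketch of the round trip those two methods provide, assuming Spark 2.4.x's spark-avro is on the classpath (the field names below are purely illustrative, not Hudi's actual usage):

    import org.apache.avro.Schema
    import org.apache.spark.sql.avro.SchemaConverters
    import org.apache.spark.sql.types._

    // Spark struct type -> Avro schema (the direction used when writing records).
    val structType = new StructType()
      .add("uuid", StringType, nullable = false)
      .add("ts", LongType, nullable = false)

    val avroSchema: Schema =
      SchemaConverters.toAvroType(structType, nullable = false, recordName = "example_record")

    // Avro schema -> Spark SQL type; toSqlType returns a SchemaType(dataType, nullable) wrapper.
    val sqlType: DataType = SchemaConverters.toSqlType(avroSchema).dataType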
> >> At 2020-02-14 10:30:35, "Vinoth Chandar" <[email protected]> wrote:
> >> >Just kicking this thread again, to make forward progress :)
> >> >
> >> >On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar <[email protected]> wrote:
> >> >
> >> >> First of all, no apologies, no feeling bad. We are all having fun here. :)
> >> >>
> >> >> I think we are all on the same page on the tradeoffs here; let's see if we can decide one way or the other.
> >> >>
> >> >> Bundling spark-avro gives a better user experience: one less package to remember adding. But even with the valid points raised by Udit and hmatu, I am still worried about specific things in spark-avro that may not be compatible with the Spark version. Can someone analyze how coupled spark-avro is with the rest of Spark? For example, what if Spark 3.x uses a different avro version than Spark 2.4.4, and when hudi-spark-bundle is used on a Spark 3.x cluster, the bundled spark-avro:2.4.4 won't work with that avro version?
> >> >>
> >> >> If someone can provide data points on the above and we can convince ourselves that we can bundle a different spark-avro version (even spark-avro:3.x on a Spark 2.x cluster), then I am happy to reverse my position. Otherwise, if we might face a barrage of support issues with NoClassDefFoundError/NoSuchMethodError etc., it's not worth it IMO.
> >> >>
> >> >> TBH, longer term I am looking into whether we can eliminate the need for the Row -> Avro conversion that we need spark-avro for. But let's ignore that for the purposes of this discussion.
> >> >>
> >> >> Thanks
> >> >> Vinoth
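To make the compatibility worry above concrete: a mismatch between the bundled spark-avro and the cluster's Spark would only surface at runtime, when the converter meets something it does not handle. A minimal sketch of that failure mode, assuming Spark 2.4.x's SchemaConverters (CalendarIntervalType is only a stand-in for "a type the bundled converter does not know about"):

    import org.apache.spark.sql.avro.SchemaConverters
    import org.apache.spark.sql.types._

    // A struct containing a type that the 2.4.x converter does not map to Avro.
    val schema = new StructType()
      .add("id", LongType, nullable = false)
      .add("gap", CalendarIntervalType) // stand-in for an unsupported/newer type

    // Expected to throw an IncompatibleSchemaException at runtime with the 2.4.x
    // converter, i.e. the mismatch shows up as a job failure, not a build-time error.
    SchemaConverters.toAvroType(schema, nullable = false)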
> >> >>
> >> >> On Wed, Feb 5, 2020 at 10:54 PM hmatu <[email protected]> wrote:
> >> >>
> >> >>> Thanks for raising this! +1 to @Udit Mehrotra's point.
> >> >>>
> >> >>> It's right to recommend that users build their own hudi jars, with the Spark version they use. That avoids compatibility issues between the user's local jars and the pre-built hudi Spark version (2.4.4).
> >> >>>
> >> >>> Or can we remove "org.apache.spark:spark-avro_2.11:2.4.4"? The user's local env will already contain that external dependency if they use avro.
> >> >>>
> >> >>> If not, running hudi (release-0.5.1) is more complex for me; when using Delta Lake it's simpler:
> >> >>> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
> >> >>>
> >> >>> ------------------ Original ------------------
> >> >>> From: "lamberken"<[email protected]>;
> >> >>> Date: Thu, Feb 6, 2020 07:42 AM
> >> >>> To: "dev"<[email protected]>;
> >> >>> Subject: Re:[DISCUSS] Relocate spark-avro dependency by maven-shade-plugin
> >> >>>
> >> >>> Dear team,
> >> >>>
> >> >>> About this topic, there are some previous discussions in PR[1]. It's better to read them carefully before chiming in, thanks.
> >> >>>
> >> >>> Current State:
> >> >>> Lamber-Ken: +1
> >> >>> Udit Mehrotra: +1
> >> >>> Bhavani Sudha: -1
> >> >>> Vinoth Chandar: -1
> >> >>>
> >> >>> Thanks,
> >> >>> Lamber-Ken
> >> >>>
> >> >>> At 2020-02-06 06:10:52, "lamberken" <[email protected]> wrote:
> >> >>> >
> >> >>> >Dear team,
> >> >>> >
> >> >>> >With the 0.5.1 version released, users need to add `org.apache.spark:spark-avro_2.11:2.4.4` when starting hudi, like below:
> >> >>> >
> >> >>> >/-------------------------------------------------------------------------------/
> >> >>> >spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
> >> >>> >  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
> >> >>> >  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> >> >>> >/-------------------------------------------------------------------------------/
> >> >>> >
> >> >>> >From the spark-avro guide[1], we know that the spark-avro module is external; it is not included in spark-2.4.4-bin-hadoop2.7.tgz[2].
> >> >>> >So it may be better to relocate the spark-avro dependency using maven-shade-plugin. If so, users will start hudi just like the 0.5.0 version does:
> >> >>> >
> >> >>> >/-------------------------------------------------------------------------------/
> >> >>> >spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
> >> >>> >  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
> >> >>> >  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> >> >>> >/-------------------------------------------------------------------------------/
> >> >>> >
> >> >>> >I created a PR to fix this[3]; we may need more discussion about it. Any suggestion is welcome, thanks very much :)
> >> >>> >Current state:
> >> >>> >@bhasudha : +1
> >> >>> >@vinoth : -1
> >> >>> >
> >> >>> >[1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
> >> >>> >[2] http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> >> >>> >[3] https://github.com/apache/incubator-hudi/pull/1290
