Thank you all, I have updated the PR [1]


Thanks
Lamber-Ken


[1] https://github.com/apache/incubator-hudi/pull/1290




At 2020-02-21 02:33:50, "Vinoth Chandar" <[email protected]> wrote:
>If there are no more comments/objections, we could rework the PR based on
>the discussion here..
>
>Points made by Udit are also pretty valid..
>
>Thanks for the constructive conversation. :)
>
>On Wed, Feb 19, 2020 at 3:12 PM lamberken <[email protected]> wrote:
>
>>
>>
>> @Vinoth, glad to see your reply.
>>
>>
>> >> SchemaConverters does import things like types
>> I checked the git history of the "org.apache.spark.sql.types" package; it
>> hasn't changed in a year, which suggests that spark does not change types
>> often.
>>
>>
>> >> let's have a flag in maven to skip
>> Good suggestion: bundle it by default, just like we bundle
>> com.databricks:spark-avro_2.11 today.
>> But how to use the maven-shade-plugin with such a flag still needs some study.
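>>
>> To make that concrete, here is a rough sketch of what the flag could look
>> like. The profile and property names below are hypothetical placeholders,
>> not an agreed design: a profile in the spark bundle pom that pulls
>> spark-avro into the shade plugin's artifact set and stays active unless a
>> skip property is passed on the command line.
>>
>> <!-- packaging/hudi-spark-bundle/pom.xml, illustrative sketch only -->
>> <profile>
>>   <id>bundle-spark-avro</id>
>>   <activation>
>>     <!-- active by default; disabled via -Dskip.spark.avro.bundle -->
>>     <property><name>!skip.spark.avro.bundle</name></property>
>>   </activation>
>>   <build>
>>     <plugins>
>>       <plugin>
>>         <groupId>org.apache.maven.plugins</groupId>
>>         <artifactId>maven-shade-plugin</artifactId>
>>         <configuration>
>>           <artifactSet>
>>             <includes>
>>               <!-- pull spark-avro into the bundle jar -->
>>               <include>org.apache.spark:spark-avro_2.11</include>
>>             </includes>
>>           </artifactSet>
>>         </configuration>
>>       </plugin>
>>     </plugins>
>>   </build>
>> </profile>
>>
>> With something like that, the default build would bundle spark-avro, and a
>> user who wants the old behaviour would build with
>> `mvn clean package -Dskip.spark.avro.bundle`.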
>>
>>
>> Also, looking forward to others' thoughts.
>>
>>
>> Thanks,
>> Lamber-Ken
>>
>>
>>
>>
>>
>> At 2020-02-20 03:50:12, "Vinoth Chandar" <[email protected]> wrote:
>> >Apologies for the delayed response..
>> >
>> >I think SchemaConverters does import things like types, and those will be
>> >tied to the spark version. If there are new types, for example, our bundled
>> >spark-avro may not recognize them:
>> >
>> >import org.apache.spark.sql.catalyst.util.RandomUUIDGenerator
>> >import org.apache.spark.sql.types._
>> >import org.apache.spark.sql.types.Decimal.{maxPrecisionForBytes,
>> >minBytesForPrecision}
>> >
>> >
>> >I also verified that we are bundling avro in the spark-bundle.. So, that
>> >part we are in the clear.
>> >
>> >Here is what I suggest.. let's try bundling in the hope that it works, i.e.
>> >spark does not change types etc. often and spark-avro interplays nicely.
>> >But let's have a flag in maven to skip this bundling if need be.. We should
>> >doc this clearly in the build instructions in the README?
>> >
>> >What do others think?
>> >
>> >
>> >
>> >On Sat, Feb 15, 2020 at 10:54 PM lamberken <[email protected]> wrote:
>> >
>> >>
>> >>
>> >> Hi @Vinoth, sorry for the delay; I wanted to make sure the following
>> >> analysis is correct.
>> >>
>> >>
>> >> In the hudi project, the spark-avro module is only used for converting
>> >> between spark's struct type and avro schema. Only two methods are used,
>> >> `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`, both in
>> >> the `org.apache.spark.sql.avro.SchemaConverters` class.
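>> >>
>> >> For reference, a minimal sketch of how these two methods are used (the
>> >> struct fields below are made up just to show the round trip; this is not
>> >> the actual hudi code):
>> >>
>> >> import org.apache.avro.Schema
>> >> import org.apache.spark.sql.avro.SchemaConverters
>> >> import org.apache.spark.sql.types._
>> >>
>> >> // spark struct type -> avro schema
>> >> val structType = StructType(Seq(
>> >>   StructField("key", StringType, nullable = false),
>> >>   StructField("ts", LongType, nullable = true)))
>> >> val avroSchema: Schema =
>> >>   SchemaConverters.toAvroType(structType, nullable = false,
>> >>     recordName = "hoodie_record")
>> >>
>> >> // avro schema -> spark struct type
>> >> val sqlType: DataType = SchemaConverters.toSqlType(avroSchema).dataType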
>> >>
>> >>
>> >> Analysis:
>> >> 1. The `SchemaConverters` class is the same in spark master[1] and
>> >> branch-3.0[2].
>> >> 2. From the import statements in `SchemaConverters`, we can see that
>> >>    `SchemaConverters` doesn't depend on any other class in the spark-avro
>> >>    module. Also, I tried moving it into the hudi project under a different
>> >>    package, and it compiles.
>> >>
>> >>
>> >> Using the hudi jar with the shaded spark-avro module:
>> >> 1. spark-2.4.4-bin-hadoop2.7, everything is ok (create, upsert)
>> >> 2. spark-3.0.0-preview2-bin-hadoop2.7, everything is ok (create, upsert)
>> >>
>> >>
>> >> So shading spark-avro is safe and gives a better user experience, and we
>> >> won't need to shade it anymore once the spark-avro module is no longer
>> >> external to the spark project.
>> >>
>> >>
>> >> Thanks,
>> >> Lamber-Ken
>> >>
>> >>
>> >> [1] https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>> >> [2] https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> At 2020-02-14 10:30:35, "Vinoth Chandar" <[email protected]> wrote:
>> >> >Just kicking this thread again, to make forward progress :)
>> >> >
>> >> >On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar <[email protected]>
>> wrote:
>> >> >
>> >> >> First of all.. No apologies, no feeling bad.  We are all having fun
>> >> here..
>> >> >> :)
>> >> >>
>> >> >> I think we are all on the same page on the tradeoffs here.. let's see
>> >> >> if we can decide one way or the other.
>> >> >>
>> >> >> Bundling spark-avro gives a better user experience, one less package
>> >> >> to remember adding. But even with the valid points raised by udit and
>> >> >> hmatu, I was just worried about specific things in spark-avro that may
>> >> >> not be compatible with the spark version.. Can someone analyze how
>> >> >> coupled spark-avro is with the rest of spark? For e.g., what if spark
>> >> >> 3.x uses a different avro version than spark 2.4.4, and when
>> >> >> hudi-spark-bundle is used in a spark 3.x cluster, the spark-avro:2.4.4
>> >> >> won't work with that avro version?
>> >> >>
>> >> >> If someone can provide data points on the above and we can convince
>> >> >> ourselves that we can bundle a different spark-avro version (even
>> >> >> spark-avro:3.x on a spark 2.x cluster), then I am happy to reverse my
>> >> >> position. Otherwise, if we might face a barrage of support issues with
>> >> >> NoClassDefFound / NoSuchMethodError etc., it's not worth it IMO ..
>> >> >>
>> >> >> TBH, longer term I am looking into whether we can eliminate the need
>> >> >> for the Row -> Avro conversion that we need spark-avro for. But let's
>> >> >> ignore that for the purposes of this discussion.
>> >> >>
>> >> >> Thanks
>> >> >> Vinoth
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Wed, Feb 5, 2020 at 10:54 PM hmatu <[email protected]> wrote:
>> >> >>
>> >> >>> Thanks for raising this! +1 to @Udit Mehrotra's point.
>> >> >>>
>> >> >>>
>> >> >>> It's right to recommend that users build their own hudi jars with the
>> >> >>> spark version they use. It avoids the compatibility issues between a
>> >> >>> user's local jars and the spark version (2.4.4) the pre-built hudi
>> >> >>> jars are based on.
>> >> >>>
>> >> >>> Or can we remove "org.apache.spark:spark-avro_2.11:2.4.4"? Because a
>> >> >>> user's local env will already contain that external dependency if they
>> >> >>> use avro.
>> >> >>>
>> >> >>> If not, running hudi (release-0.5.1) is more complex for me; when using
>> >> >>> Delta Lake, it's simpler:
>> >> >>> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> ------------------ Original ------------------
>> >> >>> From: "lamberken" <[email protected]>
>> >> >>> Date: Thu, Feb 6, 2020 07:42 AM
>> >> >>> To: "dev" <[email protected]>
>> >> >>>
>> >> >>> Subject: Re: [DISCUSS] Relocate spark-avro dependency by
>> >> >>> maven-shade-plugin
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Dear team,
>> >> >>>
>> >> >>>
>> >> >>> About this topic, there are some previous discussions in the PR[1].
>> >> >>> It's better to read them carefully before chiming in, thanks.
>> >> >>>
>> >> >>>
>> >> >>> Current State:
>> >> >>> Lamber-Ken: +1
>> >> >>> Udit Mehrotra: +1
>> >> >>> Bhavani Sudha: -1
>> >> >>> Vinoth Chandar: -1
>> >> >>>
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Lamber-Ken
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> At 2020-02-06 06:10:52, "lamberken" <[email protected]> wrote:
>> >> >>> >
>> >> >>> >
>> >> >>> >Dear team,
>> >> >>> >
>> >> >>> >
>> >> >>> >With the 0.5.1 version released, users need to add
>> >> >>> >`org.apache.spark:spark-avro_2.11:2.4.4` when starting the hudi
>> >> >>> >command, like below:
>> >> >>> >
>> >> >>> >/-------------------------------------------------------------------/
>> >> >>> >spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>> >> >>> >  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
>> >> >>> >  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>> >> >>> >/-------------------------------------------------------------------/
>> >> >>> >
>> >> >>> >
>> >> >>> >From the spark-avro guide[1], we know that the spark-avro module is
>> >> >>> >external; it does not exist in spark-2.4.4-bin-hadoop2.7.tgz.
>> >> >>> >So it may be better to relocate the spark-avro dependency by using the
>> >> >>> >maven-shade-plugin. If so, users will start hudi like the 0.5.0
>> >> >>> >version does:
>> >> >>> >
>> >> >>> >/-------------------------------------------------------------------/
>> >> >>> >spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>> >> >>> >  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
>> >> >>> >  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>> >> >>> >/-------------------------------------------------------------------/
>> >> >>> >
>> >> >>> >
>> >> >>> >I created a PR to fix this[3]; we may need to have more discussion
>> >> >>> >about this. Any suggestion is welcome, thanks very much :)
>> >> >>> >Current state:
>> >> >>> >@bhasudha : +1
>> >> >>> >@vinoth   : -1
>> >> >>> >
>> >> >>> >
>> >> >>> >[1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
>> >> >>> >[2] http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
>> >> >>> >[3] https://github.com/apache/incubator-hudi/pull/1290
>> >> >>> >
>> >> >>
>> >> >>
>> >>
>>
