Apologies for the delayed response.
I think SchemaConverters does import things like types, and those will be
tied to the Spark version. If there are new types, for example, our bundled
spark-avro may not recognize them:
import org.apache.spark.sql.catalyst.util.RandomUUIDGenerator
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.Decimal.{maxPrecisionForBytes, minBytesForPrecision}
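
To make the coupling concrete, here is a throwaway sketch of the two entry
points we call (per the analysis quoted below). The object name is made up
and the signatures follow the Spark 2.4 spark-avro API, so treat it as
illustrative rather than definitive:

import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

object AvroSchemaRoundTrip {
  def main(args: Array[String]): Unit = {
    val struct = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))

    // Catalyst StructType -> Avro schema. toAvroType pattern-matches on
    // Spark's DataType, so a type added in a newer Spark release may not be
    // recognized by an older bundled spark-avro.
    val avroSchema: Schema = SchemaConverters.toAvroType(struct, nullable = false)

    // Avro schema -> Catalyst type (wrapped in SchemaConverters.SchemaType).
    val sqlType = SchemaConverters.toSqlType(avroSchema).dataType

    println(avroSchema.toString(true))
    println(sqlType.simpleString)
  }
}
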
I also verified that we are bundling avro in the spark-bundle. So on that
part we are in the clear.
Here is what I suggest: let's try bundling, in the hope that it works, i.e.
Spark does not change types etc. often and spark-avro interoperates cleanly.
But let's have a flag in Maven to skip this bundling if need be, along the
lines of the sketch below. We should document this clearly in the build
instructions in the README.
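
As a rough illustration of the flag (the profile name, property name, and
relocation prefix below are hypothetical, just to show the shape of it, not
the actual hudi-spark-bundle pom):

<!-- Hypothetical sketch: active by default; build with -DskipSparkAvroBundle -->
<!-- to leave spark-avro out, in which case users supply it themselves as today. -->
<profile>
  <id>bundle-spark-avro</id>
  <activation>
    <property>
      <name>!skipSparkAvroBundle</name>
    </property>
  </activation>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <configuration>
          <artifactSet>
            <includes>
              <include>org.apache.spark:spark-avro_2.11</include>
            </includes>
          </artifactSet>
          <relocations>
            <relocation>
              <pattern>org.apache.spark.sql.avro</pattern>
              <shadedPattern>org.apache.hudi.org.apache.spark.sql.avro</shadedPattern>
            </relocation>
          </relocations>
        </configuration>
      </plugin>
    </plugins>
  </build>
</profile>
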
What do others think?
On Sat, Feb 15, 2020 at 10:54 PM lamberken <[email protected]> wrote:
>
>
> Hi @Vinoth, sorry for the delay in making sure the following analysis is correct.
>
>
> In the hudi project, the spark-avro module is only used for converting
> between Spark's struct type and the Avro schema. Only two methods are used,
> `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`, and both live
> in the `org.apache.spark.sql.avro.SchemaConverters` class.
>
>
> Analysis:
> 1. The `SchemaConverters` class is the same in the spark master branch[1] and
> branch-3.0[2].
> 2. From the import statements in `SchemaConverters`, we can see that it does
> not depend on any other class in the spark-avro module. Also, I tried moving
> it into the hudi project under a different package, and it compiled fine.
>
>
> Using the hudi jar with the shaded spark-avro module:
> 1. spark-2.4.4-bin-hadoop2.7: everything is ok (create, upsert)
> 2. spark-3.0.0-preview2-bin-hadoop2.7: everything is ok (create, upsert)
>
>
> So shading spark-avro is safe and gives a better user experience, and we
> won't need to shade it once the spark-avro module is no longer external in
> the spark project.
>
>
> Thanks,
> Lamber-Ken
>
>
> [1] https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
> [2] https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>
>
> At 2020-02-14 10:30:35, "Vinoth Chandar" <[email protected]> wrote:
> >Just kicking this thread again, to make forward progress :)
> >
> >On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar <[email protected]> wrote:
> >
> >> First of all, no apologies, no feeling bad. We are all having fun here :)
> >>
> >> I think we are all on the same page on the tradeoffs here. Let's see if
> >> we can decide one way or the other.
> >>
> >> Bundling spark-avro gives a better user experience: one less package to
> >> remember adding. But even with the valid points raised by Udit and hmatu,
> >> I was just worried about specific things in spark-avro that may not be
> >> compatible with the Spark version. Can someone analyze how coupled
> >> spark-avro is with the rest of Spark? For example, what if Spark 3.x uses
> >> a different Avro version than Spark 2.4.4, and when hudi-spark-bundle is
> >> used on a Spark 3.x cluster, spark-avro:2.4.4 won't work with that Avro
> >> version?
> >>
> >> If someone can provide data points on the above, and if we can convince
> >> ourselves that we can bundle a different spark-avro version (even
> >> spark-avro:3.x on a Spark 2.x cluster), then I am happy to reverse my
> >> position. Otherwise, if we might face a barrage of support issues with
> >> NoClassDefFound/NoSuchMethodError etc., it's not worth it IMO.
> >>
> >> TBH, longer term, I am looking into whether we can eliminate the need for
> >> the Row -> Avro conversion that we need spark-avro for. But let's ignore
> >> that for the purposes of this discussion.
> >>
> >> Thanks
> >> Vinoth
> >>
> >>
> >>
> >> On Wed, Feb 5, 2020 at 10:54 PM hmatu <[email protected]> wrote:
> >>
> >>> Thanks for raising this! +1 to @Udit Mehrotra's point.
> >>>
> >>>
> >>> It's right to recommend that users build their own hudi jars with the
> >>> Spark version they use. That avoids compatibility issues between the
> >>> user's local jars and the Spark version hudi is pre-built against (2.4.4).
> >>>
> >>> Or can we remove "org.apache.spark:spark-avro_2.11:2.4.4"? The user's
> >>> local env will already contain that external dependency if they use Avro.
> >>>
> >>> If not, running hudi (release-0.5.1) is more complex for me; when using
> >>> Delta Lake it is simpler:
> >>> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
> >>>
> >>>
> >>>
> >>> ------------------ Original ------------------
> >>> From: "lamberken"<[email protected]>;
> >>> Date: Thu, Feb 6, 2020 07:42 AM
> >>> To: "dev"<[email protected]>;
> >>>
> >>> Subject: Re:[DISCUSS] Relocate spark-avro dependency by
> >>> maven-shade-plugin
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Dear team,
> >>>
> >>>
> >>> About this topic, there are some previous discussions in the PR[1]. It's
> >>> better to review them carefully before chiming in, thanks.
> >>>
> >>>
> >>> Current State:
> >>> Lamber-Ken: +1
> >>> Udit Mehrotra: +1
> >>> Bhavani Sudha: -1
> >>> Vinoth Chandar: -1
> >>>
> >>>
> >>> Thanks,
> >>> Lamber-Ken
> >>>
> >>>
> >>>
> >>> At 2020-02-06 06:10:52, "lamberken" <[email protected]> wrote:
> >>> >
> >>> >
> >>> >Dear team,
> >>> >
> >>> >
> >>> >With the 0.5.1 version released, users need to add
> >>> >`org.apache.spark:spark-avro_2.11:2.4.4` when starting the hudi command,
> >>> >like below:
> >>> >/-------------------------------------------------------------------/
> >>> >spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
> >>> >  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
> >>> >  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> >>> >/-------------------------------------------------------------------/
> >>> >
> >>> >
> >>> >From the spark-avro guide[1], we know that the spark-avro module is
> >>> >external; it does not exist in spark-2.4.4-bin-hadoop2.7.tgz[2].
> >>> >So it may be better to relocate the spark-avro dependency using the
> >>> >maven-shade-plugin. If so, users can start hudi the same way the 0.5.0
> >>> >version does:
> >>> >/-------------------------------------------------------------------/
> >>> >spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
> >>> >  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
> >>> >  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> >>> >/-------------------------------------------------------------------/
> >>> >
> >>> >
> >>> >I created a PR to fix this[3]. We may need to have more discussion about
> >>> >this; any suggestion is welcome, thanks very much :)
> >>> >Current state:
> >>> >@bhasudha : +1
> >>> >@vinoth : -1
> >>> >
> >>> >
> >>> >[1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
> >>> >[2] http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> >>> >[3] https://github.com/apache/incubator-hudi/pull/1290
> >>> >
> >>
> >>
>