Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

2020-02-19 Thread Mehrotra, Udit
Here are my 2 cents on this. @Vinoth just to add to your points:

>I think SchemaConverters does import things like types and those will be
>tied to the spark version. If there are new types, for e.g., our bundled
>spark-avro may not recognize them.

If new types are added, our current implementation only uses the spark-avro 
module for the schema/type conversion. For any new type we would have to make 
changes in Hudi either way, to handle converting data of that new type to Avro 
in AvroConversionHelper.scala, because the actual data conversion code lives 
inside Hudi (a short sketch of the spark-avro calls involved follows below).
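
To make that split concrete, here is a minimal sketch of the two schema-level
spark-avro calls in question (the struct fields, record name and namespace are
purely illustrative, and spark-avro 2.4.x is assumed on the classpath):

```
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

// Illustrative struct type standing in for an incoming dataset's schema.
val structType: StructType = new StructType()
  .add("id", "string")
  .add("amount", "decimal(10,2)")

// Struct -> Avro: only the *schema* is produced here; converting the actual
// row data into Avro records is what AvroConversionHelper.scala does inside Hudi.
val avroSchema: Schema =
  SchemaConverters.toAvroType(structType, false, "hudi_record", "hoodie.example")

// Avro -> Struct: the reverse mapping, again schema-only.
val sqlType = SchemaConverters.toSqlType(avroSchema).dataType
```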

>I also verified that we are bundling avro in the spark-bundle.. So, that
>part we are in the clear.

We are not shading avro as far as I know; we only shade parquet-avro. Shading of 
avro happens only in hadoop-mr-bundle, so that it understands LogicalTypes, since 
Hive uses an old Avro version.

@lamber-ken

>> let's have a flag in maven to skip
https://github.com/apache/incubator-hudi/pull/957 is an example of how to do 
that with shading.

In general, since Hudi's conversion code is tightly coupled with the spark-avro 
version currently supported, i.e. spark 2.4.4, and since changes in spark-avro's 
schema conversion logic (for ex: the namespace issue we ran into) or new data 
types in spark would most likely mean additional work inside Hudi to handle the 
actual data conversion anyway, I think we are okay sticking with a shaded 
spark-avro. +1 to having a flag that gives customers the option to skip this; a 
rough sketch of what such a flag could look like is below.
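
For illustration only, such a flag could take roughly the following shape (the
profile and property names are hypothetical, and the real hudi-spark-bundle pom
already carries its own shade configuration, so treat this as a sketch rather
than a drop-in change): keep the spark-avro include and relocation in a profile
that is active by default and can be switched off from the command line.

```
<!-- Hypothetical sketch: bundle and relocate spark-avro unless -DskipBundleSparkAvro is passed. -->
<profile>
  <id>bundle-spark-avro</id>
  <activation>
    <!-- Active whenever the property is NOT defined, i.e. bundling stays the default. -->
    <property>
      <name>!skipBundleSparkAvro</name>
    </property>
  </activation>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <artifactSet>
                <includes>
                  <include>org.apache.spark:spark-avro_2.11</include>
                </includes>
              </artifactSet>
              <relocations>
                <relocation>
                  <pattern>org.apache.spark.sql.avro</pattern>
                  <shadedPattern>org.apache.hudi.org.apache.spark.sql.avro</shadedPattern>
                </relocation>
              </relocations>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
```

Building with `mvn package -DskipBundleSparkAvro` would then produce a bundle
without spark-avro, and users would add it themselves via --packages as they do
today.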

Thanks,
Udit

On 2/19/20, 3:13 PM, "lamberken"  wrote:



@Vinoth, glad to see your reply.


>> SchemaConverters does import things like types
I checked the git history of the "org.apache.spark.sql.types" package; it 
hasn't changed in a year, 
which suggests that spark does not change types often.


>> let's have a flag in maven to skip
Good suggestion. We can bundle it by default, just like we bundle 
com.databricks:spark-avro_2.11. 
But how to use maven-shade-plugin with such a flag needs some study.


Also, looking forward to others' thoughts.


Thanks,
Lamber-Ken





At 2020-02-20 03:50:12, "Vinoth Chandar"  wrote:
>Apologies for the delayed response..
>
>I think SchemaConverters does import things like types and those will be
>tied to the spark version. If there are new types, for e.g., our bundled
>spark-avro may not recognize them.
>
>import org.apache.spark.sql.catalyst.util.RandomUUIDGenerator
>import org.apache.spark.sql.types._
>import org.apache.spark.sql.types.Decimal.{maxPrecisionForBytes,
>minBytesForPrecision}
>
>
>I also verified that we are bundling avro in the spark-bundle.. So, that
>part we are in the clear.
>
>Here is what I suggest.. let's try bundling in the hope that it works i.e
>spark does not change types etc often and spark-avro interplays.
>But let's have a flag in maven to skip this bundling if need be.. We should
>doc this clearly in the build instructions in the README?
>
>What do others think?
>
>
>
>On Sat, Feb 15, 2020 at 10:54 PM lamberken  wrote:
>
>>
>>
>> Hi @Vinoth, sorry delay for ensure the following analysis is correct
>>
>>
>> In hudi project, spark-avro module is only used for converting between
>> spark's struct type and avro schema, only used two methods
>> `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`, these two
>> methods are in `org.apache.spark.sql.avro.SchemaConverters` class.
>>
>>
>> Analyse:
>> 1, the `SchemaConverters` class are same in spark-master[1] and
>> branch-3.0[2].
>> 2, from the import statements in `SchemaConverters`, we can learn that
>> `SchemaConverters` doesn't depend on
>>other class in spark-avro module.
>>Also, I tried to move it hudi project and use a different package,
>> compile go though.
>>
>>
>> Use the hudi jar with shaded spark-avro module:
>> 1, spark-2.4.4-bin-hadoop2.7, everything is ok(create, upsert)
>> 2, spark-3.0.0-preview2-bin-hadoop2.7, everything is ok(create, upsert)
>>
>>
>> So, if we shade the spark-avro is safe and will has better user
>> experience, and we needn't shade it when spark-avro module is not external
>> in spark project.
>>
>>
>> Thanks,
>> Lamber-Ken
>>
>>
>> [1]
>> 
https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>> [2]
>> 
https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>>
>>
>>
>>
>>
>>
>>
>> At 2020-02-14 10:30:35, "Vinoth Chandar"  wrote:
>> >Just kicking this thread again, to make forward progress :)
>> >
>

Re: Apache Hudi on AWS EMR

2020-02-19 Thread Bhavani Sudha Saktheeswaran
Got it. Thanks Udit!

On Wed, Feb 19, 2020 at 2:12 PM Mehrotra, Udit 
wrote:

> Hi Sudha,
>
> Yes, EMR Presto since the 5.28.0 release comes with the Hudi presto bundle jar
> present in its classpath. If you launch a cluster with Presto you should see it at:
>
> /usr/lib/presto/plugin/hive-hadoop2/hudi-presto-bundle.jar
>
> Thanks,
> Udit
>
>
> On 2/19/20, 1:53 PM, "Bhavani Sudha"  wrote:
>
> Hi Udit,
>
> Just a quick question on Presto EMR. Does EMR Presto support Hudi jars
> in
> its classpath ?
>
> On Tue, Feb 18, 2020 at 12:03 PM Mehrotra, Udit
> 
> wrote:
>
> > Workaround provided by Gary can help querying Hudi tables through
> Athena
> > for Copy On Write tables by basically querying only the latest
> commit files
> > as standard parquet. It would definitely be worth documenting, as
> several
> > people have asked for it and I remember providing the same
> suggestion on
> > slack earlier. I can add if I have the perms.
> >
> > >> if I connect to the Hive catalog on EMR, which is able to provide
> the
> > Hudi views correctly, I should be able to get correct results on
> Athena
> >
> > As Vinoth mentioned, just connecting to metastore is not enough.
> Athena
> > would still use its own Presto which does not support Hudi.
> >
> > As for Hudi support for Athena:
> > Athena does use Presto, but it's their own custom version and I don't
> > think they yet have the code that Hudi guys contributed to presto
> i.e. the
> > split annotations etc. Also they don’t have Hudi jars in presto
> classpath.
> > We are not sure of any timelines for this support, but I have heard
> that
> > work should start soon.
> >
> > Thanks,
> > Udit
> >
> > On 2/18/20, 11:27 AM, "Vinoth Chandar"  wrote:
> >
> > Thanks everyone for chiming in. Esp Gary for the detailed
> workaround..
> > (should we FAQ this workaround.. food for thought)
> >
> > >> if I connect to the Hive catalog on EMR, which is able to
> provide
> > the
> > Hudi views correctly, I should be able to get correct results on
> Athena
> >
> > Knowing how the Presto/Hudi integration works, simply being able
> to
> > read
> > from Hive metastore is not enough. Presto has code to specially
> > recognize
> > Hudi tables and does an additional filtering step, which lets it
> query
> > the
> > data in there correctly. (Gary's workaround above keeps just 1
> version
> > around for a given file (group))..
> >
> > On Mon, Feb 17, 2020 at 11:28 PM Gary Li <
> yanjia.gary...@gmail.com>
> > wrote:
> >
> > > Hello, I don't have any experience working with Athena but I
> can
> > share my
> > > experience working with Impala. There is a workaround.
> > > By setting Hudi config:
> > >
> > >- hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> > >- hoodie.cleaner.fileversions.retained=1
> > >
> > > You will have your Hudi dataset as same as plain parquet
> files. You
> > can
> > > create a table just like regular parquet. Hudi will write a new
> > commit
> > > first then delete the older files that have two versions. You
> need to
> > > refresh the table metadata store as soon as the Hudi Upsert job
> > finishes.
> > > For impala, it's simply REFRESH TABLE xxx. After Hudi vacuumed
> the
> > older
> > > files and before refresh the table metastore, the table will be
> > unavailable
> > > for query(1-5 mins in my case).
> > >
> > > How can we process S3 parquet files(hourly partitioned) through
> > Apache
> > > Hudi? Is there any streaming layer we need to introduce?
> > > ---
> > > Hudi Delta streamer support parquet file. You can do a
> bulkInsert
> > for the
> > > first job then use delta streamer for the Upsert job.
> > >
> > > 3 - What should be the parquet file size and row group size for
> > better
> > > performance on querying Hudi Dataset?
> > > --
> > > That depends on the query engine you are using and it should be
> > documented
> > > somewhere. For impala, the optimal size for query performance
> is
> > 256MB, but
> > > the larger file size will make upsert more expensive. The size
> I
> > personally
> > > choose is 100MB to 128MB.
> > >
> > > Thanks,
> > > Gary
> > >
> > >
> > >
> > > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu
> > 
> > > wrote:
> > >
> > > > Athena is indeed Presto inside, but there is lot of custom
> code
> > which has
> > > > gone on top of Presto there.
> > > > Couple months back I tried running a glue crawler to catalog
> a
> > Hudi data
>  

Re:Re: Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

2020-02-19 Thread lamberken


@Vinoth, glad to see your reply.


>> SchemaConverters does import things like types
I checked the git history of the "org.apache.spark.sql.types" package; it hasn't 
changed in a year, 
which suggests that spark does not change types often.


>> let's have a flag in maven to skip
Good suggestion. We can bundle it by default, just like we bundle 
com.databricks:spark-avro_2.11. 
But how to use maven-shade-plugin with such a flag needs some study.


Also, looking forward to others' thoughts.


Thanks,
Lamber-Ken





At 2020-02-20 03:50:12, "Vinoth Chandar"  wrote:
>Apologies for the delayed response..
>
>I think SchemaConverters does import things like types and those will be
>tied to the spark version. If there are new types, for e.g., our bundled
>spark-avro may not recognize them.
>
>import org.apache.spark.sql.catalyst.util.RandomUUIDGenerator
>import org.apache.spark.sql.types._
>import org.apache.spark.sql.types.Decimal.{maxPrecisionForBytes,
>minBytesForPrecision}
>
>
>I also verified that we are bundling avro in the spark-bundle.. So, that
>part we are in the clear.
>
>Here is what I suggest.. let's try bundling in the hope that it works i.e
>spark does not change types etc often and spark-avro interplays.
>But let's have a flag in maven to skip this bundling if need be.. We should
>doc this clearly in the build instructions in the README?
>
>What do others think?
>
>
>
>On Sat, Feb 15, 2020 at 10:54 PM lamberken  wrote:
>
>>
>>
>> Hi @Vinoth, sorry delay for ensure the following analysis is correct
>>
>>
>> In hudi project, spark-avro module is only used for converting between
>> spark's struct type and avro schema, only used two methods
>> `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`, these two
>> methods are in `org.apache.spark.sql.avro.SchemaConverters` class.
>>
>>
>> Analyse:
>> 1, the `SchemaConverters` class are same in spark-master[1] and
>> branch-3.0[2].
>> 2, from the import statements in `SchemaConverters`, we can learn that
>> `SchemaConverters` doesn't depend on
>>other class in spark-avro module.
>>Also, I tried to move it hudi project and use a different package,
>> compile go though.
>>
>>
>> Use the hudi jar with shaded spark-avro module:
>> 1, spark-2.4.4-bin-hadoop2.7, everything is ok(create, upsert)
>> 2, spark-3.0.0-preview2-bin-hadoop2.7, everything is ok(create, upsert)
>>
>>
>> So, if we shade the spark-avro is safe and will has better user
>> experience, and we needn't shade it when spark-avro module is not external
>> in spark project.
>>
>>
>> Thanks,
>> Lamber-Ken
>>
>>
>> [1]
>> https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>> [2]
>> https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>>
>>
>>
>>
>>
>>
>>
>> At 2020-02-14 10:30:35, "Vinoth Chandar"  wrote:
>> >Just kicking this thread again, to make forward progress :)
>> >
>> >On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar  wrote:
>> >
>> >> First of all.. No apologies, no feeling bad.  We are all having fun
>> here..
>> >> :)
>> >>
>> >> I think we are all on the same page on the tradeoffs here.. let's see if
>> >> we can decide one way or other.
>> >>
>> >> Bundling spark-avro has better user experience, one less package to
>> >> remember adding. But even with the valid points raised by udit and
>> hmatu, I
>> >> was just worried about specific things in spark-avro that may not be
>> >> compatible with the spark version.. Can someone analyze how coupled
>> >> spark-avro is with rest of spark.. For e.g, what if the spark 3.x uses a
>> >> different avro version than spark 2.4.4 and when hudi-spark-bundle is
>> used
>> >> in a spark 3.x cluster, the spark-avro:2.4.4 won't work with that avro
>> >> version?
>> >>
>> >> If someone can provide data points on the above and if we can convince
>> >> ourselves that we can bundle a different spark-avro version (even
>> >> spark-avro:3.x on spark 2.x cluster), then I am happy to reverse my
>> >> position. Otherwise, if we might face a barrage of support issues with
>> >> NoClassDefFound /NoSuchMethodError etc, its not worth it IMO ..
>> >>
>> >> TBH longer term, I am looking into if we can eliminate need for Row ->
>> >> Avro conversion that we need spark-avro for. But lets ignore that for
>> >> purposes of this discussion.
>> >>
>> >> Thanks
>> >> Vinoth
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, Feb 5, 2020 at 10:54 PM hmatu  wrote:
>> >>
>> >>> Thanks for raising this! +1 to @Udit Mehrotra's point.
>> >>>
>> >>>
>> >>>  It's right that recommend users to actually build their  own hudi
>> jars,
>> >>> with the spark version they use. It avoid the compatibility issues
>> >>>
>> >>> between user's local jars and pre-built hudi spark version(2.4.4).
>> >>>
>> >>> Or can remove "org.apache.spark:spark-avro_2.11:2.4.4"? Because user
>> >>> local env will contains that external depen

Re: Apache Hudi on AWS EMR

2020-02-19 Thread Mehrotra, Udit
Hi Sudha,

Yes, EMR Presto since the 5.28.0 release comes with the Hudi presto bundle jar 
present in its classpath. If you launch a cluster with Presto you should see it at:

/usr/lib/presto/plugin/hive-hadoop2/hudi-presto-bundle.jar

Thanks,
Udit


On 2/19/20, 1:53 PM, "Bhavani Sudha"  wrote:

Hi Udit,

Just a quick question on Presto EMR. Does EMR Presto support Hudi jars in
its classpath ?

On Tue, Feb 18, 2020 at 12:03 PM Mehrotra, Udit 
wrote:

> Workaround provided by Gary can help querying Hudi tables through Athena
> for Copy On Write tables by basically querying only the latest commit 
files
> as standard parquet. It would definitely be worth documenting, as several
> people have asked for it and I remember providing the same suggestion on
> slack earlier. I can add if I have the perms.
>
> >> if I connect to the Hive catalog on EMR, which is able to provide the
> Hudi views correctly, I should be able to get correct results on 
Athena
>
> As Vinoth mentioned, just connecting to metastore is not enough. Athena
> would still use its own Presto which does not support Hudi.
>
> As for Hudi support for Athena:
> Athena does use Presto, but it's their own custom version and I don't
> think they yet have the code that Hudi guys contributed to presto i.e. the
> split annotations etc. Also they don’t have Hudi jars in presto classpath.
> We are not sure of any timelines for this support, but I have heard that
> work should start soon.
>
> Thanks,
> Udit
>
> On 2/18/20, 11:27 AM, "Vinoth Chandar"  wrote:
>
> Thanks everyone for chiming in. Esp Gary for the detailed workaround..
> (should we FAQ this workaround.. food for thought)
>
> >> if I connect to the Hive catalog on EMR, which is able to provide
> the
> Hudi views correctly, I should be able to get correct results on 
Athena
>
> Knowing how the Presto/Hudi integration works, simply being able to
> read
> from Hive metastore is not enough. Presto has code to specially
> recognize
> Hudi tables and does an additional filtering step, which lets it query
> the
> data in there correctly. (Gary's workaround above keeps just 1 version
> around for a given file (group))..
>
> On Mon, Feb 17, 2020 at 11:28 PM Gary Li 
> wrote:
>
> > Hello, I don't have any experience working with Athena but I can
> share my
> > experience working with Impala. There is a workaround.
> > By setting Hudi config:
> >
> >- hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> >- hoodie.cleaner.fileversions.retained=1
> >
> > You will have your Hudi dataset as same as plain parquet files. You
> can
> > create a table just like regular parquet. Hudi will write a new
> commit
> > first then delete the older files that have two versions. You need 
to
> > refresh the table metadata store as soon as the Hudi Upsert job
> finishes.
> > For impala, it's simply REFRESH TABLE xxx. After Hudi vacuumed the
> older
> > files and before refresh the table metastore, the table will be
> unavailable
> > for query(1-5 mins in my case).
> >
> > How can we process S3 parquet files(hourly partitioned) through
> Apache
> > Hudi? Is there any streaming layer we need to introduce?
> > ---
> > Hudi Delta streamer support parquet file. You can do a bulkInsert
> for the
> > first job then use delta streamer for the Upsert job.
> >
> > 3 - What should be the parquet file size and row group size for
> better
> > performance on querying Hudi Dataset?
> > --
> > That depends on the query engine you are using and it should be
> documented
> > somewhere. For impala, the optimal size for query performance is
> 256MB, but
> > the larger file size will make upsert more expensive. The size I
> personally
> > choose is 100MB to 128MB.
> >
> > Thanks,
> > Gary
> >
> >
> >
> > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu
> 
> > wrote:
> >
> > > Athena is indeed Presto inside, but there is lot of custom code
> which has
> > > gone on top of Presto there.
> > > Couple months back I tried running a glue crawler to catalog a
> Hudi data
> > > set and then query it from Athena. The results were not same as
> what I
> > > would get with running the same query using spark SQL on EMR. Did
> not try
> > > Presto on EMR, but assuming it will work fine on EMR.
> > >
> > > Athena integration with Hudi data set is planned shortly, but not
> sure of
> > > t

Re: Apache Hudi on AWS EMR

2020-02-19 Thread Bhavani Sudha
Hi Udit,

Just a quick question on Presto EMR. Does EMR Presto support Hudi jars in
its classpath ?

On Tue, Feb 18, 2020 at 12:03 PM Mehrotra, Udit 
wrote:

> Workaround provided by Gary can help querying Hudi tables through Athena
> for Copy On Write tables by basically querying only the latest commit files
> as standard parquet. It would definitely be worth documenting, as several
> people have asked for it and I remember providing the same suggestion on
> slack earlier. I can add if I have the perms.
>
> >> if I connect to the Hive catalog on EMR, which is able to provide the
> Hudi views correctly, I should be able to get correct results on Athena
>
> As Vinoth mentioned, just connecting to metastore is not enough. Athena
> would still use its own Presto which does not support Hudi.
>
> As for Hudi support for Athena:
> Athena does use Presto, but it's their own custom version and I don't
> think they yet have the code that Hudi guys contributed to presto i.e. the
> split annotations etc. Also they don’t have Hudi jars in presto classpath.
> We are not sure of any timelines for this support, but I have heard that
> work should start soon.
>
> Thanks,
> Udit
>
> On 2/18/20, 11:27 AM, "Vinoth Chandar"  wrote:
>
> Thanks everyone for chiming in. Esp Gary for the detailed workaround..
> (should we FAQ this workaround.. food for thought)
>
> >> if I connect to the Hive catalog on EMR, which is able to provide
> the
> Hudi views correctly, I should be able to get correct results on Athena
>
> Knowing how the Presto/Hudi integration works, simply being able to
> read
> from Hive metastore is not enough. Presto has code to specially
> recognize
> Hudi tables and does an additional filtering step, which lets it query
> the
> data in there correctly. (Gary's workaround above keeps just 1 version
> around for a given file (group))..
>
> On Mon, Feb 17, 2020 at 11:28 PM Gary Li 
> wrote:
>
> > Hello, I don't have any experience working with Athena but I can
> share my
> > experience working with Impala. There is a workaround.
> > By setting Hudi config:
> >
> >- hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> >- hoodie.cleaner.fileversions.retained=1
> >
> > You will have your Hudi dataset as same as plain parquet files. You
> can
> > create a table just like regular parquet. Hudi will write a new
> commit
> > first then delete the older files that have two versions. You need to
> > refresh the table metadata store as soon as the Hudi Upsert job
> finishes.
> > For impala, it's simply REFRESH TABLE xxx. After Hudi vacuumed the
> older
> > files and before refresh the table metastore, the table will be
> unavailable
> > for query(1-5 mins in my case).
> >
> > How can we process S3 parquet files(hourly partitioned) through
> Apache
> > Hudi? Is there any streaming layer we need to introduce?
> > ---
> > Hudi Delta streamer support parquet file. You can do a bulkInsert
> for the
> > first job then use delta streamer for the Upsert job.
> >
> > 3 - What should be the parquet file size and row group size for
> better
> > performance on querying Hudi Dataset?
> > --
> > That depends on the query engine you are using and it should be
> documented
> > somewhere. For impala, the optimal size for query performance is
> 256MB, but
> > the larger file size will make upsert more expensive. The size I
> personally
> > choose is 100MB to 128MB.
> >
> > Thanks,
> > Gary
> >
> >
> >
> > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu
> 
> > wrote:
> >
> > > Athena is indeed Presto inside, but there is lot of custom code
> which has
> > > gone on top of Presto there.
> > > Couple months back I tried running a glue crawler to catalog a
> Hudi data
> > > set and then query it from Athena. The results were not same as
> what I
> > > would get with running the same query using spark SQL on EMR. Did
> not try
> > > Presto on EMR, but assuming it will work fine on EMR.
> > >
> > > Athena integration with Hudi data set is planned shortly, but not
> sure of
> > > the date yet.
> > >
> > > However, recently Athena started supporting integration to a Hive
> catalog
> > > apart from Glue. What that means is in Athena, if I connect to the
> Hive
> > > catalog on EMR, which is able to provide the Hudi views correctly,
> I
> > should
> > > be able to get correct results on Athena. Have not tested it
> though. The
> > > feature is in Preview already.
> > >
> > > Thanks
> > > Raghu
> > > -Original Message-
> > > From: Shiyan Xu 
> > > Sent: Tuesday, February 18, 2020 6:20 AM
> > > To: dev@hudi.apache.org
> > > Cc: Mehrotra, Udit ; Raghvendra Dhar Dubey
> > > 
> > > Subject: Re: Apache Hudi on AWS EMR
> > >
> > > For 

Re: Hudi on EMR syncing GLUE catalog issue

2020-02-19 Thread Igor Basko
Thanks a lot for the suggestion, will try it out.

On Wed, 19 Feb 2020 at 00:36, Mehrotra, Udit 
wrote:

> Hi Igor,
>
> As of the current implementation, Hudi submits queries like creating tables and
> syncing partitions directly to the Hive server instead of communicating directly
> with the metastore. Thus while launching the EMR cluster, you should install Hive
> on the cluster as well. Also enable the Glue catalog for both Spark and Hive and
> you should be fine.
>
> Thanks,
> Udit Mehrotra
> AWS | EMR
>
> On 2/18/20, 2:29 AM, "Igor Basko"  wrote:
>
> Hi Dear List,
> I'm trying to catalog Hudi files in GLUE catalog using the sync hive
> tool,
> while using the spark save function (and not the standalone version).
>
> I've created an EMR with Spark application only (without Hive). Also
> added
> the following hive metastore client factory class configuration:
> "hive.metastore.client.factory.class":
>
> "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
>
> I've started the spark-shell using the provided by EMR hudi jars, and
> also
> using the 0.5.1 version and they both gave me the "Cannot create hive
> connection ..." error when running the following code
> .
> (
> https://gist.github.com/igorbasko01/05d81fef8f39e305527fd24b946fdb9a)
>
> After looking inside HoodieSparkSqlWriter.scala in buildSyncConfig it
> seems
> that there is no way to override the HiveSyncConfig.useJdbc variable
> to be
> false,
> (
>
> https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L232
> )
> which means that in HoodieHiveClient constructor it will always try to
> createHiveConnection()
> (
>
> https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L111
> )
> Instead of creating a hive client from the configuration.
>
> The next thing I did was to add a parameter that would enable
> overriding
> the useJdbc variable.
> Used the custom hudi jar in the EMR, and was able to progress further.
> But
> got a different error down the line.
> What I was happy to see that apparently it was using the
> AWSGlueClientFactory:
> 20/02/17 13:55:17 INFO AWSGlueClientFactory: Using region from ec2
> metadata
> : eu-west-1
>
> And was able to detect that the table doesn't exists in GLUE:
> 20/02/17 13:55:18 INFO HiveSyncTool: Hive table drivers is not found.
> Creating it
>
> But I got the following exception:
> java.lang.NoClassDefFoundError: org/json/JSONException
>   at
>
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
>
> A partial log could be found here
> 
> (
> https://gist.github.com/igorbasko01/612f773632cb8382014166e0ed2a06d3)
>
> As it seems to me, in the case of checking if a table exists, the
> HoodieHiveClient uses the client variable which is an interface
> IMetaStoreClient, that the AWSCatalogMetastoreClient implements.
> And it works fine.
>
>
> https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L469
>
>
> https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-spark-client/src/main/java/com/amazonaws/glue/catalog/metastore/AWSCatalogMetastoreClient.java
>
> But the createTable of HoodieHiveClient, eventually creates a
> hive.ql.Driver and not uses the AWS client, which eventually gets an
> exception.
>
> So what I would like to know, is am I doing it wrong when trying to
> sync to
> GLUE?
> Or maybe currently Hudi doesn't support updating GLUE catalog without
> some
> code changes?
>
> Best Regards,
> Igor
>
>
>


Re: Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

2020-02-19 Thread Vinoth Chandar
Apologies for the delayed response..

I think SchemaConverters does import things like types and those will be
tied to the spark version. If there are new types, for e.g., our bundled
spark-avro may not recognize them.

import org.apache.spark.sql.catalyst.util.RandomUUIDGenerator
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.Decimal.{maxPrecisionForBytes, minBytesForPrecision}


I also verified that we are bundling avro in the spark-bundle.. So, that
part we are in the clear.

Here is what I suggest.. let's try bundling, in the hope that it works, i.e.
spark does not change types etc often and spark-avro interplays.
But let's have a flag in maven to skip this bundling if need be.. We should
doc this clearly in the build instructions in the README?

What do others think?



On Sat, Feb 15, 2020 at 10:54 PM lamberken  wrote:

>
>
> Hi @Vinoth, sorry for the delay; I wanted to make sure the following analysis is correct.
>
>
> In the hudi project, the spark-avro module is only used for converting between
> spark's struct type and avro schema, via only two methods,
> `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`, both of which
> are in the `org.apache.spark.sql.avro.SchemaConverters` class.
>
>
> Analysis:
> 1, the `SchemaConverters` class is the same in spark master[1] and
> branch-3.0[2].
> 2, from the import statements in `SchemaConverters`, we can see that
> `SchemaConverters` doesn't depend on any other class in the spark-avro module.
> Also, I tried moving it into the hudi project under a different package,
> and it compiles.
>
>
> Use the hudi jar with shaded spark-avro module:
> 1, spark-2.4.4-bin-hadoop2.7, everything is ok(create, upsert)
> 2, spark-3.0.0-preview2-bin-hadoop2.7, everything is ok(create, upsert)
>
>
> So, shading spark-avro is safe and gives a better user experience, and we
> needn't shade it once the spark-avro module is no longer external to the
> spark project.
>
>
> Thanks,
> Lamber-Ken
>
>
> [1]
> https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
> [2]
> https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>
>
>
>
>
>
>
> At 2020-02-14 10:30:35, "Vinoth Chandar"  wrote:
> >Just kicking this thread again, to make forward progress :)
> >
> >On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar  wrote:
> >
> >> First of all.. No apologies, no feeling bad.  We are all having fun
> here..
> >> :)
> >>
> >> I think we are all on the same page on the tradeoffs here.. let's see if
> >> we can decide one way or other.
> >>
> >> Bundling spark-avro has better user experience, one less package to
> >> remember adding. But even with the valid points raised by udit and
> hmatu, I
> >> was just worried about specific things in spark-avro that may not be
> >> compatible with the spark version.. Can someone analyze how coupled
> >> spark-avro is with rest of spark.. For e.g, what if the spark 3.x uses a
> >> different avro version than spark 2.4.4 and when hudi-spark-bundle is
> used
> >> in a spark 3.x cluster, the spark-avro:2.4.4 won't work with that avro
> >> version?
> >>
> >> If someone can provide data points on the above and if we can convince
> >> ourselves that we can bundle a different spark-avro version (even
> >> spark-avro:3.x on spark 2.x cluster), then I am happy to reverse my
> >> position. Otherwise, if we might face a barrage of support issues with
> >> NoClassDefFound /NoSuchMethodError etc, its not worth it IMO ..
> >>
> >> TBH longer term, I am looking into if we can eliminate need for Row ->
> >> Avro conversion that we need spark-avro for. But lets ignore that for
> >> purposes of this discussion.
> >>
> >> Thanks
> >> Vinoth
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Feb 5, 2020 at 10:54 PM hmatu  wrote:
> >>
> >>> Thanks for raising this! +1 to @Udit Mehrotra's point.
> >>>
> >>>
> >>>  It's right that recommend users to actually build their  own hudi
> jars,
> >>> with the spark version they use. It avoid the compatibility issues
> >>>
> >>> between user's local jars and pre-built hudi spark version(2.4.4).
> >>>
> >>> Or can remove "org.apache.spark:spark-avro_2.11:2.4.4"? Because user
> >>> local env will contains that external dependency if they use avro.
> >>>
> >>> If not, to run hudi(release-0.5.1) is more complex for me, when using
> >>> Delta Lake, it's more simpler:
> >>> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> -- Original --
> >>> From: "lamberken" >>> Date: Thu, Feb 6, 2020 07:42 AM
> >>> To: "dev" >>>
> >>> Subject: Re:[DISCUSS] Relocate spark-avro dependency by
> >>> maven-shade-plugin
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Dear team,
> >>>
> >>>
> >>> About this topic, there are some previous discussions in PR[1]. It's
> >>> better to visit it carefully before chiming in, thanks.
> >>>

Re: updatePartitionsToTable() is time consuming and redundant.

2020-02-19 Thread vbal...@apache.org
 
Hi Pratyaksh/Purushotham,
I spent some time this morning trying to reproduce this locally but was unable 
to. There is a unit test, TestHiveSyncTool.testSyncIncremental, which is quite 
close to the setup we need to repro. 
I added the check below and it passed (meaning it works as expected, with no 
unnecessary update-partitions call). Can you use the code below to try 
reproducing it locally, and in the real ecosystem, to see what is happening.
Balaji.V
```
System.out.println("DUPLICATE CHECK");
String commitTime3 = "102";
TestUtil.addCOWPartitions(1, true, dateTime, commitTime3);
hiveClient = new HoodieHiveClient(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
writtenPartitionsSince = hiveClient.getPartitionsWrittenToSince(Option.of(commitTime2));
System.out.println("Added Partitions :" + writtenPartitionsSince);
assertEquals(1, writtenPartitionsSince.size());
hivePartitions = hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName);
partitionEvents = hiveClient.getPartitionEvents(hivePartitions, writtenPartitionsSince);
assertEquals("No partition events", 0, partitionEvents.size());

tool = new HiveSyncTool(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
tool.syncHoodieTable();
// Sync should add the one partition
assertEquals(6, hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
assertEquals("The last commit that was synced should be 102", commitTime3,
    hiveClient.getLastCommitTimeSynced(TestUtil.hiveSyncConfig.tableName).get());
```
On Wednesday, February 19, 2020, 04:08:39 AM PST, Pratyaksh Sharma 
 wrote:  
 
 Hi Balaji,

We are using Hadoop 3.1.0.

Here is the output of the function you wanted to see -

Path is : /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Is Absolute :true
Stripped Path
=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Stripped path does not contain scheme and authority.

On Mon, Feb 17, 2020 at 2:46 AM Balaji Varadarajan
 wrote:

>
> Sorry for the delay. From the logs, it is clear that the stored partition
> key and  lookup key are not exactly same. One has scheme and authority in
> its URI while the other is not. This is the reason why we are updating the
> same partition again.
> Some of the methods used here comes from hadoop-common and related
> packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally.
> I used the below code to try to repro. Which version of Hadoop are you
> using in runtime. Can you  check if the stripped path (see test code below)
> still contains scheme and authority.
>
> ```public void testit() {
>    Path path = new
> Path("s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt"
>        + "=20191117\n");
>    System.out.println("Path is : " + path.toUri().getPath());
>    System.out.println("Is Absolute :" + path.isUriPathAbsolute());
>    String stripped =
> Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
>    System.out.println("Stripped Path =" + stripped);
> }
> ```
> Balaji.V
>
>
>    On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham
> Pushpavanthar  wrote:
>
>  Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
> be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
> 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180108, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180221, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180102, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191007, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191128, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
> 20/0

Re: updatePartitionsToTable() is time consuming and redundant.

2020-02-19 Thread Pratyaksh Sharma
Hi Balaji,

We are using Hadoop 3.1.0.

Here is the output of the function you wanted to see -

Path is : /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Is Absolute :true
Stripped Path
=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Stripped path does not contain scheme and authority.
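
For reference, a quick Scala sketch of the same comparison (the bucket name is
copied from the log lines quoted below, and hadoop-common is assumed on the
classpath) shows where the scheme and authority go missing:

```
import org.apache.hadoop.fs.Path

// Scala version of the testit() check: stripping scheme and authority from the
// full s3a location yields the scheme-less form that matches the "Existing Hive
// Path" in the sync logs, while the computed "New Location" keeps the s3a prefix,
// so the two never compare equal and the partition is "updated" on every sync.
val newLocation = new Path(
  "s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117")
val stripped = Path.getPathWithoutSchemeAndAuthority(newLocation)

println(newLocation) // s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
println(stripped)    // /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
```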

On Mon, Feb 17, 2020 at 2:46 AM Balaji Varadarajan
 wrote:

>
> Sorry for the delay. From the logs, it is clear that the stored partition
> key and the lookup key are not exactly the same. One has scheme and authority in
> its URI while the other does not. This is the reason why we are updating the
> same partition again.
> Some of the methods used here come from hadoop-common and related
> packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally.
> I used the below code to try to repro. Which version of Hadoop are you
> using at runtime? Can you check if the stripped path (see test code below)
> still contains the scheme and authority.
>
> ```public void testit() {
> Path path = new
> Path("s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt"
> + "=20191117\n");
> System.out.println("Path is : " + path.toUri().getPath());
> System.out.println("Is Absolute :" + path.isUriPathAbsolute());
> String stripped =
> Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
> System.out.println("Stripped Path =" + stripped);
> }
> ```
> Balaji.V
>
>
> On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham
> Pushpavanthar  wrote:
>
>  Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
> be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
> 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180108, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180221, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180102, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191007, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191128, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191127, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191006, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191009, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191129, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191008, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191120, Exis