Filed https://issues.apache.org/jira/browse/HUDI-648 to track error tables.
Please ping on the ticket if anyone is interested in picking it up.

On Fri, Feb 28, 2020 at 4:58 AM Raghvendra Dhar Dubey <[email protected]> wrote:
> Hi Udit,
>
> I tried Hudi version 0.5.1 and it worked fine; this issue appeared with
> Hudi 0.5.0. The other EMR-related issues have been discussed with Rahul.
> Thanks to all of you for the cooperation.
>
> Thanks
> Raghvendra
>
> On Fri, Feb 28, 2020 at 5:34 AM Mehrotra, Udit <[email protected]> wrote:
>
> > Raghvendra,
> >
> > Can you enable TRACE level logging for Hudi on EMR and provide the error
> > logs? For this, go to /etc/spark/conf/log4j.properties and change the
> > logging level of log4j.logger.org.apache.hudi to TRACE. This would help
> > provide the failed records/keys based off
> > https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L287
> >
> > Another thing that would help is to provide the Avro schema that gets
> > printed on the driver when you run your job. We need to understand which
> > field is treated as INT96 and why, because the current parquet-avro does
> > not handle its conversion. Also, for any other questions about EMR, we
> > can discuss them in the meeting you have set up with Rahul from the EMR
> > team.
> >
> > Thanks,
> > Udit
> >
> > On 2/27/20, 11:00 AM, "Shiyan Xu" <[email protected]> wrote:
> >
> >     +1 on the idea. Giving a config like `--error-path` where all failed
> >     conversions are saved provides flexibility for later processing.
> >     SQS/SNS can pick that up later.
> >
> >     On Thu, Feb 27, 2020 at 8:10 AM Vinoth Chandar <[email protected]> wrote:
> >
> > > On the second part, it seems like a question for the EMR folks?
> > >
> > > Hudi's RDD-level APIs do hand the failed records back. Maybe we
> > > should consider writing out the error records somewhere for the
> > > datasource as well? Others, any thoughts?
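[Editor's note: the logging change Udit describes above is a one-line edit. A sketch of the relevant entry, assuming the stock EMR log4j layout; the logger name and file path are taken from the message above:]

```properties
# /etc/spark/conf/log4j.properties on the EMR master node
# Raise Hudi logging to TRACE so failed record keys show up in the logs
log4j.logger.org.apache.hudi=TRACE
```

TRACE is verbose, so it is worth reverting the level to INFO once the failing records have been captured.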
> > > On Mon, Feb 24, 2020 at 10:59 PM Raghvendra Dhar Dubey <[email protected]> wrote:
> > >
> > > > Thanks Gary and Udit,
> > > >
> > > > I tried the HoodieDeltaStreamer for reading parquet files from S3, but
> > > > there is an issue where the AvroSchemaConverter is not able to convert
> > > > Parquet INT96. So I thought to use Spark Structured Streaming to read
> > > > data from S3 and write into Hudi. But as Databricks provides
> > > > "cloudfiles" for failure handling, is there something similar in EMR?
> > > > Or do we need to handle this failure manually by introducing SQS and
> > > > SNS?
> > > >
> > > > On 2020/02/18 20:03:16, "Mehrotra, Udit" <[email protected]> wrote:
> > > > > The workaround provided by Gary can help query Hudi tables through
> > > > > Athena for Copy on Write tables, by basically querying only the
> > > > > latest commit files as standard parquet. It would definitely be
> > > > > worth documenting, as several people have asked for it and I
> > > > > remember providing the same suggestion on Slack earlier. I can add
> > > > > it if I have the perms.
> > > > >
> > > > > >> if I connect to the Hive catalog on EMR, which is able to provide
> > > > > >> the Hudi views correctly, I should be able to get correct results
> > > > > >> on Athena
> > > > >
> > > > > As Vinoth mentioned, just connecting to the metastore is not enough.
> > > > > Athena would still use its own Presto, which does not support Hudi.
> > > > >
> > > > > As for Hudi support in Athena: Athena does use Presto, but it's
> > > > > their own custom version, and I don't think they yet have the code
> > > > > that the Hudi folks contributed to Presto, i.e. the split
> > > > > annotations etc. Also, they don't have the Hudi jars in the Presto
> > > > > classpath. We are not sure of any timelines for this support, but I
> > > > > have heard that work should start soon.
> > > > > Thanks,
> > > > > Udit
> > > > >
> > > > > On 2/18/20, 11:27 AM, "Vinoth Chandar" <[email protected]> wrote:
> > > > >
> > > > >     Thanks everyone for chiming in, especially Gary for the detailed
> > > > >     workaround. (Should we FAQ this workaround? Food for thought.)
> > > > >
> > > > >     >> if I connect to the Hive catalog on EMR, which is able to
> > > > >     >> provide the Hudi views correctly, I should be able to get
> > > > >     >> correct results on Athena
> > > > >
> > > > >     Knowing how the Presto/Hudi integration works, simply being able
> > > > >     to read from the Hive metastore is not enough. Presto has code
> > > > >     to specially recognize Hudi tables and does an additional
> > > > >     filtering step, which lets it query the data in there correctly.
> > > > >     (Gary's workaround above keeps just one version around for a
> > > > >     given file (group).)
> > > > >
> > > > >     On Mon, Feb 17, 2020 at 11:28 PM Gary Li <[email protected]> wrote:
> > > > >
> > > > > > Hello, I don't have any experience working with Athena, but I can
> > > > > > share my experience working with Impala. There is a workaround.
> > > > > > By setting the Hudi configs:
> > > > > >
> > > > > > - hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> > > > > > - hoodie.cleaner.fileversions.retained=1
> > > > > >
> > > > > > you will have your Hudi dataset the same as plain parquet files.
> > > > > > You can create a table just like regular parquet. Hudi will write
> > > > > > a new commit first, then delete the older files that have two
> > > > > > versions. You need to refresh the table metadata store as soon as
> > > > > > the Hudi upsert job finishes. For Impala, it's simply REFRESH
> > > > > > TABLE xxx.
> > > > > > After Hudi has vacuumed the older files and before the table
> > > > > > metastore is refreshed, the table will be unavailable for query
> > > > > > (1-5 mins in my case).
> > > > > >
> > > > > > How can we process S3 parquet files (hourly partitioned) through
> > > > > > Apache Hudi? Is there any streaming layer we need to introduce?
> > > > > > -----------
> > > > > > The Hudi DeltaStreamer supports parquet files. You can do a
> > > > > > bulkInsert for the first job, then use the DeltaStreamer for the
> > > > > > upsert job.
> > > > > >
> > > > > > 3 - What should be the parquet file size and row group size for
> > > > > > better performance on querying the Hudi dataset?
> > > > > > ----------
> > > > > > That depends on the query engine you are using, and it should be
> > > > > > documented somewhere. For Impala, the optimal size for query
> > > > > > performance is 256MB, but a larger file size will make upserts
> > > > > > more expensive. The size I personally choose is 100MB to 128MB.
> > > > > >
> > > > > > Thanks,
> > > > > > Gary
> > > > > >
> > > > > > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu <[email protected]> wrote:
> > > > > >
> > > > > > > Athena is indeed Presto inside, but there is a lot of custom
> > > > > > > code which has gone on top of Presto there.
> > > > > > > A couple of months back I tried running a Glue crawler to
> > > > > > > catalog a Hudi data set and then query it from Athena. The
> > > > > > > results were not the same as what I would get by running the
> > > > > > > same query using Spark SQL on EMR. I did not try Presto on EMR,
> > > > > > > but I assume it will work fine there.
> > > > > > >
> > > > > > > Athena integration with Hudi data sets is planned shortly, but
> > > > > > > I am not sure of the date yet.
> > > > > > > However, Athena recently started supporting integration with a
> > > > > > > Hive catalog apart from Glue. What that means is that if, in
> > > > > > > Athena, I connect to the Hive catalog on EMR, which is able to
> > > > > > > provide the Hudi views correctly, I should be able to get
> > > > > > > correct results on Athena. I have not tested it, though. The
> > > > > > > feature is in Preview already.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Raghu
> > > > > > > -----Original Message-----
> > > > > > > From: Shiyan Xu <[email protected]>
> > > > > > > Sent: Tuesday, February 18, 2020 6:20 AM
> > > > > > > To: [email protected]
> > > > > > > Cc: Mehrotra, Udit <[email protected]>; Raghvendra Dhar Dubey <[email protected]>
> > > > > > > Subject: Re: Apache Hudi on AWS EMR
> > > > > > >
> > > > > > > For 2), I think running Presto on EMR is able to let you run
> > > > > > > read-optimized queries.
> > > > > > > I don't quite understand how exactly Athena does not support
> > > > > > > Hudi, as it is Presto underlying. Perhaps @Udit could give some
> > > > > > > insights from AWS?
> > > > > > >
> > > > > > > As @Raghvendra mentioned, another option is to export the Hudi
> > > > > > > dataset to plain parquet files for Athena to query.
> > > > > > > RFC-9 is for this use case:
> > > > > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
> > > > > > > The task is inactive now. Feel free to pick it up if this is
> > > > > > > something you'd like to work on. I'd be happy to help with that.
> > > > > > >
> > > > > > > On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Raghvendra,
> > > > > > > >
> > > > > > > > Quick sidebar:
> > > > > > > > Please subscribe to the mailing list, so your messages get
> > > > > > > > published automatically. :)
> > > > > > > >
> > > > > > > > On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Udit,
> > > > > > > > >
> > > > > > > > > Thanks for the information.
> > > > > > > > > Actually, I am struggling on the following points:
> > > > > > > > > 1 - How can we process S3 parquet files (hourly partitioned)
> > > > > > > > > through Apache Hudi? Is there any streaming layer we need to
> > > > > > > > > introduce?
> > > > > > > > > 2 - Is there any workaround to query a Hudi dataset from
> > > > > > > > > Athena? We are thinking to dump the resulting Hudi dataset
> > > > > > > > > to S3, and then query it from Athena.
> > > > > > > > > 3 - What should be the parquet file size and row group size
> > > > > > > > > for better performance on querying the Hudi dataset?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Raghvendra
> > > > > > > > >
> > > > > > > > > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Raghvendra,
> > > > > > > > > >
> > > > > > > > > > You would have to re-write your Parquet dataset in Hudi
> > > > > > > > > > format.
> > > > > > > > > > Here are the links you can follow to get started:
> > > > > > > > > >
> > > > > > > > > > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > > > > > > > > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Udit
> > > > > > > > > >
> > > > > > > > > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey" <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > >     Hi Team,
> > > > > > > > > >
> > > > > > > > > >     I want to set up an incremental view of my AWS S3
> > > > > > > > > >     parquet data through Apache Hudi, and want to query
> > > > > > > > > >     this data through Athena, but currently Athena does
> > > > > > > > > >     not support Hudi datasets.
> > > > > > > > > >
> > > > > > > > > >     So there are a few questions which I want to
> > > > > > > > > >     understand here:
> > > > > > > > > >
> > > > > > > > > >     1 - How to stream S3 parquet files to a Hudi dataset
> > > > > > > > > >     running on EMR.
> > > > > > > > > >
> > > > > > > > > >     2 - How to query a Hudi dataset running on EMR.
> > > > > > > > > >
> > > > > > > > > >     Please help me to understand this.
> > > > > > > > > >
> > > > > > > > > >     Thanks
> > > > > > > > > >
> > > > > > > > > >     Raghvendra
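[Editor's note: Gary's Impala workaround earlier in the thread reduces to two writer configs. A sketch of how they might appear in a Hudi properties file; the keys and values are quoted verbatim from the thread, while where you set them (a props file, or equivalent Spark datasource options) depends on your job:]

```properties
# Keep only the latest file version so the dataset on storage reads as
# plain parquet (Gary's workaround for engines without Hudi support)
hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained=1
```

As noted in the thread, the table metadata must be refreshed right after each upsert (e.g. REFRESH TABLE xxx in Impala), since the table is unqueryable in the window between the cleaner deleting old files and the refresh.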
