RE: Apache Hudi on AWS EMR

Dubey, Raghu Mon, 17 Feb 2020 21:47:14 -0800

Athena is indeed Presto inside, but a lot of custom code has gone on top of
Presto there.
A couple of months back I tried running a Glue crawler to catalog a Hudi
dataset and then query it from Athena. The results were not the same as what I
get when running the same query using Spark SQL on EMR. I did not try Presto on
EMR, but I assume it will work fine there.

Athena integration with Hudi datasets is planned shortly, but I am not sure of
the date yet.

However, Athena recently started supporting integration with a Hive catalog in
addition to Glue. This means that if I connect Athena to the Hive catalog on
EMR, which is able to provide the Hudi views correctly, I should be able to get
correct results in Athena. I have not tested it, though. The feature is already
in Preview.

Thanks
Raghu
-----Original Message-----
From: Shiyan Xu <[email protected]> 
Sent: Tuesday, February 18, 2020 6:20 AM
To: [email protected]
Cc: Mehrotra, Udit <[email protected]>; Raghvendra Dhar Dubey 
<[email protected]>
Subject: Re: Apache Hudi on AWS EMR

For 2) I think running Presto on EMR lets you run read-optimized queries.
I don't quite understand why exactly Athena does not support Hudi, given that
it is Presto underneath.
Perhaps @Udit could give some insights from AWS?

As you mentioned, @Raghvendra, another option is to export the Hudi dataset to
plain Parquet files for Athena to query.
RFC-9 is for this use case:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
The task is inactive now. Feel free to pick it up if this is something you'd
like to work on. I'd be happy to help with that.
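
A rough PySpark sketch of the export idea, in case it helps. It is not the
RFC-9 exporter, only a minimal illustration that assumes the Hudi Spark bundle
is on the classpath. The function names and both S3 paths are hypothetical.
It reads the dataset through the Hudi datasource, drops Hudi's bookkeeping
columns, and writes plain Parquet that Athena can query:

```python
# Bookkeeping columns that Hudi adds to every record it writes.
HUDI_META_COLUMNS = (
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
)


def data_columns(columns):
    """Return the columns that are not Hudi metadata, preserving order."""
    return [c for c in columns if c not in HUDI_META_COLUMNS]


def export_snapshot(spark, hudi_glob, output_path):
    """Read a Hudi dataset and write it back out as plain Parquet.

    hudi_glob and output_path are hypothetical locations, for example
    "s3://my-bucket/hudi/events/*/*" and "s3://my-bucket/export/events".
    """
    df = spark.read.format("org.apache.hudi").load(hudi_glob)
    plain = df.select(*data_columns(df.columns))
    plain.write.mode("overwrite").parquet(output_path)
```

The export is a point-in-time copy, so it has to be re-run whenever Athena
needs to see newer commits.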

On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <[email protected]> wrote:

> Hi Raghvendra,
>
> Quick sidebar: please subscribe to the mailing list, so your messages
> get published automatically. :)
>
> On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey 
> <[email protected]> wrote:
>
> > Hi Udit,
> >
> > Thanks for the information.
> > Actually, I am struggling with the following points:
> > 1 - How can we process S3 Parquet files (hourly partitioned) through
> > Apache Hudi? Is there any streaming layer we need to introduce?
> > 2 - Is there any workaround to query a Hudi dataset from Athena? We are
> > thinking of dumping the resulting Hudi dataset to S3 and then querying
> > it from Athena.
> > 3 - What should the Parquet file size and row group size be for better
> > performance when querying a Hudi dataset?
> >
> > Thanks
> > Raghvendra
> >
> >
> > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <[email protected]> wrote:
> >
> > > Hi Raghvendra,
> > >
> > > You would have to re-write your Parquet dataset in Hudi format.
> > > Here are the links you can follow to get started:
> > >
> > > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > >
> > > Thanks,
> > > Udit
> > >
> > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > > <[email protected]> wrote:
> > >
> > >     Hi Team,
> > >
> > >     I want to set up an incremental view of my AWS S3 Parquet data
> > >     through Apache Hudi, and want to query this data through Athena,
> > >     but Athena currently does not support Hudi datasets.
> > >
> > >     So there are a few questions I want to understand here:
> > >
> > >     1 - How to stream S3 Parquet files to a Hudi dataset running on EMR.
> > >
> > >     2 - How to query a Hudi dataset running on EMR.
> > >
> > >     Please help me to understand this.
> > >
> > >     Thanks
> > >
> > >     Raghvendra
> > >
> > >
> > >
> >
>
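
For reference, the re-write step Udit describes in the thread above (loading
the existing Parquet data and saving it in Hudi format) could look roughly
like the PySpark sketch below. It assumes the Hudi Spark bundle is available
on the cluster. The table name, field names, and paths are hypothetical and
must be replaced with ones from the actual dataset. The option keys are the
standard Hudi datasource write options.

```python
def hudi_write_options(table_name, record_key, partition_field,
                       precombine_field, operation="bulk_insert"):
    """Build Hudi Spark datasource write options as a plain dict."""
    if operation not in ("insert", "upsert", "bulk_insert"):
        raise ValueError("unsupported Hudi write operation: %s" % operation)
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }


def rewrite_parquet_as_hudi(spark, source_path, target_path, options,
                            mode="overwrite"):
    """Read plain Parquet and save it as a Hudi dataset.

    source_path and target_path are hypothetical S3 locations. Use
    mode="overwrite" for the initial load and mode="append" for later
    batches written with the "upsert" operation.
    """
    df = spark.read.parquet(source_path)
    df.write.format("org.apache.hudi").options(**options).mode(mode).save(target_path)
```

A typical call would build the options once, for example
hudi_write_options("events", "event_id", "event_hour", "updated_at"), run the
initial load with the default bulk_insert, and then process each new hourly
partition with operation="upsert" and mode="append".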
