Re: Apache Hudi on AWS EMR

Mehrotra, Udit Tue, 18 Feb 2020 12:04:07 -0800

The workaround Gary provided can help with querying Copy On Write Hudi tables
through Athena, essentially by querying only the latest committed files as
standard parquet. It would definitely be worth documenting, as several people
have asked for it, and I remember making the same suggestion on Slack
earlier. I can add it if I have the permissions.
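
For reference, here is a minimal sketch of how the two cleaner settings from
Gary's workaround could be collected as writer options. The helper name is
hypothetical, and the commented-out Spark write assumes a DataFrame `df` and a
`base_path` that are not defined here. It only illustrates the config, not a
tested Athena setup.

```python
# Sketch only: collects the two cleaner settings from Gary's workaround.
# The helper name and the commented-out PySpark call are illustrative
# assumptions, not something taken from the Hudi docs.

def single_version_cleaner_options(base_options=None):
    """Return writer options that retain one file version per file group."""
    options = dict(base_options or {})
    options.update({
        "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
        "hoodie.cleaner.fileversions.retained": "1",
    })
    return options


if __name__ == "__main__":
    opts = single_version_cleaner_options({"hoodie.table.name": "my_table"})
    for key in sorted(opts):
        print(f"{key}={opts[key]}")
    # With a Spark DataFrame `df` (not defined here), the options would be
    # passed along with the usual Hudi write options, roughly like:
    # df.write.format("org.apache.hudi").options(**opts).mode("append").save(base_path)
```

The table on the query side still has to be refreshed after each upsert job,
as Gary describes below.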


>> if I connect to the Hive catalog on EMR, which is able to provide the
>> Hudi views correctly, I should be able to get correct results on Athena

As Vinoth mentioned, just connecting to the metastore is not enough. Athena
would still use its own Presto, which does not support Hudi.

As for Hudi support in Athena:
Athena does use Presto, but it is their own custom version, and I don't think
it has the code that the Hudi community contributed to Presto yet, i.e. the
split annotations etc. They also don't have the Hudi jars on the Presto
classpath. We are not sure of any timelines for this support, but I have heard
that work should start soon.

Thanks,
Udit

On 2/18/20, 11:27 AM, "Vinoth Chandar" <vin...@apache.org> wrote:

    Thanks everyone for chiming in, especially Gary for the detailed
    workaround. (Should we add this workaround to the FAQ? Food for thought.)
    
    >> if I connect to the Hive catalog on EMR, which is able to provide the
    Hudi views correctly, I should be able to get correct results on Athena
    
    Knowing how the Presto/Hudi integration works, simply being able to read
    from the Hive metastore is not enough. Presto has code that specifically
    recognizes Hudi tables and performs an additional filtering step, which
    lets it query the data in there correctly. (Gary's workaround above keeps
    just one version around for a given file group.)
    
    On Mon, Feb 17, 2020 at 11:28 PM Gary Li <yanjia.gary...@gmail.com> wrote:
    
    > Hello, I don't have any experience working with Athena but I can share my
    > experience working with Impala. There is a workaround.
    > By setting Hudi config:
    >
    >    - hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
    >    - hoodie.cleaner.fileversions.retained=1
    >
    > you will have a Hudi dataset that looks the same as plain parquet files.
    > You can create a table over it just like regular parquet. Hudi writes
    > the new commit first, then deletes the older files for any file group
    > that has two versions. You need to refresh the table metadata as soon as
    > the Hudi upsert job finishes. For Impala, it's simply REFRESH xxx. After
    > Hudi has cleaned up the older files and before the table metadata is
    > refreshed, the table will be unavailable for queries (1-5 mins in my
    > case).
    >
    > 1 - How can we process S3 parquet files (hourly partitioned) through
    > Apache Hudi? Is there any streaming layer we need to introduce?
    > -----------
    > Hudi DeltaStreamer supports parquet files. You can do a bulk insert for
    > the first job, then use DeltaStreamer for the upsert jobs.
    >
    > 3 - What should be the parquet file size and row group size for better
    > performance on querying Hudi Dataset?
    > ----------
    > That depends on the query engine you are using, and it should be
    > documented somewhere. For Impala, the optimal file size for query
    > performance is 256MB, but a larger file size makes upserts more
    > expensive. The size I personally choose is 100MB to 128MB.
    >
    > Thanks,
    > Gary
    >
    >
    >
    > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu <raghu...@amazon.com.invalid>
    > wrote:
    >
    > > Athena is indeed Presto inside, but there is a lot of custom code that
    > > has gone on top of Presto there.
    > > A couple of months back I tried running a Glue crawler to catalog a
    > > Hudi data set and then querying it from Athena. The results were not
    > > the same as what I would get running the same query using Spark SQL on
    > > EMR. I did not try Presto on EMR, but I assume it will work fine there.
    > >
    > > Athena integration with Hudi data sets is planned shortly, but I am not
    > > sure of the date yet.
    > >
    > > However, Athena recently started supporting integration with a Hive
    > > catalog apart from Glue. What that means is that in Athena, if I
    > > connect to the Hive catalog on EMR, which is able to provide the Hudi
    > > views correctly, I should be able to get correct results on Athena. I
    > > have not tested it though. The feature is in Preview already.
    > >
    > > Thanks
    > > Raghu
    > > -----Original Message-----
    > > From: Shiyan Xu <xu.shiyan.raym...@gmail.com>
    > > Sent: Tuesday, February 18, 2020 6:20 AM
    > > To: dev@hudi.apache.org
    > > Cc: Mehrotra, Udit <udi...@amazon.com>; Raghvendra Dhar Dubey
    > > <raghvendra.d.du...@delhivery.com.invalid>
    > > Subject: Re: Apache Hudi on AWS EMR
    > >
    > > For 2) I think running Presto on EMR lets you run read-optimized
    > > queries.
    > > I don't quite understand why exactly Athena does not support Hudi,
    > > given that it is Presto underneath.
    > > Perhaps @Udit could give some insights from AWS?
    > >
    > > As you mentioned, @Raghvendra, another option is to export the Hudi
    > > dataset to plain parquet files for Athena to query.
    > > RFC-9 is for this use case:
    > >
    > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
    > >
    > > The task is inactive now. Feel free to pick it up if this is something
    > > you'd like to work on. I'd be happy to help with that.
    > >
    > >
    > > On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <vin...@apache.org>
    > > wrote:
    > >
    > > > Hi Raghvendra,
    > > >
    > > > Quick sidebar: please subscribe to the mailing list, so your messages
    > > > get published automatically. :)
    > > >
    > > > On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
    > > > <raghvendra.d.du...@delhivery.com.invalid> wrote:
    > > >
    > > > > Hi Udit,
    > > > >
    > > > > Thanks for the information.
    > > > > Actually, I am struggling with the following points:
    > > > > 1 - How can we process S3 parquet files (hourly partitioned) through
    > > > > Apache Hudi? Is there any streaming layer we need to introduce?
    > > > > 2 - Is there any workaround to query a Hudi dataset from Athena? We
    > > > > are thinking of dumping the resulting Hudi dataset to S3 and then
    > > > > querying it from Athena.
    > > > > 3 - What should be the parquet file size and row group size for
    > > > > better performance on querying a Hudi dataset?
    > > > >
    > > > > Thanks
    > > > > Raghvendra
    > > > >
    > > > >
    > > > > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <udi...@amazon.com>
    > > > > wrote:
    > > > >
    > > > > > Hi Raghvendra,
    > > > > >
    > > > > > You would have to rewrite your Parquet dataset in Hudi format.
    > > > > > Here are the links you can follow to get started:
    > > > > >
    > > > > > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
    > > > > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
    > > > > >
    > > > > > Thanks,
    > > > > > Udit
    > > > > >
    > > > > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
    > > > > > <raghvendra.d.du...@delhivery.com.INVALID> wrote:
    > > > > >
    > > > > >     Hi Team,
    > > > > >
    > > > > >     I want to set up an incremental view of my AWS S3 parquet data
    > > > > >     through Apache Hudi, and want to query this data through
    > > > > >     Athena, but Athena currently does not support Hudi datasets.
    > > > > >
    > > > > >     So there are a few questions I want to understand here:
    > > > > >
    > > > > >     1 - How to stream S3 parquet files to a Hudi dataset running
    > > > > >     on EMR.
    > > > > >
    > > > > >     2 - How to query Hudi Dataset running on EMR
    > > > > >
    > > > > >     Please help me to understand this.
    > > > > >
    > > > > >     Thanks
    > > > > >
    > > > > >     Raghvendra
    > > > > >
    > > > > >
    > > > > >
    > > > >
    > > >
    > >
    >
