I can answer this, as my team faces exactly the same problems.
We recently synced up with the AWS EMR team and got some directions.

Hudi dataset <> Glue
An interim approach is needed: configure an S3 notification to detect the new
commit file after each compaction and, upon the notification, update a manifest
file for Glue to pick up.
This is a workaround until Athena officially supports Hudi datasets.
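
For illustration, the detection half of this workaround could look roughly
like the Python sketch below. It is not something the EMR team gave us. It
assumes the standard S3 event notification payload (Records -> s3 -> bucket /
object) and that a completed commit shows up as a "<instant>.commit" file
directly under the table's ".hoodie" folder. The function names and the
one-path-per-line manifest layout are made up for the example; the manifest
format Glue actually needs may differ.

```python
import posixpath
from urllib.parse import unquote_plus


def is_completed_commit(key: str) -> bool:
    """True if the key looks like a completed commit file on a Hudi timeline."""
    parent, name = posixpath.split(key)
    # Skips in-flight/requested instants and delta commits, which end differently.
    return posixpath.basename(parent) == ".hoodie" and name.endswith(".commit")


def tables_with_new_commits(event: dict) -> set:
    """Return the s3:// base paths of tables that received a new commit."""
    tables = set()
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if is_completed_commit(key):
            # The table base path is the parent of the .hoodie folder.
            base = posixpath.dirname(posixpath.dirname(key))
            tables.add(f"s3://{bucket}/{base}")
    return tables


def build_manifest(data_files) -> str:
    """Hypothetical manifest layout (an assumption): one data file per line."""
    return "\n".join(sorted(data_files)) + "\n"
```

Listing the latest parquet files for each returned table and writing the
manifest back to S3 is left out on purpose, since that part depends on how the
Glue table is defined.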

Athena support
This is planned, but no definite timeline was given. The high-level approach
is to use Athena's Hive external metadata store
<https://docs.aws.amazon.com/athena/latest/ug/connect-to-data-source-hive.html>,
but Athena needs some changes to adapt to Hudi datasets.

My team's view is that the interim approach should work nicely but requires
additional operational effort.
We have an alternative plan: use the new Hudi snapshot exporter feature
(https://issues.apache.org/jira/browse/HUDI-344), which is about to be
merged.
It exports a Hudi dataset to plain parquet files, which work natively
with Athena or Glue. We don't have very low latency requirements at the
moment, so a periodic export works for us.
The feature should be available in Hudi 0.6.0, but the class can also be used
as a standalone tool.

On Fri, Mar 6, 2020 at 6:26 PM Sanchez, Jorge
<[email protected]> wrote:

> Hi Vinoth,
>
> Thanks for the reply. Our design is to use Glue for ETL processing. We
> would have to support both real-time IoT data and batch ETL flows (JDBC
> sources and static files such as CSV).
> The access layer would be a Presto cluster running on EC2 within the AWS
> environment.
>
> We would like to make use of the historization of the data, as it is one of
> the requirements. My impression is that Hudi is getting a lot of attention
> from AWS now that it is part of EMR. What I don't see is any use case
> running in the Glue environment; all the documentation mentions EMR.
>
> My questions would be:
> * How difficult would it be to integrate Hudi with AWS Glue?
> * Is the Glue metadata catalog fully supported for Hudi tables?
> * Is the Glue crawler able to crawl and catalog Hudi tables?
> * Is there any plan for Athena to support access to Hudi tables in the
> future?
>
> I understand that these questions should be addressed to the AWS folks;
> I am hoping that some of them are on this channel.
>
> Regards,
>
> Jorge
>
> -----Original Message-----
> From: Vinoth Chandar <[email protected]>
> Sent: Friday, March 6, 2020 6:43 PM
> To: [email protected]
> Subject: Re: running Hudi in AWS Glue Spark
>
> https://aws.amazon.com/emr/features/hudi/ mentions that it's integrated
> with the Glue catalog.
>
> It should be similar to other data sources you use on Glue, IIUC. I have
> seen users talk about this on Slack (IIRC).
> Are you running into specific issues we can help with? Maybe the AWS
> folks here can chime in more?
>
> On Fri, Mar 6, 2020 at 3:47 AM Sanchez, Jorge 
> <[email protected]>
> wrote:
>
> > Hello,
> >
> > Has anybody tried to run Hudi within an AWS Glue job? I searched the JIRA
> > issues but did not find anybody mentioning that.
> >
> >
> > Thanks,
> >
> > Jorge
> > Notice:  This e-mail message, together with any attachments, contains
> > information of Merck & Co., Inc. (2000 Galloping Hill Road,
> > Kenilworth, New Jersey, USA 07033), and/or its affiliates Direct
> > contact information for affiliates is available at
> > http://www.merck.com/contact/contacts.html) that may be confidential,
> > proprietary copyrighted and/or legally privileged. It is intended
> > solely for the use of the individual or entity named on this message.
> > If you are not the intended recipient, and have received this message
> > in error, please notify us immediately by reply e-mail and then delete
> > it from your system.
> >
>
