I understand your use case better now too, Raymond! Thanks for the response!
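
For anyone following the thread, here is a rough sketch of the interim approach quoted below (S3 notification on a new commit file, then refresh a manifest). It only shows the detection step and a plain-text manifest layout. The function names and the one-path-per-line manifest format are my own assumptions, not anything AWS or Hudi prescribes, and choosing which parquet files belong to the latest snapshot is left to the caller.

```python
from urllib.parse import unquote_plus


def new_commits(event):
    """Return (bucket, key) pairs for Hudi commit files named in an S3 event."""
    found = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Completed commits show up as <instant>.commit under the .hoodie folder.
        if "/.hoodie/" in "/" + key and key.endswith(".commit"):
            found.append((bucket, key))
    return found


def manifest_text(bucket, data_keys):
    """Format one s3:// path per line (hypothetical manifest layout)."""
    # Assumption: data_keys already holds only the files of the latest snapshot.
    paths = sorted(f"s3://{bucket}/{k}" for k in data_keys if k.endswith(".parquet"))
    return ("\n".join(paths) + "\n") if paths else ""
```

A Lambda handler subscribed to the bucket could call new_commits(event) and, when the result is non-empty, list the table's data files and write manifest_text(...) to wherever Glue reads it from.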

On Fri, Mar 6, 2020 at 7:02 PM Shiyan Xu <[email protected]> wrote:

> I can answer this as my team faces exactly the same problems.
> We recently sync'ed up with AWS EMR team and got some directions.
>
> Hudi dataset <> Glue
> An interim approach is needed: configure S3 notification to detect new
> commit file after each compaction, upon the notification update a manifest
> file for Glue to update
> This is some workaround before Athena officially supports Hudi dataset
>
> Athena support
> This is planned but no definite timeline given. High level approach is
> use Athena Hive external metadata store
> <https://docs.aws.amazon.com/athena/latest/ug/connect-to-data-source-hive.html>
> but Athena needs some changes to adapt to Hudi dataset
>
> The considerations from my team is: the interim approach should work nicely
> but require additional operational efforts.
> We have an alternative plan of using the new feature of Hudi snapshot
> exporter (https://issues.apache.org/jira/browse/HUDI-344) which is about to
> be merged.
> It helps exporting Hudi dataset to plain parquet files and work natively
> with Athena or Glue. We don't have very low latency requirements at the
> moment so periodic export works for us.
> The feature should be available in 6.0 but the class can be used as a
> standalone tool.
>
> On Fri, Mar 6, 2020 at 6:26 PM Sanchez, Jorge
> <[email protected]> wrote:
>
> > Hi Vinoth,
> >
> > Thanks for the reply, our design is to utilize Glue for ETL processing. We
> > would have to support both real time IOT data and batch ETL flows (jdbc
> > source and static files like csv).
> > The access layer would be through the presto cluster which would be
> > running on EC2 within AWS environment.
> >
> > We would like to utilize the historization of the data as it is one of the
> > requirements. My impression is that the Hudi is getting lot of attention
> > from AWS as it is now mainstreamed into EMR, what I don't see is the use
> > cases using the Glue environment - all the documentation mentions the EMR.
> >
> > My questions would be:
> > * how difficult would be to have the Hudi integrated to AWS Glue
> > * is the Glue metadata catalog fully supported for Hudi tables
> > * is the Glue crawler able to crawl and catalog the Hudi tables
> > * is there any plan for the Athena to support access to Hudi tables in the
> >   future
> >
> > I understand that these questions should be addressed to the AWS guys,
> > hoping that there are some of them on this channel.
> >
> > Regards,
> >
> > Jorge
> >
> > -----Original Message-----
> > From: Vinoth Chandar <[email protected]>
> > Sent: Friday, March 6, 2020 6:43 PM
> > To: [email protected]
> > Subject: Re: running Hudi in AWS Glue Spark
> >
> > EXTERNAL EMAIL – Use caution with any links or file attachments.
> >
> > https://aws.amazon.com/emr/features/hudi/ mentions that it's integrated
> > with the glue catalog.
> >
> > It should be similar to other datasources you use on Glue IIUC.. I have
> > seen users talk about this on slack (IIRC)..
> > Are you running into specific issues we can help with? May be the AWS
> > folks here can chime in more?
> >
> > On Fri, Mar 6, 2020 at 3:47 AM Sanchez, Jorge <[email protected]>
> > wrote:
> >
> > > Hello,
> > >
> > > Did anybody try to run Hudi within AWS Glue job, I searched the JIRA
> > > issues but did not find anybody mentioning that.
> > >
> > >
> > > Thanks,
> > >
> > > Jorge
> > > Notice: This e-mail message, together with any attachments, contains
> > > information of Merck & Co., Inc. (2000 Galloping Hill Road,
> > > Kenilworth, New Jersey, USA 07033), and/or its affiliates Direct
> > > contact information for affiliates is available at
> > > http://www.merck.com/contact/contacts.html) that may be confidential,
> > > proprietary copyrighted and/or legally privileged. It is intended
> > > solely for the use of the individual or entity named on this message.
> > > If you are not the intended recipient, and have received this message
> > > in error, please notify us immediately by reply e-mail and then delete
> > > it from your system.
