Hello,

Thanks for the reply. It is definitely a good idea to export the data via the snapshot exporter to plain parquet files and expose them to Athena. I will take this to my team to see what they think.

Regards,

Jorge

-----Original Message-----
From: Vinoth Chandar <[email protected]>
Sent: Monday, March 9, 2020 7:03 AM
To: [email protected]
Subject: Re: running Hudi in AWS Glue Spark

EXTERNAL EMAIL – Use caution with any links or file attachments.

I actually understand more about your use case now too, Raymond! Thanks for the response!

On Fri, Mar 6, 2020 at 7:02 PM Shiyan Xu <[email protected]> wrote:

> I can answer this as my team faces exactly the same problems.
> We recently synced up with the AWS EMR team and got some directions.
>
> Hudi dataset <> Glue
> An interim approach is needed: configure an S3 notification to detect the
> new commit file after each compaction; upon the notification, update a
> manifest file for Glue to pick up. This is a workaround until Athena
> officially supports Hudi datasets.
>
> Athena support
> This is planned, but no definite timeline was given. The high-level
> approach is to use the Athena Hive external metadata store
> <https://docs.aws.amazon.com/athena/latest/ug/connect-to-data-source-hive.html>,
> but Athena needs some changes to adapt to Hudi datasets.
>
> The consideration from my team is: the interim approach should work
> nicely but requires additional operational effort.
> We have an alternative plan of using the new Hudi snapshot exporter
> feature (https://issues.apache.org/jira/browse/HUDI-344), which is
> about to be merged.
> It helps export a Hudi dataset to plain parquet files that work
> natively with Athena or Glue. We don't have very low latency
> requirements at the moment, so periodic export works for us.
> The feature should be available in 0.6.0, but the class can be used as a
> standalone tool.
>
> On Fri, Mar 6, 2020 at 6:26 PM Sanchez, Jorge
> <[email protected]> wrote:
>
> > Hi Vinoth,
> >
> > Thanks for the reply. Our design is to utilize Glue for ETL processing.
> > We would have to support both real-time IoT data and batch ETL flows
> > (JDBC sources and static files like CSV).
> > The access layer would be through a Presto cluster running on EC2
> > within the AWS environment.
> >
> > We would like to utilize the historization of the data, as it is one
> > of the requirements. My impression is that Hudi is getting a lot of
> > attention from AWS, as it is now mainstreamed into EMR. What I don't
> > see are use cases using the Glue environment: all the
> > documentation mentions EMR.
> >
> > My questions would be:
> > * how difficult would it be to have Hudi integrated into AWS Glue
> > * is the Glue metadata catalog fully supported for Hudi tables
> > * is the Glue crawler able to crawl and catalog Hudi tables
> > * is there any plan for Athena to support access to Hudi tables
> >   in the future
> >
> > I understand that these questions should be addressed to the AWS
> > guys, hoping that there are some of them on this channel.
> >
> > Regards,
> >
> > Jorge
> >
> > -----Original Message-----
> > From: Vinoth Chandar <[email protected]>
> > Sent: Friday, March 6, 2020 6:43 PM
> > To: [email protected]
> > Subject: Re: running Hudi in AWS Glue Spark
> >
> > EXTERNAL EMAIL – Use caution with any links or file attachments.
> >
> > https://aws.amazon.com/emr/features/hudi/ mentions that it's
> > integrated with the Glue catalog.
> >
> > It should be similar to other datasources you use on Glue IIUC. I
> > have seen users talk about this on Slack (IIRC).
> > Are you running into specific issues we can help with? Maybe the
> > AWS folks here can chime in more?
> >
> > On Fri, Mar 6, 2020 at 3:47 AM Sanchez, Jorge
> > <[email protected]>
> > wrote:
> >
> > > Hello,
> > >
> > > Did anybody try to run Hudi within an AWS Glue job? I searched the
> > > JIRA issues but did not find anybody mentioning that.
> > >
> > >
> > > Thanks,
> > >
> > > Jorge
> > > Notice: This e-mail message, together with any attachments,
> > > contains information of Merck & Co., Inc.
> > > (2000 Galloping Hill Road, Kenilworth, New Jersey, USA 07033),
> > > and/or its affiliates Direct contact information for affiliates is
> > > available at http://www.merck.com/contact/contacts.html) that may be
> > > confidential, proprietary copyrighted and/or legally privileged.
> > > It is intended solely for the use of the individual or entity named on
> > > this message.
> > > If you are not the intended recipient, and have received this
> > > message in error, please notify us immediately by reply e-mail and
> > > then delete it from your system.
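
For reference, the interim approach described in the thread (an S3 notification on each new commit file, followed by a manifest refresh for Glue) could be sketched roughly as below. This is only an illustration under stated assumptions, not something posted by the participants: the function names `commits_in_event` and `build_manifest` are hypothetical, the manifest is assumed to be a plain newline-separated list of `s3://` paths, and the caller is assumed to already know which parquet files are the latest ones for the table. The only parts taken as given are the standard S3 event layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`) and Hudi's convention of writing completed commits as `.hoodie/<instant>.commit` under the table base path.

```python
import re
from typing import Iterable, List
from urllib.parse import unquote_plus

# Completed Hudi commits appear as "<base path>/.hoodie/<instant>.commit".
COMMIT_KEY = re.compile(r"^(?:(?P<base>.+)/)?\.hoodie/(?P<instant>\d+)\.commit$")


def commits_in_event(event: dict) -> List[dict]:
    """Return one entry per S3 record whose key looks like a completed Hudi commit.

    Hypothetical helper: a notification handler would call this first and
    only refresh the manifest when the result is non-empty.
    """
    found = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Object keys in S3 notifications are URL-encoded.
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        match = COMMIT_KEY.match(key)
        if bucket and match:
            found.append(
                {
                    "bucket": bucket,
                    "base_path": match.group("base") or "",
                    "instant": match.group("instant"),
                }
            )
    return found


def build_manifest(bucket: str, data_file_keys: Iterable[str]) -> str:
    """Render a newline-separated manifest of s3:// paths to parquet files.

    Assumption: data_file_keys already contains only the latest file
    versions; this sketch does not resolve Hudi file groups itself.
    """
    paths = sorted(
        f"s3://{bucket}/{key}" for key in data_file_keys if key.endswith(".parquet")
    )
    return "\n".join(paths) + ("\n" if paths else "")
```

Listing the current data files and writing the manifest back to S3 are left out on purpose, since how the latest file versions are selected depends on the table type and is the part that carries the operational effort mentioned in the thread.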
