So it'd buy you the incremental merge, at the additional cost of a file copy of the entire table. Interesting. Hudi-Athena is a hot ask. :)
On Mon, Mar 9, 2020 at 2:39 AM Sanchez, Jorge <[email protected]> wrote:

> Hello,
>
> Thanks for the reply. It is definitely a good idea to export the data via
> the snapshot exporter to plain parquet files and expose them to Athena. I
> will take this to my team to see what they think.
>
> Regards,
>
> Jorge
>
> -----Original Message-----
> From: Vinoth Chandar <[email protected]>
> Sent: Monday, March 9, 2020 7:03 AM
> To: [email protected]
> Subject: Re: running Hudi in AWS Glue Spark
>
> EXTERNAL EMAIL – Use caution with any links or file attachments.
>
> I actually understood more about your use-case also now, Raymond! thanks
> for the response!
>
> On Fri, Mar 6, 2020 at 7:02 PM Shiyan Xu <[email protected]>
> wrote:
>
> > I can answer this as my team faces exactly the same problems.
> > We recently synced up with the AWS EMR team and got some directions.
> >
> > Hudi dataset <> Glue
> > An interim approach is needed: configure an S3 notification to detect
> > the new commit file after each compaction, and upon the notification
> > update a manifest file for Glue to pick up. This is a workaround until
> > Athena officially supports Hudi datasets.
> >
> > Athena support
> > This is planned but no definite timeline was given. The high-level
> > approach is to use the Athena Hive external metadata store
> > <https://docs.aws.amazon.com/athena/latest/ug/connect-to-data-source-hive.html>,
> > but Athena needs some changes to adapt to Hudi datasets.
> >
> > The consideration from my team is: the interim approach should work
> > nicely but requires additional operational effort.
> > We have an alternative plan of using the new Hudi snapshot exporter
> > feature (https://issues.apache.org/jira/browse/HUDI-344), which is
> > about to be merged.
> > It helps export a Hudi dataset to plain parquet files that work
> > natively with Athena or Glue. We don't have very low latency
> > requirements at the moment, so periodic export works for us.
> > The feature should be available in 0.6.0, but the class can be used as
> > a standalone tool.
> >
> > On Fri, Mar 6, 2020 at 6:26 PM Sanchez, Jorge
> > <[email protected]> wrote:
> >
> > > Hi Vinoth,
> > >
> > > Thanks for the reply. Our design is to utilize Glue for ETL
> > > processing. We would have to support both real-time IoT data and
> > > batch ETL flows (JDBC sources and static files like CSV).
> > > The access layer would be through a Presto cluster running on EC2
> > > within the AWS environment.
> > >
> > > We would like to utilize the historization of the data, as it is one
> > > of the requirements. My impression is that Hudi is getting a lot of
> > > attention from AWS, as it is now mainstreamed into EMR. What I don't
> > > see are use cases in the Glue environment: all the documentation
> > > mentions EMR.
> > >
> > > My questions would be:
> > > * how difficult would it be to integrate Hudi with AWS Glue
> > > * is the Glue metadata catalog fully supported for Hudi tables
> > > * is the Glue crawler able to crawl and catalog Hudi tables
> > > * is there any plan for Athena to support access to Hudi tables in
> > >   the future
> > >
> > > I understand that these questions should be addressed to the AWS
> > > guys; I am hoping that some of them are on this channel.
> > >
> > > Regards,
> > >
> > > Jorge
> > >
> > > -----Original Message-----
> > > From: Vinoth Chandar <[email protected]>
> > > Sent: Friday, March 6, 2020 6:43 PM
> > > To: [email protected]
> > > Subject: Re: running Hudi in AWS Glue Spark
> > >
> > > https://aws.amazon.com/emr/features/hudi/ mentions that it's
> > > integrated with the Glue catalog.
> > >
> > > It should be similar to other datasources you use on Glue IIUC. I
> > > have seen users talk about this on Slack (IIRC).
> > > Are you running into specific issues we can help with? Maybe the
> > > AWS folks here can chime in more?
> > >
> > > On Fri, Mar 6, 2020 at 3:47 AM Sanchez, Jorge
> > > <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > Has anybody tried to run Hudi within an AWS Glue job? I searched
> > > > the JIRA issues but did not find anybody mentioning it.
> > > >
> > > > Thanks,
> > > >
> > > > Jorge
> > > > Notice: This e-mail message, together with any attachments,
> > > > contains information of Merck & Co., Inc. (2000 Galloping Hill
> > > > Road, Kenilworth, New Jersey, USA 07033), and/or its affiliates
> > > > Direct contact information for affiliates is available at
> > > > http://www.merck.com/contact/contacts.html) that may be
> > > > confidential, proprietary copyrighted and/or legally privileged.
> > > > It is intended solely for the use of the individual or entity
> > > > named on this message.
> > > > If you are not the intended recipient, and have received this
> > > > message in error, please notify us immediately by reply e-mail and
> > > > then delete it from your system.
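
For anyone trying the interim approach from this thread (an S3 notification on each new commit file, followed by a manifest update), here is a minimal Python sketch of the two pure pieces of logic. It is not Hudi or AWS code. The function names `new_commit_instants` and `build_manifest` are hypothetical, and the "one s3:// path per line" manifest layout is an assumption to be checked against whatever Glue or Athena actually expects. It relies only on two things: completed Hudi commits show up as `<instant>.commit` files under the table's `.hoodie/` folder, and S3 event notifications carry URL-encoded object keys under `Records[].s3.object.key`. Working out which parquet files make up the latest snapshot is deliberately left to the caller.

```python
from urllib.parse import unquote_plus


def new_commit_instants(event: dict, table_prefix: str) -> list[str]:
    """Return the Hudi commit instants named in an S3 event notification.

    Only keys of the form <table_prefix>/.hoodie/<instant>.commit count;
    data files and inflight markers are ignored.
    """
    marker = table_prefix.rstrip("/") + "/.hoodie/"
    suffix = ".commit"
    instants = []
    for record in event.get("Records", []):
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not (key.startswith(marker) and key.endswith(suffix)):
            continue
        name = key[len(marker):]
        if "/" in name:
            # Skip anything nested deeper inside .hoodie/.
            continue
        instants.append(name[: -len(suffix)])
    return instants


def build_manifest(bucket: str, data_file_keys: list[str]) -> str:
    """Render a manifest as one s3:// path per line (assumed layout).

    The caller supplies the keys of the parquet files in the latest
    snapshot; non-parquet keys are dropped.
    """
    paths = [
        f"s3://{bucket}/{key}"
        for key in sorted(data_file_keys)
        if key.endswith(".parquet")
    ]
    return "".join(path + "\n" for path in paths)
```

A notification handler would call `new_commit_instants` on the incoming event and, only if it returns something, list the table's current data files and write the output of `build_manifest` to wherever the catalog reads it from. Keeping these two steps free of AWS calls makes them easy to unit test before wiring them into a Lambda or a Glue job.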
