So it would buy you the incremental merge, at the additional cost of a file
copy of the entire table. Interesting.
Hudi-Athena is a hot ask. :)

On Mon, Mar 9, 2020 at 2:39 AM Sanchez, Jorge
<[email protected]> wrote:

> Hello,
>
> Thanks for the reply. It is definitely a good idea to export the data via
> the snapshot exporter to plain Parquet files and expose them to Athena. I
> will take this to my team to see what they think.
>
> Regards,
>
> Jorge
>
> -----Original Message-----
> From: Vinoth Chandar <[email protected]>
> Sent: Monday, March 9, 2020 7:03 AM
> To: [email protected]
> Subject: Re: running Hudi in AWS Glue Spark
>
> EXTERNAL EMAIL – Use caution with any links or file attachments.
>
> I understand your use case better now too, Raymond! Thanks
> for the response!
>
> On Fri, Mar 6, 2020 at 7:02 PM Shiyan Xu <[email protected]>
> wrote:
>
> > I can answer this, as my team faces exactly the same problems.
> > We recently synced up with the AWS EMR team and got some direction.
> >
> > Hudi dataset <> Glue
> > An interim approach is needed: configure an S3 notification to detect the
> > new commit file after each compaction; upon the notification, update a
> > manifest file for Glue to pick up. This is a workaround until Athena
> > officially supports Hudi datasets.
> >
> > Athena support
> > This is planned, but no definite timeline was given. The high-level
> > approach is to use Athena's external Hive metadata store
> > <https://docs.aws.amazon.com/athena/latest/ug/connect-to-data-source-hive.html>,
> > but Athena needs some changes to adapt to Hudi datasets.
> >
> > My team's considerations are as follows: the interim approach should work
> > nicely but requires additional operational effort.
> > We have an alternative plan of using the new Hudi snapshot exporter
> > feature (https://issues.apache.org/jira/browse/HUDI-344), which is
> > about to be merged.
> > It exports a Hudi dataset to plain Parquet files that work
> > natively with Athena or Glue. We don't have very low latency
> > requirements at the moment, so periodic export works for us.
> > The feature should be available in 0.6.0, but the class can be used as a
> > standalone tool.
> >
> > On Fri, Mar 6, 2020 at 6:26 PM Sanchez, Jorge
> > <[email protected]> wrote:
> >
> > > Hi Vinoth,
> > >
> > > Thanks for the reply. Our design is to utilize Glue for ETL processing.
> > > We would have to support both real-time IoT data and batch ETL flows
> > > (JDBC sources and static files such as CSV).
> > > The access layer would be through a Presto cluster running on EC2
> > > within the AWS environment.
> > >
> > > We would like to utilize the historization of the data, as it is one of
> > > the requirements. My impression is that Hudi is getting a lot of
> > > attention from AWS now that it is mainstreamed into EMR. What I don't
> > > see are use cases that use the Glue environment; all the
> > > documentation mentions EMR.
> > >
> > > My questions would be:
> > > * how difficult would it be to integrate Hudi with AWS Glue?
> > > * is the Glue metadata catalog fully supported for Hudi tables?
> > > * is the Glue crawler able to crawl and catalog Hudi tables?
> > > * is there any plan for Athena to support access to Hudi tables in the
> > > future?
> > >
> > > I understand that these questions should be addressed to the AWS
> > > guys; I am hoping that some of them are on this channel.
> > >
> > > Regards,
> > >
> > > Jorge
> > >
> > > -----Original Message-----
> > > From: Vinoth Chandar <[email protected]>
> > > Sent: Friday, March 6, 2020 6:43 PM
> > > To: [email protected]
> > > Subject: Re: running Hudi in AWS Glue Spark
> > >
> > > https://aws.amazon.com/emr/features/hudi/ mentions that it's
> > > integrated with the Glue catalog.
> > >
> > > It should be similar to other datasources you use on Glue, IIUC. I
> > > have seen users talk about this on Slack (IIRC).
> > > Are you running into specific issues we can help with? Maybe the
> > > AWS folks here can chime in more?
> > >
> > > On Fri, Mar 6, 2020 at 3:47 AM Sanchez, Jorge
> > > <[email protected]>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > Has anybody tried to run Hudi within an AWS Glue job? I searched the
> > > > JIRA issues but did not find anybody mentioning it.
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Jorge
> > > > Notice:  This e-mail message, together with any attachments,
> > > > contains information of Merck & Co., Inc. (2000 Galloping Hill
> > > > Road, Kenilworth, New Jersey, USA 07033), and/or its affiliates
> > > > Direct contact information for affiliates is available at
> > > > http://www.merck.com/contact/contacts.html) that may be
> > > > confidential, proprietary copyrighted and/or legally privileged.
> > > > It is intended solely for the use of the individual or entity named
> > > > on this message.
> > > > If you are not the intended recipient, and have received this
> > > > message in error, please notify us immediately by reply e-mail and
> > > > then delete it from your system.
> > > >
> >
>
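The interim approach described in the thread (an S3 notification fires on a new commit file, then a manifest for Glue is refreshed) could be sketched roughly as below. This is only an illustration under stated assumptions, not how Hudi, Glue, or the AWS EMR team implement it. It assumes commit markers are named <instantTime>.commit under .hoodie/, base files are named <fileId>_<writeToken>_<instantTime>.parquet, and the manifest is a plain list of S3 URIs, one per line. All function names are hypothetical, and listing the bucket and writing the manifest are left out.

```python
import re
from typing import Dict, Iterable, List, Tuple
from urllib.parse import unquote_plus

# Assumed Hudi naming conventions; verify against your Hudi version.
COMMIT_KEY = re.compile(r"(?:^|/)\.hoodie/(\d+)\.commit$")
BASE_FILE = re.compile(
    r"^(?P<file_id>[^_]+)_(?P<token>[^_]+)_(?P<instant>\d+)\.parquet$"
)


def committed_instants(event: dict) -> List[str]:
    """Return the instant times of commit files named in an S3 event."""
    instants = []
    for record in event.get("Records", []):
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        match = COMMIT_KEY.search(key)
        if match:
            instants.append(match.group(1))
    return instants


def latest_base_files(keys: Iterable[str], max_instant: str) -> List[str]:
    """Pick the newest base file per (partition, file id) up to max_instant.

    Simplification: every instant <= max_instant is treated as committed.
    Instants are fixed-width timestamps, so string comparison orders them.
    """
    newest: Dict[Tuple[str, str], Tuple[str, str]] = {}
    for key in keys:
        partition, _, name = key.rpartition("/")
        match = BASE_FILE.match(name)
        if not match or match.group("instant") > max_instant:
            continue  # not a base file, or newer than the notified commit
        group = (partition, match.group("file_id"))
        instant = match.group("instant")
        if group not in newest or instant > newest[group][0]:
            newest[group] = (instant, key)
    return sorted(key for _, key in newest.values())


def build_manifest(bucket: str, keys: Iterable[str], max_instant: str) -> str:
    """Render the manifest body: one S3 URI per line."""
    files = latest_base_files(keys, max_instant)
    return "".join(f"s3://{bucket}/{key}\n" for key in files)
```

A notification handler would call committed_instants on the incoming event, list the table's keys, and write the result of build_manifest wherever the Glue table expects it. As noted in the thread, this adds operational effort compared with a periodic export to plain Parquet.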
