Re: Apache Hudi on AWS EMR

Vinoth Chandar Thu, 13 Feb 2020 17:39:13 -0800

Hi Raghvendra,

Quick sidebar: please subscribe to the mailing list so your messages get
published automatically. :)


On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
<[email protected]> wrote:

> Hi Udit,
>
> Thanks for the information.
> Actually, I am struggling with the following points:
> 1 - How can we process S3 parquet files (hourly partitioned) through Apache
> Hudi? Is there any streaming layer we need to introduce?
> 2 - Is there any workaround to query a Hudi dataset from Athena? We are
> thinking of dumping the resulting Hudi dataset to S3 and then querying it
> from Athena.
> 3 - What should the parquet file size and row group size be for better
> performance when querying a Hudi dataset?
>
> Thanks
> Raghvendra
>
>
> On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <[email protected]> wrote:
>
> > Hi Raghvendra,
> >
> > You would have to re-write your Parquet Dataset in Hudi format. Here are
> > the links you can follow to get started:
> >
> >
> > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> >
> > Thanks,
> > Udit
> >
> > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > <[email protected]> wrote:
> >
> >     Hi Team,
> >
> >     I want to set up an incremental view of my AWS S3 parquet data through
> >     Apache Hudi, and want to query this data through Athena, but Athena
> >     does not currently support Hudi datasets.
> >
> >     So there are a few questions I would like to understand here:
> >
> >     1 - How to stream S3 parquet files to a Hudi dataset running on EMR.
> >
> >     2 - How to query a Hudi dataset running on EMR.
> >
> >     Please help me to understand this.
> >
> >     Thanks
> >
> >     Raghvendra
> >
> >
> >
>
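
For readers of the archive: Udit's suggestion above (re-writing the Parquet
dataset in Hudi format) might look roughly like the PySpark sketch below. It
is only an illustration under stated assumptions, not the method from the
linked guides. The option keys are standard Hudi DataSource write configs,
but the table name, column names, and S3 paths are hypothetical, and the Hudi
Spark bundle is assumed to be on the cluster's classpath.

```python
def hudi_write_options(table_name, record_key, partition_field,
                       precombine_field, operation="upsert"):
    """Build the option map for a Hudi DataSource write.

    All argument values are supplied by the caller; nothing here is
    specific to the dataset discussed in this thread.
    """
    return {
        "hoodie.table.name": table_name,
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to lay out partitions (e.g. an hourly bucket).
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Column used to pick the latest version among duplicate keys.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }


def rewrite_parquet_as_hudi(spark, source_path, target_path, options,
                            mode="append"):
    """Read plain Parquet and write it back out in Hudi format.

    `spark` is an existing SparkSession created by the caller.
    """
    df = spark.read.parquet(source_path)
    (df.write
       .format("org.apache.hudi")
       .options(**options)
       .mode(mode)
       .save(target_path))


# Hypothetical usage (paths and column names are placeholders):
#   opts = hudi_write_options("events", "event_id", "event_hour", "updated_at")
#   rewrite_parquet_as_hudi(spark, "s3://my-bucket/raw/2020/02/13/10/",
#                           "s3://my-bucket/hudi/events/", opts)
```

Running this once per newly arrived hourly partition would be one simple way
to keep the Hudi dataset current without a separate streaming layer; whether
that is adequate depends on the latency you need.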
