Re: Apache Hudi on AWS EMR

Raghvendra Dhar Dubey Thu, 13 Feb 2020 17:33:13 -0800

Hi Udit,

Thanks for the information.
I am actually struggling with the following points:

1 - How can we process S3 Parquet files (hourly partitioned) through Apache
Hudi? Is there any streaming layer we need to introduce?

2 - Is there any workaround to query a Hudi dataset from Athena? We are
thinking of dumping the resulting Hudi dataset to S3 and then querying it
from Athena.

3 - What should the Parquet file size and row group size be for better
performance when querying a Hudi dataset?
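
To make questions 1 and 3 concrete, here is a minimal PySpark sketch of rewriting one hourly Parquet partition into a Hudi dataset as a plain batch job, with no streaming layer. It is only an illustration under assumptions: the helper names (hudi_write_options, rewrite_hour), the column names and the S3 paths are hypothetical, and the 120 MB sizes are placeholder values rather than tuning advice. It also assumes a SparkSession that already has the Hudi bundle on its classpath, as on an EMR cluster with Hudi installed.

```python
from typing import Dict

MB = 1024 * 1024


def hudi_write_options(
    table_name: str,
    record_key: str,
    partition_field: str,
    precombine_field: str,
    max_file_size_mb: int = 120,   # illustrative value, not a recommendation
    row_group_size_mb: int = 120,  # illustrative value, not a recommendation
) -> Dict[str, str]:
    """Build Hudi datasource write options for an upsert."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column whose value becomes the partition path (e.g. the hour).
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Column used to pick the latest version of a record.
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Target Parquet file size and Parquet row group (block) size, in bytes.
        "hoodie.parquet.max.file.size": str(max_file_size_mb * MB),
        "hoodie.parquet.block.size": str(row_group_size_mb * MB),
    }


def rewrite_hour(spark, source_path: str, target_path: str,
                 options: Dict[str, str]) -> None:
    """Read one hourly Parquet partition and upsert it into a Hudi dataset.

    `spark` is expected to be a SparkSession with the Hudi bundle available.
    """
    df = spark.read.parquet(source_path)
    (df.write
       .format("org.apache.hudi")
       .options(**options)
       .mode("append")
       .save(target_path))
```

A scheduled job could then call something like rewrite_hour(spark, "s3://my-bucket/raw/2020/02/13/17/", "s3://my-bucket/hudi/events", hudi_write_options("events", "event_id", "event_hour", "updated_at")) once per hour; the bucket, table and column names here are made up.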


Thanks
Raghvendra


On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <udi...@amazon.com> wrote:

> Hi Raghvendra,
>
> You would have to rewrite your Parquet dataset in Hudi format. Here are
> the links you can follow to get started:
>
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
>
> Thanks,
> Udit
>
> On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> <raghvendra.d.du...@delhivery.com.INVALID> wrote:
>
>     Hi Team,
>
>     I want to set up an incremental view of my AWS S3 Parquet data through
>     Apache Hudi and query this data through Athena, but Athena currently
>     does not support Hudi datasets.
>
>     So there are a few questions I would like to understand:
>
>     1 - How do I stream S3 Parquet files to a Hudi dataset running on EMR?
>
>     2 - How do I query a Hudi dataset running on EMR?
>
>     Please help me understand this.
>
>     Thanks
>
>     Raghvendra
>
>
>
