Hi Udit,

Thanks for the information. I am still struggling with the following points:

1 - How can we process hourly partitioned S3 Parquet files through Apache Hudi? Do we need to introduce a streaming layer?

2 - Is there any workaround for querying a Hudi dataset from Athena? We are thinking of dumping the resulting Hudi dataset to S3 and then querying it from Athena.

3 - What Parquet file size and row group size should we use for better query performance on a Hudi dataset?

Thanks,
Raghvendra

On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <udi...@amazon.com> wrote:

> Hi Raghvendra,
>
> You would have to rewrite your Parquet dataset in Hudi format. Here are
> the links you can follow to get started:
>
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
>
> Thanks,
> Udit
>
> On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> <raghvendra.d.du...@delhivery.com.INVALID> wrote:
>
>     Hi Team,
>
>     I want to set up an incremental view of my AWS S3 Parquet data through
>     Apache Hudi, and want to query this data through Athena, but Athena
>     currently does not support Hudi datasets.
>
>     So there are a few questions I want to understand here:
>
>     1 - How to stream S3 Parquet files to a Hudi dataset running on EMR.
>
>     2 - How to query a Hudi dataset running on EMR.
>
>     Please help me to understand this.
>
>     Thanks,
>     Raghvendra
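
For reference, the rewrite into Hudi format that Udit describes might look roughly like the PySpark sketch below. It is only an illustration under stated assumptions, not a tested setup: the function names, the table name, the column names (`shipment_id`, `updated_at`, `event_hour`) and the S3 paths are all hypothetical, and it assumes a Spark session on EMR with the Hudi Spark bundle on the classpath. One possible approach, shown here, is a scheduled batch job that upserts each new hourly partition rather than a separate streaming layer.

```python
def hudi_write_options(table_name, record_key, partition_field,
                       precombine_field, operation="upsert"):
    """Build the Hudi datasource options for a Spark write.

    All argument values are placeholders chosen by the caller; the
    option keys are Hudi's datasource write configuration keys.
    """
    return {
        "hoodie.table.name": table_name,
        # Column that uniquely identifies a record
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to lay out partitions in the Hudi dataset
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Column used to pick the latest version when keys collide
        "hoodie.datasource.write.precombine.field": precombine_field,
        # "upsert" for incremental loads; "bulk_insert" may suit a first load
        "hoodie.datasource.write.operation": operation,
    }


def rewrite_hour_as_hudi(source_path, target_path, options):
    """Read one hourly Parquet partition and write it into a Hudi dataset.

    pyspark is imported here so that the options helper above can be
    used without Spark installed. The Hudi Spark bundle is assumed to
    be available to the Spark session.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-to-hudi").getOrCreate()
    df = spark.read.parquet(source_path)
    (df.write
       .format("org.apache.hudi")
       .options(**options)
       .mode("append")  # append so existing Hudi data is kept and upserted into
       .save(target_path))
```

A job scheduled once per hour could then call `rewrite_hour_as_hudi` with that hour's source prefix (for example a hypothetical `s3://my-bucket/raw/dt=2020-02-13/hour=05/`), a target such as `s3://my-bucket/hudi/shipments`, and the options returned by `hudi_write_options("shipments", "shipment_id", "event_hour", "updated_at")`.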