Got it. Thanks Udit!

On Wed, Feb 19, 2020 at 2:12 PM Mehrotra, Udit <[email protected]> wrote:
> Hi Sudha,
>
> Yes, EMR Presto since the 5.28.0 release comes with the Hudi presto
> bundle jar on the classpath. If you launch a cluster with Presto you
> should see it at:
>
> /usr/lib/presto/plugin/hive-hadoop2/hudi-presto-bundle.jar
>
> Thanks,
> Udit
>
> On 2/19/20, 1:53 PM, "Bhavani Sudha" <[email protected]> wrote:
>
> Hi Udit,
>
> Just a quick question on Presto EMR. Does EMR Presto ship the Hudi
> jars on its classpath?
>
> On Tue, Feb 18, 2020 at 12:03 PM Mehrotra, Udit <[email protected]> wrote:
>
> > The workaround Gary provided can help query Hudi Copy on Write tables
> > through Athena by basically querying only the latest commit's files
> > as standard parquet. It would definitely be worth documenting, as
> > several people have asked for it and I remember providing the same
> > suggestion on Slack earlier. I can add it if I have the perms.
> >
> > >> if I connect to the Hive catalog on EMR, which is able to provide
> > >> the Hudi views correctly, I should be able to get correct results
> > >> on Athena
> >
> > As Vinoth mentioned, just connecting to the metastore is not enough.
> > Athena would still use its own Presto, which does not support Hudi.
> >
> > As for Hudi support in Athena: Athena does use Presto, but it is
> > their own custom version, and I don't think they yet have the code
> > that the Hudi folks contributed to Presto, i.e. the split annotations
> > etc. They also don't have the Hudi jars on the Presto classpath. We
> > are not sure of any timelines for this support, but I have heard that
> > work should start soon.
> >
> > Thanks,
> > Udit
> >
> > On 2/18/20, 11:27 AM, "Vinoth Chandar" <[email protected]> wrote:
> >
> > Thanks everyone for chiming in, especially Gary for the detailed
> > workaround. (Should we FAQ this workaround..
> > food for thought.)
> >
> > >> if I connect to the Hive catalog on EMR, which is able to provide
> > >> the Hudi views correctly, I should be able to get correct results
> > >> on Athena
> >
> > Knowing how the Presto/Hudi integration works, simply being able to
> > read from the Hive metastore is not enough. Presto has code to
> > specially recognize Hudi tables and does an additional filtering
> > step, which lets it query the data in there correctly. (Gary's
> > workaround above keeps just 1 version around for a given file
> > (group).)
> >
> > On Mon, Feb 17, 2020 at 11:28 PM Gary Li <[email protected]> wrote:
> >
> > > Hello, I don't have any experience working with Athena, but I can
> > > share my experience working with Impala. There is a workaround.
> > > By setting the Hudi configs:
> > >
> > > - hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> > > - hoodie.cleaner.fileversions.retained=1
> > >
> > > your Hudi dataset will look the same as plain parquet files. You
> > > can create a table just like regular parquet. Hudi will write a new
> > > commit first, then delete the older file versions. You need to
> > > refresh the table metadata store as soon as the Hudi upsert job
> > > finishes. For Impala, it's simply REFRESH TABLE xxx. Between Hudi
> > > vacuuming the older files and the table metadata being refreshed,
> > > the table will be unavailable for query (1-5 mins in my case).
> > >
> > > How can we process S3 parquet files (hourly partitioned) through
> > > Apache Hudi? Is there any streaming layer we need to introduce?
> > > -----------
> > > Hudi DeltaStreamer supports parquet files. You can do a bulkInsert
> > > for the first job, then use DeltaStreamer for the upsert jobs.
> > >
> > > 3 - What should be the parquet file size and row group size for
> > > better performance on querying a Hudi dataset?
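[Editor's sketch] Gary's cleaner-policy workaround above could be wired into a Hudi write roughly as follows. This is a minimal PySpark-style sketch, not from the thread: the table name, record key, precombine field, and S3 path are all hypothetical placeholders.

```python
# Minimal sketch of Gary's workaround: retain only the latest file
# version so the Copy-on-Write table reads as plain parquet files.
# Table name, key/precombine fields, and path below are hypothetical.
hudi_options = {
    "hoodie.table.name": "events",                     # hypothetical
    "hoodie.datasource.write.recordkey.field": "id",   # hypothetical
    "hoodie.datasource.write.precombine.field": "ts",  # hypothetical
    "hoodie.datasource.write.operation": "upsert",
    # The workaround from the thread:
    "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    "hoodie.cleaner.fileversions.retained": "1",
}

# With a SparkSession and a DataFrame `df` in hand, the write would be:
# df.write.format("hudi").options(**hudi_options) \
#     .mode("append").save("s3://my-bucket/hudi/events")  # path illustrative
```

As Gary notes, the engine's metadata must be refreshed as soon as the upsert finishes (e.g. `REFRESH events` in Impala), and the table is briefly unqueryable between the clean and the refresh.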
> > > ----------
> > > That depends on the query engine you are using, and it should be
> > > documented somewhere. For Impala, the optimal size for query
> > > performance is 256MB, but a larger file size will make upserts more
> > > expensive. The size I personally choose is 100MB to 128MB.
> > >
> > > Thanks,
> > > Gary
> > >
> > > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu <[email protected]> wrote:
> > >
> > > > Athena is indeed Presto inside, but there is a lot of custom code
> > > > which has gone on top of Presto there.
> > > > A couple of months back I tried running a Glue crawler to catalog
> > > > a Hudi data set and then query it from Athena. The results were
> > > > not the same as what I would get running the same query using
> > > > Spark SQL on EMR. Did not try Presto on EMR, but assuming it will
> > > > work fine there.
> > > >
> > > > Athena integration with Hudi data sets is planned shortly, but
> > > > not sure of the date yet.
> > > >
> > > > However, Athena recently started supporting integration with a
> > > > Hive catalog apart from Glue. What that means is that in Athena,
> > > > if I connect to the Hive catalog on EMR, which is able to provide
> > > > the Hudi views correctly, I should be able to get correct results
> > > > on Athena. Have not tested it though. The feature is in Preview
> > > > already.
> > > >
> > > > Thanks
> > > > Raghu
> > > > -----Original Message-----
> > > > From: Shiyan Xu <[email protected]>
> > > > Sent: Tuesday, February 18, 2020 6:20 AM
> > > > To: [email protected]
> > > > Cc: Mehrotra, Udit <[email protected]>; Raghvendra Dhar Dubey
> > > > <[email protected]>
> > > > Subject: Re: Apache Hudi on AWS EMR
> > > >
> > > > For 2) I think running Presto on EMR lets you run read-optimized
> > > > queries.
> > > > I don't quite understand how exactly Athena does not support
> > > > Hudi, as it is Presto underlying.
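[Editor's sketch] Gary's 100-128MB suggestion maps onto Hudi's parquet sizing knobs. A small illustrative sketch follows; the 128MB target is just an example consistent with his range, and the right values depend on your query engine.

```python
# Sketch of Hudi's parquet sizing knobs (values are strings of bytes),
# targeting the ~128MB file size Gary suggests for Impala. The numbers
# are illustrative; tune per query engine.
MB = 1024 * 1024
sizing_options = {
    "hoodie.parquet.max.file.size": str(128 * MB),  # target max data file size
    "hoodie.parquet.block.size": str(128 * MB),     # parquet row group size
}
```

Larger files favor scan performance but make upserts more expensive, since a whole file is rewritten per updated file group.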
> > > > Perhaps @Udit could give some insights from AWS?
> > > >
> > > > As @Raghvendra mentioned, another option is to export the Hudi
> > > > dataset to plain parquet files for Athena to query.
> > > > RFC-9 is for this use case:
> > > >
> > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
> > > >
> > > > The task is inactive now. Feel free to pick it up if this is
> > > > something you'd like to work on. I'd be happy to help with that.
> > > >
> > > > On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <[email protected]> wrote:
> > > >
> > > > > Hi Raghvendra,
> > > > >
> > > > > Quick sidebar.. Please subscribe to the mailing list, so your
> > > > > messages get published automatically. :)
> > > > >
> > > > > On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi Udit,
> > > > > >
> > > > > > Thanks for the information.
> > > > > > Actually I am struggling with the following points:
> > > > > > 1 - How can we process S3 parquet files (hourly partitioned)
> > > > > > through Apache Hudi? Is there any streaming layer we need to
> > > > > > introduce?
> > > > > > 2 - Is there any workaround to query a Hudi dataset from
> > > > > > Athena? We are thinking of dumping the resulting Hudi dataset
> > > > > > to S3, and then querying it from Athena.
> > > > > > 3 - What should be the parquet file size and row group size
> > > > > > for better performance on querying a Hudi dataset?
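[Editor's sketch] The idea behind RFC-9 as discussed above is to rewrite the latest Hudi snapshot as plain parquet that Athena can query. A hypothetical PySpark sketch of that idea follows; the function name and paths are illustrative, not the RFC's actual tool.

```python
# Hypothetical sketch of the "export to plain parquet" idea behind
# RFC-9: read the latest snapshot of a Hudi table, drop the Hudi
# metadata columns, and rewrite it as plain parquet for Athena.
# Function name and paths are illustrative.
HUDI_META_COLS = [
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
]

def export_snapshot(spark, hudi_path, out_path):
    # Format name may vary by Hudi version ("org.apache.hudi" / "hudi").
    df = spark.read.format("org.apache.hudi").load(hudi_path)
    df.drop(*HUDI_META_COLS).write.mode("overwrite").parquet(out_path)
```

Unlike Gary's cleaner workaround, this keeps the Hudi table's retention settings intact at the cost of maintaining a second copy of the data.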
> > > > > >
> > > > > > Thanks
> > > > > > Raghvendra
> > > > > >
> > > > > > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Raghvendra,
> > > > > > >
> > > > > > > You would have to re-write your parquet dataset in Hudi
> > > > > > > format. Here are the links you can follow to get started:
> > > > > > >
> > > > > > > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > > > > > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Udit
> > > > > > >
> > > > > > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Team,
> > > > > > >
> > > > > > > I want to set up an incremental view of my AWS S3 parquet
> > > > > > > data through Apache Hudi, and want to query this data
> > > > > > > through Athena, but currently Athena does not support Hudi
> > > > > > > datasets.
> > > > > > >
> > > > > > > So there are a few questions which I want to understand
> > > > > > > here:
> > > > > > >
> > > > > > > 1 - How to stream S3 parquet files to a Hudi dataset
> > > > > > > running on EMR.
> > > > > > >
> > > > > > > 2 - How to query a Hudi dataset running on EMR.
> > > > > > >
> > > > > > > Please help me to understand this.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Raghvendra
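[Editor's sketch] For question 1 above, Gary's suggestion in this thread (a one-off bulkInsert, then DeltaStreamer upsert jobs reading the parquet source) could look roughly like the spark-submit invocation assembled below. The jar location, S3 paths, and table name are assumptions, and a real run needs additional source/schema properties not shown here.

```python
# Rough sketch of a spark-submit invocation for HoodieDeltaStreamer
# ingesting plain parquet files, per Gary's suggestion: run once with
# op=bulk_insert, then schedule op=upsert. Jar path, S3 paths, and
# table name are illustrative assumptions.
def delta_streamer_cmd(op):
    assert op in ("bulk_insert", "upsert")
    return [
        "spark-submit",
        "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        "/usr/lib/hudi/hudi-utilities-bundle.jar",           # EMR location, illustrative
        "--source-class", "org.apache.hudi.utilities.sources.ParquetDFSSource",
        "--target-base-path", "s3://my-bucket/hudi/events",  # hypothetical
        "--target-table", "events",                          # hypothetical
        "--op", op,
    ]
```

The first run would use `delta_streamer_cmd("bulk_insert")`; subsequent scheduled runs would use `delta_streamer_cmd("upsert")`.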
