Got it. Thanks Udit!

On Wed, Feb 19, 2020 at 2:12 PM Mehrotra, Udit <[email protected]> wrote:
> Hi Sudha,
>
> Yes, EMR Presto since the 5.28.0 release comes with the Hudi presto
> bundle jar on the classpath. If you launch a cluster with Presto you
> should see it at:
>
> /usr/lib/presto/plugin/hive-hadoop2/hudi-presto-bundle.jar
>
> Thanks,
> Udit
>
> On 2/19/20, 1:53 PM, "Bhavani Sudha" <[email protected]> wrote:
>
> Hi Udit,
>
> Just a quick question on Presto EMR. Does EMR Presto ship the Hudi
> jars on its classpath?
>
> On Tue, Feb 18, 2020 at 12:03 PM Mehrotra, Udit <[email protected]> wrote:
>
> > The workaround Gary provided can help query Hudi Copy on Write tables
> > through Athena by basically querying only the latest commit's files
> > as standard parquet. It would definitely be worth documenting, as
> > several people have asked for it and I remember providing the same
> > suggestion on Slack earlier. I can add it if I have the perms.
> >
> > >> if I connect to the Hive catalog on EMR, which is able to provide
> > >> the Hudi views correctly, I should be able to get correct results
> > >> on Athena
> >
> > As Vinoth mentioned, just connecting to the metastore is not enough.
> > Athena would still use its own Presto, which does not support Hudi.
> >
> > As for Hudi support in Athena: Athena does use Presto, but it is
> > their own custom version, and I don't think they yet have the code
> > that the Hudi folks contributed to Presto, i.e. the split annotations
> > etc. They also don't have the Hudi jars on the Presto classpath. We
> > are not sure of any timelines for this support, but I have heard that
> > work should start soon.
> >
> > Thanks,
> > Udit
> >
> > On 2/18/20, 11:27 AM, "Vinoth Chandar" <[email protected]> wrote:
> >
> > Thanks everyone for chiming in, especially Gary for the detailed
> > workaround. (Should we FAQ this workaround..
> > food for thought.)
> >
> > >> if I connect to the Hive catalog on EMR, which is able to provide
> > >> the Hudi views correctly, I should be able to get correct results
> > >> on Athena
> >
> > Knowing how the Presto/Hudi integration works, simply being able to
> > read from the Hive metastore is not enough. Presto has code to
> > specially recognize Hudi tables and does an additional filtering
> > step, which lets it query the data in there correctly. (Gary's
> > workaround above keeps just 1 version around for a given file
> > (group).)
> >
> > On Mon, Feb 17, 2020 at 11:28 PM Gary Li <[email protected]> wrote:
> >
> > > Hello, I don't have any experience working with Athena, but I can
> > > share my experience working with Impala. There is a workaround.
> > > By setting the Hudi configs:
> > >
> > > - hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> > > - hoodie.cleaner.fileversions.retained=1
> > >
> > > your Hudi dataset will look the same as plain parquet files. You
> > > can create a table just like regular parquet. Hudi will write a new
> > > commit first, then delete the older file versions. You need to
> > > refresh the table metadata store as soon as the Hudi upsert job
> > > finishes. For Impala, it's simply REFRESH TABLE xxx. Between Hudi
> > > vacuuming the older files and the table metadata being refreshed,
> > > the table will be unavailable for query (1-5 mins in my case).
> > >
> > > How can we process S3 parquet files (hourly partitioned) through
> > > Apache Hudi? Is there any streaming layer we need to introduce?
> > > -----------
> > > Hudi DeltaStreamer supports parquet files. You can do a bulkInsert
> > > for the first job, then use DeltaStreamer for the upsert jobs.
> > >
> > > 3 - What should be the parquet file size and row group size for
> > > better performance on querying a Hudi dataset?
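[Editor's sketch] Gary's cleaner-policy workaround above could be wired into a Hudi write roughly as follows. This is a minimal PySpark-style sketch, not from the thread: the table name, record key, precombine field, and S3 path are all hypothetical placeholders.

```python
# Minimal sketch of Gary's workaround: retain only the latest file
# version so the Copy-on-Write table reads as plain parquet files.
# Table name, key/precombine fields, and path below are hypothetical.
hudi_options = {
    "hoodie.table.name": "events",                     # hypothetical
    "hoodie.datasource.write.recordkey.field": "id",   # hypothetical
    "hoodie.datasource.write.precombine.field": "ts",  # hypothetical
    "hoodie.datasource.write.operation": "upsert",
    # The workaround from the thread:
    "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    "hoodie.cleaner.fileversions.retained": "1",
}

# With a SparkSession and a DataFrame `df` in hand, the write would be:
# df.write.format("hudi").options(**hudi_options) \
#     .mode("append").save("s3://my-bucket/hudi/events")  # path illustrative
```

As Gary notes, the engine's metadata must be refreshed as soon as the upsert finishes (e.g. `REFRESH events` in Impala), and the table is briefly unqueryable between the clean and the refresh.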
> > > ----------
> > > That depends on the query engine you are using, and it should be
> > > documented somewhere. For Impala, the optimal size for query
> > > performance is 256MB, but a larger file size will make upserts more
> > > expensive. The size I personally choose is 100MB to 128MB.
> > >
> > > Thanks,
> > > Gary
> > >
> > > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu <[email protected]> wrote:
> > >
> > > > Athena is indeed Presto inside, but there is a lot of custom code
> > > > which has gone on top of Presto there.
> > > > A couple of months back I tried running a Glue crawler to catalog
> > > > a Hudi data set and then query it from Athena. The results were
> > > > not the same as what I would get running the same query using
> > > > Spark SQL on EMR. Did not try Presto on EMR, but assuming it will
> > > > work fine there.
> > > >
> > > > Athena integration with Hudi data sets is planned shortly, but
> > > > not sure of the date yet.
> > > >
> > > > However, Athena recently started supporting integration with a
> > > > Hive catalog apart from Glue. What that means is that in Athena,
> > > > if I connect to the Hive catalog on EMR, which is able to provide
> > > > the Hudi views correctly, I should be able to get correct results
> > > > on Athena. Have not tested it though. The feature is in Preview
> > > > already.
> > > >
> > > > Thanks
> > > > Raghu
> > > > -----Original Message-----
> > > > From: Shiyan Xu <[email protected]>
> > > > Sent: Tuesday, February 18, 2020 6:20 AM
> > > > To: [email protected]
> > > > Cc: Mehrotra, Udit <[email protected]>; Raghvendra Dhar Dubey
> > > > <[email protected]>
> > > > Subject: Re: Apache Hudi on AWS EMR
> > > >
> > > > For 2) I think running Presto on EMR lets you run read-optimized
> > > > queries.
> > > > I don't quite understand how exactly Athena does not support
> > > > Hudi, as it is Presto underlying.
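[Editor's sketch] Gary's 100-128MB suggestion maps onto Hudi's parquet sizing knobs. A small illustrative sketch follows; the 128MB target is just an example consistent with his range, and the right values depend on your query engine.

```python
# Sketch of Hudi's parquet sizing knobs (values are strings of bytes),
# targeting the ~128MB file size Gary suggests for Impala. The numbers
# are illustrative; tune per query engine.
MB = 1024 * 1024
sizing_options = {
    "hoodie.parquet.max.file.size": str(128 * MB),  # target max data file size
    "hoodie.parquet.block.size": str(128 * MB),     # parquet row group size
}
```

Larger files favor scan performance but make upserts more expensive, since a whole file is rewritten per updated file group.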
> > > > Perhaps @Udit could give some insights from AWS?
> > > >
> > > > As @Raghvendra mentioned, another option is to export the Hudi
> > > > dataset to plain parquet files for Athena to query.
> > > > RFC-9 is for this use case:
> > > >
> > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
> > > >
> > > > The task is inactive now. Feel free to pick it up if this is
> > > > something you'd like to work on. I'd be happy to help with that.
> > > >
> > > > On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <[email protected]> wrote:
> > > >
> > > > > Hi Raghvendra,
> > > > >
> > > > > Quick sidebar.. Please subscribe to the mailing list, so your
> > > > > messages get published automatically. :)
> > > > >
> > > > > On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi Udit,
> > > > > >
> > > > > > Thanks for the information.
> > > > > > Actually I am struggling with the following points:
> > > > > > 1 - How can we process S3 parquet files (hourly partitioned)
> > > > > > through Apache Hudi? Is there any streaming layer we need to
> > > > > > introduce?
> > > > > > 2 - Is there any workaround to query a Hudi dataset from
> > > > > > Athena? We are thinking of dumping the resulting Hudi dataset
> > > > > > to S3, and then querying it from Athena.
> > > > > > 3 - What should be the parquet file size and row group size
> > > > > > for better performance on querying a Hudi dataset?
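[Editor's sketch] The idea behind RFC-9 as discussed above is to rewrite the latest Hudi snapshot as plain parquet that Athena can query. A hypothetical PySpark sketch of that idea follows; the function name and paths are illustrative, not the RFC's actual tool.

```python
# Hypothetical sketch of the "export to plain parquet" idea behind
# RFC-9: read the latest snapshot of a Hudi table, drop the Hudi
# metadata columns, and rewrite it as plain parquet for Athena.
# Function name and paths are illustrative.
HUDI_META_COLS = [
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
]

def export_snapshot(spark, hudi_path, out_path):
    # Format name may vary by Hudi version ("org.apache.hudi" / "hudi").
    df = spark.read.format("org.apache.hudi").load(hudi_path)
    df.drop(*HUDI_META_COLS).write.mode("overwrite").parquet(out_path)
```

Unlike Gary's cleaner workaround, this keeps the Hudi table's retention settings intact at the cost of maintaining a second copy of the data.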
> > > > > >
> > > > > > Thanks
> > > > > > Raghvendra
> > > > > >
> > > > > > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Raghvendra,
> > > > > > >
> > > > > > > You would have to re-write your parquet dataset in Hudi
> > > > > > > format. Here are the links you can follow to get started:
> > > > > > >
> > > > > > > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > > > > > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Udit
> > > > > > >
> > > > > > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Team,
> > > > > > >
> > > > > > > I want to set up an incremental view of my AWS S3 parquet
> > > > > > > data through Apache Hudi, and want to query this data
> > > > > > > through Athena, but currently Athena does not support Hudi
> > > > > > > datasets.
> > > > > > >
> > > > > > > So there are a few questions which I want to understand
> > > > > > > here:
> > > > > > >
> > > > > > > 1 - How to stream S3 parquet files to a Hudi dataset
> > > > > > > running on EMR.
> > > > > > >
> > > > > > > 2 - How to query a Hudi dataset running on EMR.
> > > > > > >
> > > > > > > Please help me to understand this.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Raghvendra
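[Editor's sketch] For question 1 above, Gary's suggestion in this thread (a one-off bulkInsert, then DeltaStreamer upsert jobs reading the parquet source) could look roughly like the spark-submit invocation assembled below. The jar location, S3 paths, and table name are assumptions, and a real run needs additional source/schema properties not shown here.

```python
# Rough sketch of a spark-submit invocation for HoodieDeltaStreamer
# ingesting plain parquet files, per Gary's suggestion: run once with
# op=bulk_insert, then schedule op=upsert. Jar path, S3 paths, and
# table name are illustrative assumptions.
def delta_streamer_cmd(op):
    assert op in ("bulk_insert", "upsert")
    return [
        "spark-submit",
        "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        "/usr/lib/hudi/hudi-utilities-bundle.jar",           # EMR location, illustrative
        "--source-class", "org.apache.hudi.utilities.sources.ParquetDFSSource",
        "--target-base-path", "s3://my-bucket/hudi/events",  # hypothetical
        "--target-table", "events",                          # hypothetical
        "--op", op,
    ]
```

The first run would use `delta_streamer_cmd("bulk_insert")`; subsequent scheduled runs would use `delta_streamer_cmd("upsert")`.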
