Filed https://issues.apache.org/jira/browse/HUDI-648 to track error tables.
Please ping on the ticket if anyone is interested in picking it up.

On Fri, Feb 28, 2020 at 4:58 AM Raghvendra Dhar Dubey <[email protected]> wrote:
> Hi Udit,
>
> I tried Hudi version 0.5.1 and it worked fine; this issue appeared with
> Hudi 0.5.0. The other EMR-related issues have been discussed with Rahul.
> Thanks to all of you for the cooperation.
>
> Thanks
> Raghvendra
>
> On Fri, Feb 28, 2020 at 5:34 AM Mehrotra, Udit <[email protected]> wrote:
>
> > Raghvendra,
> >
> > Can you enable TRACE level logging for Hudi on EMR and provide the error
> > logs? For this, go to /etc/spark/conf/log4j.properties and change the
> > logging level of log4j.logger.org.apache.hudi to TRACE. This would help
> > provide the failed records/keys based off
> > https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L287
> >
> > Another thing that would help is to provide the Avro schema that gets
> > printed on the driver when you run your job. We need to understand which
> > field is treated as INT96 and why, because the current parquet-avro does
> > not handle its conversion. Also, for any other questions about EMR, we
> > can discuss them in the meeting you have set up with Rahul from the EMR
> > team.
> >
> > Thanks,
> > Udit
> >
> > On 2/27/20, 11:00 AM, "Shiyan Xu" <[email protected]> wrote:
> >
> >     +1 on the idea. Giving a config like `--error-path` where all failed
> >     conversions are saved provides flexibility for later processing.
> >     SQS/SNS can pick that up later.
> >
> >     On Thu, Feb 27, 2020 at 8:10 AM Vinoth Chandar <[email protected]> wrote:
> >
> > > On the second part, it seems like a question for the EMR folks?
> > >
> > > Hudi's RDD-level APIs do hand the failed records back. Maybe we
> > > should consider writing out the error records somewhere for the
> > > datasource as well? Others, any thoughts?
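[Editor's note: the logging change Udit describes above is a one-line edit. A sketch of the relevant entry, assuming the stock EMR log4j layout; the logger name and file path are taken from the message above:]

```properties
# /etc/spark/conf/log4j.properties on the EMR master node
# Raise Hudi logging to TRACE so failed record keys show up in the logs
log4j.logger.org.apache.hudi=TRACE
```

TRACE is verbose, so it is worth reverting the level to INFO once the failing records have been captured.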
> > > On Mon, Feb 24, 2020 at 10:59 PM Raghvendra Dhar Dubey <[email protected]> wrote:
> > >
> > > > Thanks Gary and Udit,
> > > >
> > > > I tried the HoodieDeltaStreamer for reading parquet files from S3, but
> > > > there is an issue where the AvroSchemaConverter is not able to convert
> > > > Parquet INT96. So I thought to use Spark Structured Streaming to read
> > > > data from S3 and write into Hudi. But as Databricks provides
> > > > "cloudfiles" for failure handling, is there something similar in EMR?
> > > > Or do we need to handle this failure manually by introducing SQS and
> > > > SNS?
> > > >
> > > > On 2020/02/18 20:03:16, "Mehrotra, Udit" <[email protected]> wrote:
> > > > > The workaround provided by Gary can help query Hudi tables through
> > > > > Athena for Copy on Write tables, by basically querying only the
> > > > > latest commit files as standard parquet. It would definitely be
> > > > > worth documenting, as several people have asked for it and I
> > > > > remember providing the same suggestion on Slack earlier. I can add
> > > > > it if I have the perms.
> > > > >
> > > > > >> if I connect to the Hive catalog on EMR, which is able to provide
> > > > > >> the Hudi views correctly, I should be able to get correct results
> > > > > >> on Athena
> > > > >
> > > > > As Vinoth mentioned, just connecting to the metastore is not enough.
> > > > > Athena would still use its own Presto, which does not support Hudi.
> > > > >
> > > > > As for Hudi support in Athena: Athena does use Presto, but it's
> > > > > their own custom version, and I don't think they yet have the code
> > > > > that the Hudi folks contributed to Presto, i.e. the split
> > > > > annotations etc. Also, they don't have the Hudi jars in the Presto
> > > > > classpath. We are not sure of any timelines for this support, but I
> > > > > have heard that work should start soon.
> > > > > Thanks,
> > > > > Udit
> > > > >
> > > > > On 2/18/20, 11:27 AM, "Vinoth Chandar" <[email protected]> wrote:
> > > > >
> > > > >     Thanks everyone for chiming in, especially Gary for the detailed
> > > > >     workaround. (Should we FAQ this workaround? Food for thought.)
> > > > >
> > > > >     >> if I connect to the Hive catalog on EMR, which is able to
> > > > >     >> provide the Hudi views correctly, I should be able to get
> > > > >     >> correct results on Athena
> > > > >
> > > > >     Knowing how the Presto/Hudi integration works, simply being able
> > > > >     to read from the Hive metastore is not enough. Presto has code
> > > > >     to specially recognize Hudi tables and does an additional
> > > > >     filtering step, which lets it query the data in there correctly.
> > > > >     (Gary's workaround above keeps just one version around for a
> > > > >     given file (group).)
> > > > >
> > > > >     On Mon, Feb 17, 2020 at 11:28 PM Gary Li <[email protected]> wrote:
> > > > >
> > > > > > Hello, I don't have any experience working with Athena, but I can
> > > > > > share my experience working with Impala. There is a workaround.
> > > > > > By setting the Hudi configs:
> > > > > >
> > > > > > - hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> > > > > > - hoodie.cleaner.fileversions.retained=1
> > > > > >
> > > > > > you will have your Hudi dataset the same as plain parquet files.
> > > > > > You can create a table just like regular parquet. Hudi will write
> > > > > > a new commit first, then delete the older files that have two
> > > > > > versions. You need to refresh the table metadata store as soon as
> > > > > > the Hudi upsert job finishes. For Impala, it's simply REFRESH
> > > > > > TABLE xxx.
> > > > > > After Hudi has vacuumed the older files and before the table
> > > > > > metastore is refreshed, the table will be unavailable for query
> > > > > > (1-5 mins in my case).
> > > > > >
> > > > > > How can we process S3 parquet files (hourly partitioned) through
> > > > > > Apache Hudi? Is there any streaming layer we need to introduce?
> > > > > > -----------
> > > > > > The Hudi DeltaStreamer supports parquet files. You can do a
> > > > > > bulkInsert for the first job, then use the DeltaStreamer for the
> > > > > > upsert job.
> > > > > >
> > > > > > 3 - What should be the parquet file size and row group size for
> > > > > > better performance on querying the Hudi dataset?
> > > > > > ----------
> > > > > > That depends on the query engine you are using, and it should be
> > > > > > documented somewhere. For Impala, the optimal size for query
> > > > > > performance is 256MB, but a larger file size will make upserts
> > > > > > more expensive. The size I personally choose is 100MB to 128MB.
> > > > > >
> > > > > > Thanks,
> > > > > > Gary
> > > > > >
> > > > > > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu <[email protected]> wrote:
> > > > > >
> > > > > > > Athena is indeed Presto inside, but there is a lot of custom
> > > > > > > code which has gone on top of Presto there.
> > > > > > > A couple of months back I tried running a Glue crawler to
> > > > > > > catalog a Hudi data set and then query it from Athena. The
> > > > > > > results were not the same as what I would get by running the
> > > > > > > same query using Spark SQL on EMR. I did not try Presto on EMR,
> > > > > > > but I assume it will work fine there.
> > > > > > >
> > > > > > > Athena integration with Hudi data sets is planned shortly, but
> > > > > > > I am not sure of the date yet.
> > > > > > > However, Athena recently started supporting integration with a
> > > > > > > Hive catalog apart from Glue. What that means is that if, in
> > > > > > > Athena, I connect to the Hive catalog on EMR, which is able to
> > > > > > > provide the Hudi views correctly, I should be able to get
> > > > > > > correct results on Athena. I have not tested it, though. The
> > > > > > > feature is in Preview already.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Raghu
> > > > > > > -----Original Message-----
> > > > > > > From: Shiyan Xu <[email protected]>
> > > > > > > Sent: Tuesday, February 18, 2020 6:20 AM
> > > > > > > To: [email protected]
> > > > > > > Cc: Mehrotra, Udit <[email protected]>; Raghvendra Dhar Dubey <[email protected]>
> > > > > > > Subject: Re: Apache Hudi on AWS EMR
> > > > > > >
> > > > > > > For 2), I think running Presto on EMR is able to let you run
> > > > > > > read-optimized queries.
> > > > > > > I don't quite understand how exactly Athena does not support
> > > > > > > Hudi, as it is Presto underlying. Perhaps @Udit could give some
> > > > > > > insights from AWS?
> > > > > > >
> > > > > > > As @Raghvendra mentioned, another option is to export the Hudi
> > > > > > > dataset to plain parquet files for Athena to query.
> > > > > > > RFC-9 is for this use case:
> > > > > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
> > > > > > > The task is inactive now. Feel free to pick it up if this is
> > > > > > > something you'd like to work on. I'd be happy to help with that.
> > > > > > >
> > > > > > > On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Raghvendra,
> > > > > > > >
> > > > > > > > Quick sidebar:
> > > > > > > > Please subscribe to the mailing list, so your messages get
> > > > > > > > published automatically. :)
> > > > > > > >
> > > > > > > > On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Udit,
> > > > > > > > >
> > > > > > > > > Thanks for the information.
> > > > > > > > > Actually, I am struggling on the following points:
> > > > > > > > > 1 - How can we process S3 parquet files (hourly partitioned)
> > > > > > > > > through Apache Hudi? Is there any streaming layer we need to
> > > > > > > > > introduce?
> > > > > > > > > 2 - Is there any workaround to query a Hudi dataset from
> > > > > > > > > Athena? We are thinking to dump the resulting Hudi dataset
> > > > > > > > > to S3, and then query it from Athena.
> > > > > > > > > 3 - What should be the parquet file size and row group size
> > > > > > > > > for better performance on querying the Hudi dataset?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Raghvendra
> > > > > > > > >
> > > > > > > > > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Raghvendra,
> > > > > > > > > >
> > > > > > > > > > You would have to re-write your Parquet dataset in Hudi
> > > > > > > > > > format.
> > > > > > > > > > Here are the links you can follow to get started:
> > > > > > > > > >
> > > > > > > > > > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > > > > > > > > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Udit
> > > > > > > > > >
> > > > > > > > > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey" <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > >     Hi Team,
> > > > > > > > > >
> > > > > > > > > >     I want to set up an incremental view of my AWS S3
> > > > > > > > > >     parquet data through Apache Hudi, and want to query
> > > > > > > > > >     this data through Athena, but currently Athena does
> > > > > > > > > >     not support Hudi datasets.
> > > > > > > > > >
> > > > > > > > > >     So there are a few questions which I want to
> > > > > > > > > >     understand here:
> > > > > > > > > >
> > > > > > > > > >     1 - How to stream S3 parquet files to a Hudi dataset
> > > > > > > > > >     running on EMR.
> > > > > > > > > >
> > > > > > > > > >     2 - How to query a Hudi dataset running on EMR.
> > > > > > > > > >
> > > > > > > > > >     Please help me to understand this.
> > > > > > > > > >
> > > > > > > > > >     Thanks
> > > > > > > > > >
> > > > > > > > > >     Raghvendra
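[Editor's note: Gary's Impala workaround earlier in the thread reduces to two writer configs. A sketch of how they might appear in a Hudi properties file; the keys and values are quoted verbatim from the thread, while where you set them (a props file, or equivalent Spark datasource options) depends on your job:]

```properties
# Keep only the latest file version so the dataset on storage reads as
# plain parquet (Gary's workaround for engines without Hudi support)
hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained=1
```

As noted in the thread, the table metadata must be refreshed right after each upsert (e.g. REFRESH TABLE xxx in Impala), since the table is unqueryable in the window between the cleaner deleting old files and the refresh.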
