Thanks for the quick response. I will try to create a snippet to reproduce the issue. For number 2), I am aware of the de-dup behavior; I'm pretty sure the precombine key is unique.
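For anyone following the thread: the precombine/de-dup behavior discussed here can be sketched as a minimal Python simulation (illustrative only, not actual Hudi code; the field names `uuid` and `ts` are hypothetical). Among incoming records that share the same record key, only the record with the largest precombine-field value survives, which is why a non-unique precombine key can shrink the dataset.

```python
# Minimal simulation of Hudi's precombine step (not actual Hudi code).
# Among incoming records with the same record key, only the record with
# the largest precombine-field value is kept; the rest are dropped.
def precombine(records, record_key, precombine_field):
    best = {}
    for rec in records:
        key = rec[record_key]
        if key not in best or rec[precombine_field] > best[key][precombine_field]:
            best[key] = rec
    return list(best.values())

# Hypothetical records: two share the key "a", so only one survives.
records = [
    {"uuid": "a", "ts": 1, "val": "old"},
    {"uuid": "a", "ts": 2, "val": "new"},
    {"uuid": "b", "ts": 1, "val": "x"},
]
deduped = precombine(records, "uuid", "ts")
```

If the record key is truly unique per record, this step is a no-op and the record count should be preserved.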
Thanks

On Mon, Nov 11, 2019 at 8:46 AM Vinoth Chandar <[email protected]> wrote:
> Hi,
>
> On 1, I am wondering if it's related to
> https://issues.apache.org/jira/browse/HUDI-83, i.e. support for
> timestamps. If you can give us a small snippet to reproduce the problem,
> that would be great.
>
> On 2, not sure what's going on; there are no size limitations. Please
> check that your precombine field and keys are correct. For example, if
> you pick a field/value that is the same in all records, then precombine
> will crunch them down to just one record, because that's what we ask it
> to do.
>
> On Sun, Nov 10, 2019 at 6:46 PM Zhengxiang Pan <[email protected]> wrote:
> > Hi,
> > I am new to Hudi; my first attempt is to convert my existing dataframe
> > to a Hudi-managed dataset. I followed the quick-start guide and option
> > (2) or (3) in the migration guide, and hit two issues.
> >
> > 1) I got the following error when later using Append mode to upsert the
> > data:
> >
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task 4
> > in stage 23.0 failed 4 times, most recent failure: Lost task 4.3 in stage
> > 23.0 (TID 74, tkcnode49.alphonso.tv, executor 7):
> > org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :4
> >   at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:261)
> >   at org.apache.hudi.HoodieWriteClient.lambda$upsertRecordsInternal$507693af$1(HoodieWriteClient.java:428)
> >   at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
> >   at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
> >   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
> >   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
> >   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> >   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> >   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> >   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> >   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> >   at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
> >   at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> >   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> >   at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> >   at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> >   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> >   at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> >   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> >   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> >   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> >   at org.apache.spark.scheduler.Task.run(Task.scala:121)
> >   at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> >   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> >   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >   at java.lang.Thread.run(Thread.java:748)
> >
> > I noticed that the "Date" type is converted to "Long" type in the Hudi
> > dataset.
> >
> > As a workaround, I save my dataframe to JSONL and read it back before
> > saving it to the Hudi-managed dataset.
> >
> > Are there any requirements to convert the data schema explicitly from my
> > original dataframe?
> >
> > 2) Even after I got around the first issue, the number of records in the
> > Hudi-managed data is far less than in my original dataframe.
> >
> > Is there any size limitation on a Hudi dataset?
> >
> > Thanks
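A note on the Date-to-Long observation above: this is likely related to how Avro (which Hudi uses for record schemas) represents temporal types via logical types over integers: a date becomes days since the Unix epoch, and a timestamp becomes epoch milliseconds (or microseconds). HUDI-83, linked earlier in the thread, concerns support for such timestamp types. Below is a minimal Python sketch of that encoding, purely illustrative and not Hudi's actual code path:

```python
from datetime import date, datetime, timezone

# Sketch of how Avro/columnar formats commonly encode temporal types
# as plain integers, which is why a Date column can surface as a
# numeric (Long-like) type after conversion.
def date_to_epoch_days(d: date) -> int:
    """A date is stored as the number of days since 1970-01-01."""
    return (d - date(1970, 1, 1)).days

def timestamp_to_epoch_millis(dt: datetime) -> int:
    """A timestamp is stored as milliseconds since the Unix epoch (UTC)."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)

days = date_to_epoch_days(date(2019, 11, 10))                 # 18210
millis = timestamp_to_epoch_millis(datetime(2019, 11, 10))    # midnight UTC
```

Reading such a value back as a proper date requires applying the logical type, which is why the raw dataset can appear to hold Longs where the source dataframe held Dates.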
