[GitHub] [hudi] huangxiaopingRD opened a new pull request, #8179: [HUDI-5932] Make the combine step in Call run_bootstrap Procedure optional

via GitHub Tue, 14 Mar 2023 03:07:06 -0700


huangxiaopingRD opened a new pull request, #8179:
URL: https://github.com/apache/hudi/pull/8179


   
   ### Change Logs
   
   In the existing implementation, if the preCombine field is not specified, 
the default value (`ts`) is used. In a full-record bootstrap the source table 
may not contain a `ts` field, so generating the input records fails with the 
exception below. This change makes the preCombine field optional when 
executing bootstrap.
   ```
   Caused by: org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[timestamp, _row_key, partition_path, rider, driver, begin_lat, begin_lon, end_lat, end_lon, fare, tip_history, _hoodie_is_deleted, datestr]
        at org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:557)
        at org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldValAsString(HoodieAvroUtils.java:535)
        at org.apache.hudi.bootstrap.SparkFullBootstrapDataProviderBase.lambda$generateInputRecords$cbf13809$1(SparkFullBootstrapDataProviderBase.java:87)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   
   ```
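
The failure mode can be illustrated with a small standalone sketch. This is not Hudi code: `ordering_value`, `DEFAULT_PRECOMBINE_FIELD`, the sample record, and the placeholder ordering value `0` are all hypothetical names and choices made for illustration. It only shows why falling back to a default field name breaks on a schema without that field, and how treating the field as optional avoids the lookup.

```python
from typing import Any, Mapping, Optional

# Default preCombine field named in the description above.
DEFAULT_PRECOMBINE_FIELD = "ts"


def ordering_value(record: Mapping[str, Any], precombine_field: Optional[str]) -> Any:
    """Return the value used to order duplicate records.

    Hypothetical helper: when no preCombine field is configured, skip the
    lookup and return a constant placeholder (an assumption of this sketch).
    """
    if not precombine_field:
        return 0
    if precombine_field not in record:
        raise KeyError(
            f"{precombine_field} field not found in record. "
            f"Acceptable fields were: {sorted(record)}"
        )
    return record[precombine_field]


# A record whose schema has "timestamp" but no "ts", as in the stack trace.
record = {"timestamp": 1678788426, "_row_key": "key-1", "rider": "rider-1"}

# Falling back to the default "ts" fails for this schema.
try:
    ordering_value(record, DEFAULT_PRECOMBINE_FIELD)
    default_lookup_failed = False
except KeyError:
    default_lookup_failed = True

# Leaving the field unset skips the lookup entirely.
value_without_precombine = ordering_value(record, None)
```

In this sketch an explicitly configured field such as `timestamp` is still honoured; only the implicit fallback to `ts` is removed.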
   
   ### Impact
   
   Users do not need to specify preCombine when executing bootstrap.
   
   ### Risk level (write none, low, medium or high below)
   
   None
   
   ### Documentation Update
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


