[
https://issues.apache.org/jira/browse/HUDI-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881689#comment-17881689
]
Lin Liu commented on HUDI-7848:
-------------------------------
The root cause is that the numeric literals for the delete record do not carry
the correct data type during write. In the following query, if we change the
1000 to 1000L, the problem goes away. Interestingly, this is not a problem for
the inserted or updated literals.
{code:java}
merge into h1_p t0
using (
select 1 as id, '_delete' as name, 10 as price, 1000 as ts, '2021-05-07' as dt
union
select 2 as id, '_update' as name, 12 as price, 1001 as ts, '2021-05-07' as dt
union
select 6 as id, '_insert' as name, 10 as price, 1000 as ts, '2021-05-08' as dt
) s0
on s0.id = t0.id
when matched and s0.name = '_update'
then update set id = s0.id, name = s0.name, price = s0.price, ts = s0.ts, dt = s0.dt
when matched and s0.name = '_delete' then delete
when not matched and s0.name = '_insert' then insert *; {code}
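The failure mode can be reproduced outside Hudi with a minimal sketch (plain Java, not Hudi code): the base record's ordering value is a Long per the table schema, while the delete record stores the untyped literal 1000 as an Integer, so the raw Comparable comparison blows up.

```java
// Minimal sketch, not Hudi code: why a raw Comparable holding an Integer
// cannot be compared against one holding a Long.
public class OrderingTypeDemo {
    @SuppressWarnings({"rawtypes", "unchecked"})
    public static void main(String[] args) {
        // The table schema declares ts as LONG, so the base record's
        // ordering value is a Long.
        Comparable base = 1000L;
        // The untyped literal 1000 in the MERGE source is an INT, so the
        // delete record's ordering value is stored as an Integer.
        Comparable fromDelete = 1000;
        try {
            // Long.compareTo expects a Long; handing it an Integer throws
            // the ClassCastException seen in the stack trace.
            ((Comparable<Object>) base).compareTo(fromDelete);
            System.out.println("comparison succeeded");
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in HUDI-7848");
        }
    }
}
```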
> Fix the Comparable type of the ordering field value stored in delete record
> ---------------------------------------------------------------------------
>
> Key: HUDI-7848
> URL: https://issues.apache.org/jira/browse/HUDI-7848
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: Lin Liu
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The delete records stored in the log file may carry a different Comparable
> type than the ordering field declared in the table schema, which can cause a
> ClassCastException when comparing the ordering values during event
> time-based merging. To get rid of this issue, the workaround right now is to
> cast the ordering number to long for comparison.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 83.0 failed 1 times, most recent failure: Lost task 0.0 in stage 83.0 (TID 117, fv-az1445-662.i1cbdw21p5ze5pcdzcvy21hj0h.bx.internal.cloudapp.net, executor driver): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
>     at java.lang.Long.compareTo(Long.java:54)
>     at org.apache.hudi.common.table.read.HoodieBaseFileGroupRecordBuffer.merge(HoodieBaseFileGroupRecordBuffer.java:360)
>     at org.apache.hudi.common.table.read.HoodieBaseFileGroupRecordBuffer.hasNextBaseRecord(HoodieBaseFileGroupRecordBuffer.java:441)
>     at org.apache.hudi.common.table.read.HoodieKeyBasedFileGroupRecordBuffer.doHasNext(HoodieKeyBasedFileGroupRecordBuffer.java:139)
>     at org.apache.hudi.common.table.read.HoodieBaseFileGroupRecordBuffer.hasNext(HoodieBaseFileGroupRecordBuffer.java:124)
>     at org.apache.hudi.common.table.read.HoodieFileGroupReader.hasNext(HoodieFileGroupReader.java:208)
>     at org.apache.hudi.common.table.read.HoodieFileGroupReader$HoodieFileGroupReaderIterator.hasNext(HoodieFileGroupReader.java:269)
>     at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat$$anon$1.hasNext(HoodieFileGroupReaderBasedParquetFileFormat.scala:254)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
>     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>     at org.apache.spark.scheduler.Task.run(Task.scala:127)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750) {code}
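The workaround mentioned in the description (casting the ordering number to long for comparison) could look roughly like the sketch below. `OrderingWorkaround` and `compareOrdering` are hypothetical names for illustration, not actual Hudi code.

```java
// Hypothetical helper, not actual Hudi code: normalizes numeric ordering
// values to long before comparing, which avoids the Integer/Long
// ClassCastException during event time-based merging.
public class OrderingWorkaround {
    @SuppressWarnings("rawtypes")
    public static int compareOrdering(Comparable a, Comparable b) {
        if (a instanceof Number && b instanceof Number) {
            // Widen both sides to long so Integer vs. Long compares safely.
            return Long.compare(((Number) a).longValue(), ((Number) b).longValue());
        }
        // Non-numeric ordering fields are compared as-is.
        @SuppressWarnings("unchecked")
        int cmp = a.compareTo(b);
        return cmp;
    }

    public static void main(String[] args) {
        // Integer 1000 vs. Long 1000L now compare equal instead of throwing.
        System.out.println(compareOrdering(1000, 1000L)); // prints 0
    }
}
```

Note this only papers over mixed integral types; the real fix tracked by this issue is to store the ordering value in the delete record with the type declared by the table schema.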
--
This message was sent by Atlassian Jira
(v8.20.10#820010)