CrystalCat opened a new issue, #9865: URL: https://github.com/apache/hudi/issues/9865
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

`MERGE INTO` on a COW table with a `GLOBAL_BLOOM` index fails with `java.lang.ArrayIndexOutOfBoundsException` during Avro deserialization (`HoodieAvroUtils.bytesToAvro` called from `ExpressionPayload.getInsertValue`). The failure occurs in step 6 below, when merging a table that contains a subset of the previously merged data.

**To Reproduce**

Steps to reproduce the behavior:

1. Download `spark-3.4.1-bin-hadoop3.tgz`, unpack it, and fetch the dataset into the Spark root dir:

   ```
   tar -xf spark-3.4.1-bin-hadoop3.tgz
   cd spark-3.4.1-bin-hadoop3
   mkdir dataset
   cd dataset
   # download dataset
   wget "https://drive.usercontent.google.com/download?id=1cSQwS4TNwB_VaEtYcJDlggu29GOetea5&export=download&authuser=0&confirm=t&uuid=eb9431f1-5bbc-438a-9c2c-a04c890b3bea&at=APZUnTUFRG2fpvb1QViksPKtDaP1:1697298863825" -O events.zip
   unzip events.zip
   ```

2. Start `spark-shell` and save the CSV as an ORC table:

   ```
   cd spark-3.4.1-bin-hadoop3
   ./bin/spark-shell --driver-memory 8g --master local[24]
   ```

   ```scala
   val events = spark.read.format("csv")
     .option("sep", ",")
     .option("inferSchema", "true")
     .option("header", "true")
     .load("./dataset/events.csv")
   events.write.format("orc").mode("overwrite").saveAsTable("events")
   ```

3. Start `spark-sql` with Hudi:

   ```
   export SPARK_VERSION=3.4
   ./bin/spark-sql --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.14.0 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
     --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
     --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar' \
     --driver-memory 8g --master local[24]
   ```

4. Create the Hudi table and load it from `events`:

   ```sql
   DROP TABLE IF EXISTS hudi_events;
   CREATE TABLE default.hudi_events (
     timestamp BIGINT,
     visitorid INT,
     event STRING,
     itemid INT,
     transactionid INT
   ) USING HUDI
   PARTITIONED BY (event)
   TBLPROPERTIES (
     primaryKey = 'visitorid',
     preCombineField = 'timestamp',
     hoodie.index.type = 'GLOBAL_BLOOM',
     type = 'cow'
   );

   INSERT INTO hudi_events SELECT * FROM events;
   ```

5. Merge all data from `events` into `hudi_events`:

   ```sql
   MERGE INTO hudi_events AS target
   USING events AS source
   ON target.timestamp = source.timestamp
   WHEN MATCHED THEN UPDATE SET *
   WHEN NOT MATCHED THEN INSERT *;
   ```

6. Create a table with a subset of `events`, then merge it into `hudi_events`; this merge fails:

   ```sql
   CREATE TABLE hudi_800000 AS SELECT * FROM events LIMIT 800000;

   MERGE INTO hudi_events AS target
   USING hudi_800000 AS source
   ON target.timestamp = source.timestamp
   WHEN MATCHED THEN UPDATE SET *
   WHEN NOT MATCHED THEN INSERT *;
   ```

**Expected behavior**

The merge operation in step 6 completes successfully.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.4.1
* Hive version : N/A
* Hadoop version : N/A
* Storage (HDFS/S3/GCS..) : local filesystem
* Running on Docker? (yes/no) : no

**Stacktrace**

```
spark-sql (default)> merge into hudi_events as target
                   > using hudi_800000 as source
                   > on target.timestamp = source.timestamp
                   > when matched then update set *
                   > when not matched then insert *
                   > ;
23/10/15 10:30:59 WARN BlockManager: Putting block rdd_302_3 failed due to exception java.lang.ArrayIndexOutOfBoundsException: 25.
23/10/15 10:30:59 WARN BlockManager: Block rdd_302_3 could not be removed as it was not found on disk or in memory
23/10/15 10:30:59 WARN BlockManager: Putting block rdd_302_1 failed due to exception java.lang.ArrayIndexOutOfBoundsException: 25.
23/10/15 10:30:59 WARN BlockManager: Block rdd_302_1 could not be removed as it was not found on disk or in memory
23/10/15 10:30:59 WARN BlockManager: Putting block rdd_302_2 failed due to exception java.lang.ArrayIndexOutOfBoundsException: 25.
23/10/15 10:30:59 ERROR Executor: Exception in task 3.0 in stage 130.0 (TID 1470)
java.lang.ArrayIndexOutOfBoundsException: 25
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:460)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:283)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
	at org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:219)
	at org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:209)
	at org.apache.spark.sql.hudi.command.payload.ExpressionPayload.getInsertValue(ExpressionPayload.scala:237)
	at org.apache.hudi.common.model.HoodieAvroRecord.toIndexedRecord(HoodieAvroRecord.java:211)
	at org.apache.hudi.common.model.HoodieAvroRecordMerger.combineAndGetUpdateValue(HoodieAvroRecordMerger.java:57)
	at org.apache.hudi.common.model.HoodieAvroRecordMerger.merge(HoodieAvroRecordMerger.java:47)
	at org.apache.hudi.index.HoodieIndexUtils.mergeIncomingWithExistingRecord(HoodieIndexUtils.java:262)
	at ...
```
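For context on what the stack trace points at, here is a minimal, self-contained sketch of the Avro decode failure mode. This is **not** Hudi or Avro code, and the class name `UnionIndexMismatch` is made up for illustration: Avro's binary format encodes a union value as a zigzag-varint branch index followed by the branch payload. When `bytesToAvro` decodes a payload with a schema that does not match the one the bytes were actually written with (a schema-mismatch cause is my assumption, not something confirmed by the stack trace), the "branch index" it reads can be garbage like `25`, and indexing the branch table throws exactly this `ArrayIndexOutOfBoundsException` in `Symbol$Alternative.getSymbol`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Hypothetical simulation of the decode path, not actual Avro/Hudi code.
public class UnionIndexMismatch {

    // Avro binary encoding writes ints as zigzag varints.
    static void writeZigZagVarInt(ByteArrayOutputStream out, int n) {
        int z = (n << 1) ^ (n >> 31);          // zigzag encode
        while ((z & ~0x7F) != 0) {
            out.write((z & 0x7F) | 0x80);
            z >>>= 7;
        }
        out.write(z);
    }

    static int readZigZagVarInt(ByteArrayInputStream in) {
        int shift = 0, z = 0, b;
        do {
            b = in.read();
            z |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);           // zigzag decode
    }

    public static void main(String[] args) {
        // Writer side: the stream happens to contain 25 where the reader
        // expects a union branch index (e.g. because writer and reader
        // schemas disagree and the bytes are interpreted at the wrong offset).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeZigZagVarInt(bytes, 25);

        // Reader side: the decoder believes the field is a 2-branch union
        // ["null", "int"], so its branch table has only 2 entries.
        String[] branchTable = {"null", "int"};
        ByteArrayInputStream in = new ByteArrayInputStream(bytes.toByteArray());
        int branch = readZigZagVarInt(in);
        System.out.println("decoded union index: " + branch);

        try {
            // Same kind of lookup that Symbol$Alternative.getSymbol performs.
            System.out.println("branch: " + branchTable[branch]);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("ArrayIndexOutOfBoundsException on branch lookup");
        }
    }
}
```

If that reading is right, it may be worth comparing the exact schemas of `hudi_events` and `hudi_800000` (e.g. via `DESCRIBE`), since `hudi_800000` is created with `CREATE TABLE AS SELECT` and could differ in nullability or column types from the explicitly declared target table.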
