CrystalCat opened a new issue, #9865:
URL: https://github.com/apache/hudi/issues/9865

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   Yes
   - Join the mailing list to engage in conversations and get faster support at 
[email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   When I MERGE INTO a Hudi COW table (GLOBAL_BLOOM index) using a subset of the source data, the query fails with `java.lang.ArrayIndexOutOfBoundsException: 25` thrown from Avro's `ResolvingDecoder` while deserializing records in `ExpressionPayload.getInsertValue` (full stacktrace below).
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Download spark-3.4.1-bin-hadoop3.tgz and unpack it; then, in the Spark root dir:
   ```
   tar -xf spark-3.4.1-bin-hadoop3.tgz
   cd spark-3.4.1-bin-hadoop3
   mkdir dataset
   cd dataset
   
   # download dataset
   wget "https://drive.usercontent.google.com/download?id=1cSQwS4TNwB_VaEtYcJDlggu29GOetea5&export=download&authuser=0&confirm=t&uuid=eb9431f1-5bbc-438a-9c2c-a04c890b3bea&at=APZUnTUFRG2fpvb1QViksPKtDaP1:1697298863825" -O events.zip
   unzip events.zip
   ```
   2. Start spark-shell and save the CSV as an ORC table:
   ```
   cd spark-3.4.1-bin-hadoop3
   ./bin/spark-shell --driver-memory 8g --master local[24]
   val events = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("./dataset/events.csv")
   events.write.format("orc").mode("overwrite").saveAsTable("events")
   ```
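
   For context on the `inferSchema` option used above: Spark samples the CSV values and assigns each column the narrowest type that fits all of them. A rough pure-Python sketch of that idea (an illustration, not Spark's actual inference code):

```python
# Toy version of per-column type inference: try the narrowest type
# first and fall back to wider ones on parse failure.
def infer_type(values):
    def all_castable(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all_castable(int):
        return "int"
    if all_castable(float):
        return "double"
    return "string"

# e.g. two columns from a file shaped like events.csv
timestamp_type = infer_type(["1433221332", "1433224214"])  # integral column
event_type = infer_type(["view", "addtocart"])             # falls back to string
```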
   3. Start spark-sql with Hudi:
   ```
   export SPARK_VERSION=3.4
   ./bin/spark-sql --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.14.0 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
     --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
     --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar' \
     --driver-memory 8g --master local[24]
   ```
   4. Create the Hudi table:
   ```sql
   drop table if exists hudi_events;
   CREATE TABLE default.hudi_events (
     timestamp BIGINT,
     visitorid INT,
     event STRING,
     itemid INT,
     transactionid INT
   ) USING HUDI
   PARTITIONED BY (event)
   TBLPROPERTIES (
     primaryKey = 'visitorid',
     preCombineField = 'timestamp',
     hoodie.index.type = 'GLOBAL_BLOOM',
     type = 'cow'
   );
   insert into hudi_events select * from events;
   
   ```
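
   As a side note on the table properties above: `preCombineField = 'timestamp'` means that when two records share the same primary key, the one with the larger `timestamp` wins. A minimal pure-Python sketch of that semantics (a hypothetical helper, not Hudi code):

```python
# Sketch (not Hudi code): deduplicate a batch by primary key, keeping
# the record with the highest preCombine value.
def pre_combine(records, key="visitorid", order="timestamp"):
    winners = {}
    for rec in records:
        k = rec[key]
        if k not in winners or rec[order] > winners[k][order]:
            winners[k] = rec
    return list(winners.values())

batch = [
    {"visitorid": 1, "timestamp": 100, "event": "view"},
    {"visitorid": 1, "timestamp": 200, "event": "addtocart"},
    {"visitorid": 2, "timestamp": 150, "event": "view"},
]
deduped = pre_combine(batch)  # visitorid 1 keeps the timestamp=200 record
```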
   5. Merge into the hudi_events table using all data from the events table:
   ```sql
   merge into hudi_events as target
   using events as source
   on target.timestamp = source.timestamp
   when matched then update set *
   when not matched then insert *
   ;
   
   ```
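   The MERGE INTO above matches on `timestamp` (note: not on the declared primary key `visitorid`); matched target rows are overwritten by the source row, and unmatched source rows are inserted. A toy pure-Python model of that behavior (illustration only, not Hudi/Spark internals):

```python
# Toy model of MERGE INTO ... WHEN MATCHED THEN UPDATE SET * /
# WHEN NOT MATCHED THEN INSERT *: source rows overwrite target rows
# that share the join key; the rest are appended.
def merge_into(target, source, on="timestamp"):
    merged = {row[on]: row for row in target}
    for row in source:
        merged[row[on]] = row
    return list(merged.values())

target_rows = [{"timestamp": 1, "event": "view"}]
source_rows = [{"timestamp": 1, "event": "addtocart"},
               {"timestamp": 2, "event": "view"}]
merged_rows = merge_into(target_rows, source_rows)
```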
   6. Create a table with a subset of events, then merge it into hudi_events:
   ```sql
   create table hudi_800000 as select * from events limit 800000;
   
   merge into hudi_events as target
   using hudi_800000 as source
   on target.timestamp = source.timestamp
   when matched then update set *
   when not matched then insert *
   ;
   ```
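
   Since the table uses `hoodie.index.type = 'GLOBAL_BLOOM'`, each merge probes Bloom filters to narrow down which files may contain the incoming record keys. A toy Bloom filter in pure Python showing the core property that makes this pruning safe (no false negatives, possible false positives); this is an illustration, not Hudi's implementation:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions per key over a fixed bit array."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means "definitely absent" -> the file can be skipped;
        # True means "maybe present" -> the file must be read.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
for visitorid in [1, 2, 3]:
    bf.add(visitorid)

# No false negatives: every added key tests positive.
assert all(bf.might_contain(v) for v in [1, 2, 3])
```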
   
   **Expected behavior**
   The merge operation completes successfully.
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 3.4.1
   
   * Hive version : N/A
   
   * Hadoop version : N/A
   
   * Storage (HDFS/S3/GCS..) : local filesystem
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   The failure occurs on the merge of the 800,000-row subset in step 6; see the stacktrace below.
   
   **Stacktrace**
   
   ```
   spark-sql (default)>
                      > merge into hudi_events as target
                      > using hudi_800000 as source
                      > on target.timestamp = source.timestamp
                      > when matched then update set *
                      > when not matched then insert *
                      > ;
   23/10/15 10:30:59 WARN BlockManager: Putting block rdd_302_3 failed due to exception java.lang.ArrayIndexOutOfBoundsException: 25.
   23/10/15 10:30:59 WARN BlockManager: Block rdd_302_3 could not be removed as it was not found on disk or in memory
   23/10/15 10:30:59 WARN BlockManager: Putting block rdd_302_1 failed due to exception java.lang.ArrayIndexOutOfBoundsException: 25.
   23/10/15 10:30:59 WARN BlockManager: Block rdd_302_1 could not be removed as it was not found on disk or in memory
   23/10/15 10:30:59 WARN BlockManager: Putting block rdd_302_2 failed due to exception java.lang.ArrayIndexOutOfBoundsException: 25.
   23/10/15 10:30:59 ERROR Executor: Exception in task 3.0 in stage 130.0 (TID 1470)
   java.lang.ArrayIndexOutOfBoundsException: 25
        at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:460)
        at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:283)
        at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
        at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
        at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
        at org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:219)
        at org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:209)
        at org.apache.spark.sql.hudi.command.payload.ExpressionPayload.getInsertValue(ExpressionPayload.scala:237)
        at org.apache.hudi.common.model.HoodieAvroRecord.toIndexedRecord(HoodieAvroRecord.java:211)
        at org.apache.hudi.common.model.HoodieAvroRecordMerger.combineAndGetUpdateValue(HoodieAvroRecordMerger.java:57)
        at org.apache.hudi.common.model.HoodieAvroRecordMerger.merge(HoodieAvroRecordMerger.java:47)
        at org.apache.hudi.index.HoodieIndexUtils.mergeIncomingWithExistingRecord(HoodieIndexUtils.java:262)
        at ...
   ```
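
   For what it's worth, an `ArrayIndexOutOfBoundsException` out of Avro's `ResolvingDecoder` typically indicates a writer/reader schema mismatch: the serialized bytes carry an index that is valid for the schema the record was written with, and the decoder looks it up in a narrower reader schema. A pure-Python sketch of that failure shape (an analogy for the error class, not Avro's internals):

```python
# The bytes encode positions from the writer's schema; decoding them
# against a reader schema with fewer entries overruns the table.
writer_symbols = [f"branch{i}" for i in range(26)]  # 26 entries at write time
reader_symbols = [f"branch{i}" for i in range(5)]   # narrower schema at read time

encoded_index = 25  # stored in the bytes; valid only for the writer's table
error = None
try:
    resolved = reader_symbols[encoded_index]
except IndexError as exc:  # Java-side analogue: ArrayIndexOutOfBoundsException: 25
    error = exc
```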
   
   

