luyongbiao opened a new issue, #8416:
URL: https://github.com/apache/hudi/issues/8416

   **Describe the problem you faced**
   
   I have a MOR table consisting of 10 base files and 4 log files, with 377 fields and 1403708 records.
   My application reads the table and writes the data to another COW table through the Spark SQL statement "select * from my_mor_table".
   I found that my COW table has only 1403704 records; 4 records are missing.

   By debugging, I found that the data is lost after the createRdd method in HoodieSparkUtils.scala, which converts Row to GenericRecord.
   The strange thing is that when I use another Spark SQL statement such as "select field1,field2,...,field100 from my_mor_table", where the number of selected fields is less than or equal to 100, all 1403708 records of my MOR table are returned and no data is lost. When the number of fields in the select statement exceeds 100, data is lost.
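   To narrow down the exact threshold, a diagnostic sketch like the following could be used (hypothetical helper code, not part of my application; it assumes a live SparkSession named `spark` and the table registered as `my_mor_table`):

   ```java
   import java.util.Arrays;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;

   // Hypothetical diagnostic sketch: count records while projecting an
   // increasing number of columns to find the point where rows disappear.
   // Assumes a live SparkSession `spark` and a registered table
   // `my_mor_table`; neither is defined in this snippet.
   Dataset<Row> full = spark.table("my_mor_table");
   String[] cols = full.columns();
   for (int n = 95; n <= 105; n++) {
       String projection = String.join(", ", Arrays.copyOfRange(cols, 0, n));
       long count = spark.sql("select " + projection + " from my_mor_table").count();
       System.out.println(n + " columns -> " + count + " records");
   }
   ```

   In my runs, the count stays correct up to 100 columns and drops once the projection is wider.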
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a COW table (table1) with more than 100 fields.
   2. Insert 10 mock records with the INSERT action.
   3. Change table1's table type to MOR.
   4. Update 2 records of table1 with the UPSERT action.
   5. Create a COW table (table2).
   6. Dataset<Row> dataset = spark.sql("select * from table1");
      dataset.count();
      result -> 10
   7. dataset.write()
             .format("org.apache.hudi")
             .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "field1")
             .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "")
             .option(HoodieWriteConfig.TABLE_NAME, table2)
             .options(xxx)
             .mode(SaveMode.Append)
             .save(table2Path);
      Dataset<Row> dataset2 = spark.sql("select * from table2");
      dataset2.count();
      result -> 9 (1 record lost)
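   To identify which record was dropped, one option is a left anti join between the two tables (a sketch, assuming `field1` is the record key, per the write options above, and a live SparkSession `spark`):

   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;

   // Hypothetical sketch: list keys present in table1 but missing from table2.
   // Assumes a live SparkSession `spark` and that `field1` is the record key
   // used in the write options above.
   Dataset<Row> t1 = spark.table("table1");
   Dataset<Row> t2 = spark.table("table2");
   Dataset<Row> missing =
       t1.join(t2, t1.col("field1").equalTo(t2.col("field1")), "left_anti");
   missing.select("field1").show();
   ```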
   
   
   **Expected behavior**
   
   No data loss: table2 should contain the same number of records as table1 (10 in the reproduction above; 1403708 in my production table), regardless of how many fields the select statement projects.
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.1.1
   
   * Hive version : 1.2.1
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
