Shawn Chang created HUDI-9119:
---------------------------------
Summary: Hudi 1.0.1 cannot write MOR tables
Key: HUDI-9119
URL: https://issues.apache.org/jira/browse/HUDI-9119
Project: Apache Hudi
Issue Type: Bug
Reporter: Shawn Chang
When testing Hudi 1.0.1 on EMR 7.8, I see failures like the one below:
{code:java}
Caused by: org.apache.hudi.exception.HoodieException: Exception when reading log file
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scanInternalV1(AbstractHoodieLogRecordScanner.java:388)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scanInternal(AbstractHoodieLogRecordScanner.java:250)
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.scanByKeyPrefixes(HoodieMergedLogRecordScanner.java:196)
	at org.apache.hudi.metadata.HoodieMetadataLogRecordReader.getRecordsByKeyPrefixes(HoodieMetadataLogRecordReader.java:87)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.readLogRecords(HoodieBackedTableMetadata.java:379)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeyPrefixes$7539c171$1(HoodieBackedTableMetadata.java:234)
	at org.apache.hudi.common.function.FunctionWrapper.lambda$throwingMapWrapper$0(FunctionWrapper.java:38)
	... 39 more
Caused by: java.lang.ClassCastException: class org.apache.avro.generic.GenericData$Record cannot be cast to class org.apache.hudi.avro.model.HoodieDeleteRecordList (org.apache.avro.generic.GenericData$Record is in unnamed module of loader 'app'; org.apache.hudi.avro.model.HoodieDeleteRecordList is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @5b2ea718)
	at org.apache.hudi.common.table.log.block.HoodieDeleteBlock.deserialize(HoodieDeleteBlock.java:169)
	at org.apache.hudi.common.table.log.block.HoodieDeleteBlock.getRecordsToDelete(HoodieDeleteBlock.java:124)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processQueuedBlocksForInstant(AbstractHoodieLogRecordScanner.java:678)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scanInternalV1(AbstractHoodieLogRecordScanner.java:378)
	... 45 more
{code}
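The ClassCastException message names two different classloaders (the 'app' loader vs Spark's MutableURLClassLoader), the classic symptom of a split-classloader deployment: a class defined by one loader is never castable to a same-named class defined by another, and Avro's specific-record resolution falls back to GenericData$Record when the target class is not visible to the loader doing the read. The standalone sketch below (illustrative only, not Hudi's actual code path) demonstrates that cross-loader cast mechanism:
{code:java}
import java.io.InputStream;

public class LoaderCastDemo {
    // Trivial class we will define twice through different loaders.
    public static class Payload {}

    // A loader that defines Payload from its own copy of the bytecode
    // instead of delegating to the parent (mimicking an isolated child loader).
    static class IsolatingLoader extends ClassLoader {
        private final byte[] bytes;
        IsolatingLoader(byte[] bytes) {
            super(LoaderCastDemo.class.getClassLoader());
            this.bytes = bytes;
        }
        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (name.equals(Payload.class.getName())) {
                return defineClass(name, bytes, 0, bytes.length);
            }
            return super.loadClass(name, resolve);
        }
    }

    // Returns true iff casting across loaders throws ClassCastException.
    public static boolean castAcrossLoadersFails() throws Exception {
        String resource = Payload.class.getName().replace('.', '/') + ".class";
        byte[] bytes;
        try (InputStream in = LoaderCastDemo.class.getClassLoader().getResourceAsStream(resource)) {
            bytes = in.readAllBytes();
        }
        Class<?> redefined = new IsolatingLoader(bytes).loadClass(Payload.class.getName());
        Object instance = redefined.getDeclaredConstructor().newInstance();
        try {
            Payload p = (Payload) instance; // same name, different defining loader
            return false;
        } catch (ClassCastException expected) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("cast failed: " + castAcrossLoadersFails());
    }
}
{code}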
Reproduction steps:
# Start an EMR 7.8 cluster
# Start spark-shell with the command below:
{code:bash}
spark-shell --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
{code}
# Run the script below:
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode

val df1 = Seq(
  (100, "2015-01-01", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "2015-01-01", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (102, "2015-01-01", "event_name_345", "2015-01-01T13:51:40.417052Z", "type3"),
  (103, "2015-01-01", "event_name_234", "2015-01-01T13:51:40.519832Z", "type4"),
  (104, "2015-01-01", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (105, "2015-01-01", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2"),
  (106, "2015-01-01", "event_name_890", "2015-01-01T13:51:44.735360Z", "type3"),
  (107, "2015-01-01", "event_name_944", "2015-01-01T13:51:45.019544Z", "type4"),
  (108, "2015-01-01", "event_name_456", "2015-01-01T13:51:45.208007Z", "type1"),
  (109, "2015-01-01", "event_name_567", "2015-01-01T13:51:45.369689Z", "type2"),
  (110, "2015-01-01", "event_name_789", "2015-01-01T12:15:05.664947Z", "type3"),
  (111, "2015-01-01", "event_name_322", "2015-01-01T13:51:47.388239Z", "type4")
).toDF("event_id", "event_date", "event_name", "event_ts", "event_type")

val r = scala.util.Random
val num = r.nextInt(99999)
var tableName = "yxchang_hudi_cow_simple_14_" + num
var tablePath = "s3://<yourbucket>/hudi10/" + tableName + "/"

df1.write.format("hudi")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.operation", "insert") // use insert
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "event_id,event_date")
  .option("hoodie.datasource.write.partitionpath.field", "event_type")
  .option("hoodie.datasource.write.precombine.field", "event_ts")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.meta.sync.enable", "true")
  .option("hoodie.datasource.hive_sync.mode", "hms")
  .option("hoodie.datasource.hive_sync.table", tableName)
  .option("hoodie.datasource.hive_sync.partition_fields", "event_type")
  .option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)
{code}
The script above uses a COW table with the metadata table (MDT) enabled, which also reproduces the issue.
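Since the exception is thrown while reading the metadata table's delete log blocks, one triage step worth trying (a sketch I have not run, not a fix) is the same write with the MDT disabled, to confirm the failure is isolated to the MDT log-read path. Reusing `df1`, `tableName`, and `tablePath` from the script above:
{code:java}
// Untested triage sketch: identical write, but with the metadata table
// disabled; all other options match the repro script.
df1.write.format("hudi")
  .option("hoodie.metadata.enable", "false") // only change vs. the repro
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "event_id,event_date")
  .option("hoodie.datasource.write.partitionpath.field", "event_type")
  .option("hoodie.datasource.write.precombine.field", "event_ts")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .mode(SaveMode.Append)
  .save(tablePath)
{code}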
Additional context:
# This exception looks the same as [https://github.com/apache/hudi/issues/10609]
# The same script runs without issue on OSS Hudi 1.0.0
--
This message was sent by Atlassian Jira
(v8.20.10#820010)