codope commented on issue #6936:
URL: https://github.com/apache/hudi/issues/6936#issuecomment-1330186251
I think the issue is due to a timestamp type field with a null value. The reason
it is not reproducible during the first insert is that the records go through
`HoodieCreateHandle`, which does not merge column stats on the first insert.
Upon subsequent upserts, records go through `HoodieAppendHandle`, which attempts
to merge column stats but fails for the timestamp type if the value is null. See
the script below to repro:
```
>>> from pyspark.sql.types import StructType, StructField, TimestampType, StringType
>>> schema = StructType([
... StructField('TimePeriod', StringType(), True),
... StructField('StartTimeStamp', TimestampType(), True),
... StructField('EndTimeStamp', TimestampType(), True)
... ])
>>> import time
>>> import datetime
>>> timestamp = datetime.datetime.strptime('16:00:00:00',"%H:%M:%S:%f")
>>> timestamp2 = datetime.datetime.strptime('18:59:59:59',"%H:%M:%S:%f")
>>> columns = ['TimePeriod', 'StartTimeStamp', 'EndTimeStamp']
>>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, timestamp2 )]
>>> df2 = spark.createDataFrame(data,schema)
>>> df2.printSchema()
root
|-- TimePeriod: string (nullable = true)
|-- StartTimeStamp: timestamp (nullable = true)
|-- EndTimeStamp: timestamp (nullable = true)
>>> hudi_write_options_no_partition = {
... "hoodie.table.name": tableName,
... "hoodie.datasource.write.recordkey.field": "TimePeriod",
... 'hoodie.datasource.write.table.name': tableName,
... 'hoodie.datasource.write.precombine.field': 'EndTimeStamp',
... 'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
... 'hoodie.metadata.enable':'true',
... 'hoodie.metadata.index.bloom.filter.enable' : 'true',
... 'hoodie.metadata.index.column.stats.enable' : 'true'
... }
>>> df2.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("overwrite").save(basePath)
22/11/29 07:00:33 WARN config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/11/29 07:00:33 WARN config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
22/11/29 07:00:34 WARN metadata.HoodieBackedTableMetadata: Metadata table was not found at path file:/tmp/hudi_trips_cow/.hoodie/metadata
# update with a non-null timestamp succeeds
>>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, datetime.datetime.strptime('19:59:59:59',"%H:%M:%S:%f"))]
>>> updateDF = spark.createDataFrame(data,schema)
>>> updateDF.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("append").save(basePath)
# update with a null timestamp throws the exception
>>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, None)]
>>> updateDF = spark.createDataFrame(data,schema)
>>> updateDF.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("append").save(basePath)
```
I would suggest cleaning the data if possible: replace nulls in the
dataframe with the oldest unix timestamp or some default value suitable to the
usecase. Ideally, though, this should be handled in
`HoodieTableMetadataUtil#collectColumnRangeMetadata`.
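As a rough sketch of that workaround (plain Python here for brevity; the helper name `fill_null_timestamps` and the epoch default are my own, not part of Hudi or the repro above, so pick a sentinel that fits your usecase), you could sanitize the row tuples before calling `spark.createDataFrame`:

```python
from datetime import datetime

# Hypothetical sentinel: Unix epoch as the "oldest" default timestamp.
DEFAULT_TS = datetime(1970, 1, 1)

def fill_null_timestamps(rows, ts_indices, default=DEFAULT_TS):
    """Replace None at the given timestamp column positions with a default."""
    return [
        tuple(default if (i in ts_indices and v is None) else v
              for i, v in enumerate(row))
        for row in rows
    ]

# Columns 1 and 2 are StartTimeStamp and EndTimeStamp in the schema above.
data = [("16:00:00:00 -> 18:59:59:59",
         datetime.strptime('16:00:00:00', "%H:%M:%S:%f"), None)]
clean = fill_null_timestamps(data, ts_indices={1, 2})
# clean[0][2] is now DEFAULT_TS instead of None, so column stats
# merging no longer sees a null timestamp value.
```

Passing `clean` instead of `data` to `spark.createDataFrame(clean, schema)` keeps the schema unchanged while avoiding null timestamps in the upsert.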
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]