codope commented on issue #6936:
URL: https://github.com/apache/hudi/issues/6936#issuecomment-1330186251

   I think the issue is due to a timestamp-type field with a null value. The reason it is not reproducible during the first insert is that the records go through `HoodieCreateHandle`, which does not merge column stats on the first insert. On subsequent upserts, records go through `HoodieAppendHandle`, which attempts to merge column stats but fails for the timestamp type when the value is null. See the script below to reproduce:
   ```
   >>> from pyspark.sql.types import StructType, StructField, TimestampType, StringType
   >>> schema = StructType([
   ...   StructField('TimePeriod', StringType(), True),
   ...   StructField('StartTimeStamp', TimestampType(), True),
   ...   StructField('EndTimeStamp', TimestampType(), True)
   ... ])
   >>> import time
   >>> import datetime
   >>> timestamp = datetime.datetime.strptime('16:00:00:00',"%H:%M:%S:%f")
   >>> timestamp2 = datetime.datetime.strptime('18:59:59:59',"%H:%M:%S:%f")
   >>> columns = ['TimePeriod', 'StartTimeStamp', 'EndTimeStamp']
   >>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, timestamp2 )]
   >>> df2 = spark.createDataFrame(data,schema)
   >>> df2.printSchema()
   root
    |-- TimePeriod: string (nullable = true)
    |-- StartTimeStamp: timestamp (nullable = true)
    |-- EndTimeStamp: timestamp (nullable = true)
   
   >>> hudi_write_options_no_partition = {
   ... "hoodie.table.name": tableName,
   ... "hoodie.datasource.write.recordkey.field": "TimePeriod",
   ... 'hoodie.datasource.write.table.name': tableName,
   ... 'hoodie.datasource.write.precombine.field': 'EndTimeStamp',
   ... 'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
   ... 'hoodie.metadata.enable':'true',
   ... 'hoodie.metadata.index.bloom.filter.enable' : 'true',
   ... 'hoodie.metadata.index.column.stats.enable' : 'true'
   ... }
   >>> df2.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("overwrite").save(basePath)
   22/11/29 07:00:33 WARN config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
   22/11/29 07:00:33 WARN config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
   22/11/29 07:00:34 WARN metadata.HoodieBackedTableMetadata: Metadata table was not found at path file:/tmp/hudi_trips_cow/.hoodie/metadata
   [Stage 7:>                                                          (0 + 1) / 1]
   
   # updating with a non-null timestamp will succeed
   >>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, datetime.datetime.strptime('19:59:59:59',"%H:%M:%S:%f"))]
   >>> updateDF = spark.createDataFrame(data,schema)
   >>> updateDF.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("append").save(basePath)
   
   # updating with a null timestamp will throw the exception
   >>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, None)]
   >>> updateDF = spark.createDataFrame(data,schema)
   >>> updateDF.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("append").save(basePath)
   ```
   
   As a workaround, I would suggest cleaning the data if possible: replace nulls in the dataframe with the oldest Unix timestamp (the epoch) or some other default value suitable to your use case. Ideally, though, this should be handled in `HoodieTableMetadataUtil#collectColumnRangeMetadata`.
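   The null-replacement workaround can be sketched on the Python side before building the dataframe. This is a minimal sketch, not part of Hudi or Spark: the `DEFAULT_TS` value and the `fill_null_timestamps` helper are hypothetical names chosen here for illustration.
   ```
   from datetime import datetime

   # Hypothetical default: the Unix epoch stands in for missing timestamps.
   DEFAULT_TS = datetime(1970, 1, 1)

   def fill_null_timestamps(rows, ts_indices):
       """Replace None in the given timestamp column positions with DEFAULT_TS."""
       cleaned = []
       for row in rows:
           row = list(row)
           for i in ts_indices:
               if row[i] is None:
                   row[i] = DEFAULT_TS
           cleaned.append(tuple(row))
       return cleaned

   # Same shape as the repro data: the null EndTimeStamp gets the default.
   data = [("16:00:00:00 -> 18:59:59:59", datetime(1900, 1, 1, 16), None)]
   cleaned = fill_null_timestamps(data, ts_indices=[1, 2])
   ```
   If the dataframe already exists, note that PySpark's `DataFrame.fillna` does not accept timestamp columns; an equivalent there would be something like `df.withColumn("EndTimeStamp", F.coalesce(F.col("EndTimeStamp"), F.lit("1970-01-01").cast("timestamp")))`.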

