PhantomHunt commented on issue #6936:
URL: https://github.com/apache/hudi/issues/6936#issuecomment-1332212753
Ok, thanks, Hudi team! Will try the solution and let you know.
On Tue, Nov 29, 2022 at 12:45 PM Sagar Sumit ***@***.***>
wrote:
> I think the issue is due to a timestamp-type field with a null value. The
> reason it is not reproducible during the first insert is that the records go
> through HoodieCreateHandle, which does not merge column stats on the first
> insert. Upon a subsequent upsert, records go through HoodieAppendHandle,
> which attempts to merge column stats but fails for a timestamp-type field
> whose value is null. See the script below to reproduce:
>
> >>> from pyspark.sql.types import StructType, StructField, TimestampType, StringType
> >>> schema = StructType([
> ... StructField('TimePeriod', StringType(), True),
> ... StructField('StartTimeStamp', TimestampType(), True),
> ... StructField('EndTimeStamp', TimestampType(), True)
> ... ])
> >>> import time
> >>> import datetime
> >>> timestamp = datetime.datetime.strptime('16:00:00:00',"%H:%M:%S:%f")
> >>> timestamp2 = datetime.datetime.strptime('18:59:59:59',"%H:%M:%S:%f")
> >>> columns = ['TimePeriod', 'StartTimeStamp', 'EndTimeStamp']
> >>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, timestamp2 )]
> >>> df2 = spark.createDataFrame(data,schema)
> >>> df2.printSchema()
> root
> |-- TimePeriod: string (nullable = true)
> |-- StartTimeStamp: timestamp (nullable = true)
> |-- EndTimeStamp: timestamp (nullable = true)
>
> >>> hudi_write_options_no_partition = {
> ... "hoodie.table.name": tableName,
> ... "hoodie.datasource.write.recordkey.field": "TimePeriod",
> ... 'hoodie.datasource.write.table.name': tableName,
> ... 'hoodie.datasource.write.precombine.field': 'EndTimeStamp',
> ... 'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
> ... 'hoodie.metadata.enable':'true',
> ... 'hoodie.metadata.index.bloom.filter.enable' : 'true',
> ... 'hoodie.metadata.index.column.stats.enable' : 'true'
> ... }
> >>> df2.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("overwrite").save(basePath)
> 22/11/29 07:00:33 WARN config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
> 22/11/29 07:00:33 WARN config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
> 22/11/29 07:00:34 WARN metadata.HoodieBackedTableMetadata: Metadata table was not found at path file:/tmp/hudi_trips_cow/.hoodie/metadata
> [Stage 7:> (0 + 1) / 1]
>
> // updating with a non-null timestamp will succeed
> >>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, datetime.datetime.strptime('19:59:59:59',"%H:%M:%S:%f"))]
> >>> updateDF = spark.createDataFrame(data,schema)
> >>> updateDF.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("append").save(basePath)
>
> // updating with a null timestamp will throw an exception
> >>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, None)]
> >>> updateDF = spark.createDataFrame(data,schema)
> >>> updateDF.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("append").save(basePath)
>
> I would suggest cleaning the data if possible: replace nulls in the
> dataframe with the oldest unix timestamp or some default value suitable
> for the use case. Ideally, this should be handled in
> HoodieTableMetadataUtil#collectColumnRangeMetadata.
>
> —
> <https://github.com/apache/hudi/issues/6936#issuecomment-1330186251>
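The suggested workaround (replacing null timestamps with a default value before writing) can be sketched in plain Python as below. This is a hypothetical pre-processing step over the row tuples fed to `spark.createDataFrame`, not Hudi or Spark API; in a real job the same substitution would be done on the DataFrame itself, e.g. with `coalesce` against a default timestamp literal.

```python
from datetime import datetime, timezone

# Default used for missing timestamps; the comment above suggests the oldest
# unix timestamp, but any value suitable to the use case works.
DEFAULT_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)

def fill_null_timestamps(rows, ts_indices=(1, 2)):
    """Replace None in the timestamp positions of each row tuple with
    DEFAULT_TS before handing the data to Spark. ts_indices matches the
    StartTimeStamp/EndTimeStamp positions in the repro schema."""
    cleaned = []
    for row in rows:
        row = list(row)
        for i in ts_indices:
            if row[i] is None:
                row[i] = DEFAULT_TS
        cleaned.append(tuple(row))
    return cleaned

# The failing row from the repro, with a null EndTimeStamp:
data = [("16:00:00:00 -> 18:59:59:59", datetime(2022, 11, 29, 16, 0), None)]
print(fill_null_timestamps(data))
```

With the nulls replaced, the subsequent `mode("append")` write should no longer hit the column-stats merge failure.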
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]