Hans-Raintree opened a new issue, #9508: URL: https://github.com/apache/hudi/issues/9508
**Describe the problem you faced**

When reading CDC logs in PySpark, the `before`/`after` columns are returned as JSON strings. It is difficult to convert the values back to their original data types: a timestamp, for example, is returned as a number, and converting it back requires something like:

```python
df = df.withColumn(col_name, to_timestamp(from_unixtime(df[col_name] / 1000000)))
```

A decimal is stored as a byte array such as `[0, 0, 0, 0, 0, 0, 26, -27, -78]`, which is even harder to parse back into a number. I see that the data types are stored in the `.cdc` file, but I don't see how I can access them.

**To Reproduce**

Steps to reproduce the behavior:

1. Write a table with:
   ```python
   'hoodie.table.cdc.enabled': 'true',
   'hoodie.table.cdc.supplemental.logging.mode': 'data_before_after'
   ```
2. Read the CDC logs with:
   ```python
   'hoodie.datasource.query.type': 'incremental',
   'hoodie.datasource.read.begin.instanttime': begin_time,
   'hoodie.datasource.query.incremental.format': 'cdc'
   ```

**Expected behavior**

A way to extract columns from the `before`/`after` columns with their correct data types.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.3.2 / 3.4.0
* Hive version : 3.1.3
* Hadoop version : 2.7 / 3.3.3
* Storage (HDFS/S3/GCS..) : S3 / Local
* Running on Docker? (yes/no) : no

**Additional context**

Happens both in AWS EMR 6.12.0 and when running locally.
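Until the schema is exposed directly, the raw values can be decoded by hand. The sketch below is plain Python, assuming the values follow Avro-style encodings: `timestamp-micros` (microseconds since the Unix epoch) for timestamps, and a big-endian two's-complement unscaled integer for decimals. The `scale` argument is a placeholder; the real scale comes from the column's `decimal(precision, scale)` definition in the table schema.

```python
from datetime import datetime, timezone
from decimal import Decimal

def micros_to_timestamp(micros):
    # Avro timestamp-micros: microseconds since the Unix epoch, UTC.
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)

def avro_bytes_to_decimal(byte_values, scale):
    # The JSON shows signed bytes; mask to 0..255 and interpret the whole
    # sequence as a big-endian two's-complement unscaled integer.
    raw = bytes(b & 0xFF for b in byte_values)
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    # Shift the decimal point left by `scale` digits.
    return Decimal(unscaled).scaleb(-scale)

# Byte array from the issue; scale=2 is an assumption for illustration.
print(avro_bytes_to_decimal([0, 0, 0, 0, 0, 0, 26, -27, -78], 2))
print(micros_to_timestamp(1_600_000_000_000_000))
```

These helpers could be wrapped as Spark UDFs and applied after extracting fields from the `before`/`after` JSON, but that still requires knowing each column's type up front, which is the gap this issue describes.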
