Hans-Raintree opened a new issue, #9508:
URL: https://github.com/apache/hudi/issues/9508

   Describe the problem you faced
   
   When reading CDC logs in PySpark, the before/after columns are returned as 
JSON strings. It's difficult to convert the fields back to their correct data 
types; e.g. a timestamp is returned as a number, and to convert it back to a 
timestamp you have to do something like this:
   
   df = df.withColumn(col_name, to_timestamp(from_unixtime(df[col_name] / 1000000)))
   
   A decimal value is stored as a byte array like [0, 0, 0, 0, 0, 0, 26, -27, -78], 
which is even harder to parse back into a number.
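   For reference, if that byte array is Avro's decimal encoding (a big-endian
two's-complement unscaled value, which is what the logical type uses), it can be
decoded by hand once the scale is known from the table schema. A minimal sketch;
the scale of 2 below is only an illustration, not taken from the report:

```python
from decimal import Decimal

def decode_avro_decimal(byte_vals, scale):
    # Spark surfaces the bytes as a list of signed ints; mask back to
    # unsigned bytes, then read the big-endian two's-complement
    # unscaled value and apply the schema's scale.
    raw = bytes(b & 0xFF for b in byte_vals)
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# The byte array from the report, with an assumed scale of 2:
value = decode_avro_decimal([0, 0, 0, 0, 0, 0, 26, -27, -78], 2)
```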
   
   I see that the data types are stored in the .cdc file, but I don't see how I 
can access them.
   
   To Reproduce
   
   Steps to reproduce the behavior:
   
   Write a table with:
   'hoodie.table.cdc.enabled': 'true',
   'hoodie.table.cdc.supplemental.logging.mode': 'data_before_after'
   Read the cdc logs with:
   'hoodie.datasource.query.type': 'incremental',
   'hoodie.datasource.read.begin.instanttime': begin_time,
   'hoodie.datasource.query.incremental.format': 'cdc'
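   The steps above can be sketched as a PySpark snippet; the table name,
`begin_time` value, and `base_path` are hypothetical placeholders, and the
option keys are the ones listed in the report:

```python
# Options for writing the table with CDC enabled:
write_opts = {
    "hoodie.table.name": "cdc_demo",  # assumed name for illustration
    "hoodie.table.cdc.enabled": "true",
    "hoodie.table.cdc.supplemental.logging.mode": "data_before_after",
}

begin_time = "20230101000000"  # placeholder instant time

# Options for reading the CDC log incrementally:
read_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": begin_time,
    "hoodie.datasource.query.incremental.format": "cdc",
}

# df.write.format("hudi").options(**write_opts).mode("append").save(base_path)
# cdc_df = spark.read.format("hudi").options(**read_opts).load(base_path)
```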
   
   Expected behavior
   
   A way to extract the fields from the before/after columns with their original 
data types.
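   As a workaround until the types are surfaced, the JSON payload can be
converted by hand. A minimal pure-Python sketch; the field names and values
are illustrative, assuming timestamps come back as epoch microseconds as
described above:

```python
import json
from datetime import datetime, timezone

# Hypothetical 'after' payload as returned in a cdc row:
after_json = '{"id": 1, "ts": 1692345600000000}'

record = json.loads(after_json)

# Timestamps arrive as epoch microseconds; convert manually:
ts = datetime.fromtimestamp(record["ts"] / 1_000_000, tz=timezone.utc)
```

   In PySpark the same idea could presumably be applied with `from_json` and a
schema whose timestamp fields are typed as long, followed by per-column casts.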
   
   Environment Description
   
   Hudi version : 0.13.1
   
   Spark version : 3.3.2 / 3.4.0
   
   Hive version : 3.1.3
   
   Hadoop version : 2.7 / 3.3.3
   
   Storage (HDFS/S3/GCS..) : S3 / Local
   
   Running on Docker? (yes/no) : no
   
   Additional context
   
   Happens both in AWS EMR 6.12.0 and when running locally.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
