Re: [I] [SUPPORT] Data loss due to incorrect selection of log file during compaction [hudi]

via GitHub Tue, 05 Mar 2024 17:17:28 -0800


nsivabalan commented on issue #10803:
URL: https://github.com/apache/hudi/issues/10803#issuecomment-1979906166


   Hey, I wrote a tool that prints some meta info about our log blocks and 
records. Here is the branch: 
   https://github.com/nsivabalan/hudi/tree/printAllVersionsOfRecordTool
   
   Can you help us by running the tool and sharing the output? 
   
   It's a spark-submit command. It logs some info about the log files we are 
interested in. 
   
   Sample command: 
   ```
   ./bin/spark-submit \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
     --class org.apache.hudi.utilities.PrintRecordsTool \
     PATH_TO_BUNDLE/hudi-utilities-bundle_2.12-0.15.0-SNAPSHOT.jar \
     --props /tmp/props.in \
     --base-path /tmp/hudi_trips_mor/ \
     --partition-path asia/india/chennai \
     --file-id c3ef010f-61ae-4aa3-a033-25b278da17c6-0 \
     --base-instant-time 20240302002723362 \
     --print-log-blocks-info
   ```
   
   ```
   cat /tmp/props.in 
   hoodie.datasource.write.recordkey.field=uuid
   hoodie.datasource.write.partitionpath.field=partitionpath
   hoodie.datasource.write.precombine.field=ts
   ```
   
   Make sure you set the right values for the partition path, file ID, and 
base instant time. 
   This should help with our triaging.
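   
   If it helps, here is a minimal sketch that pulls the values you need to 
change into variables, writes the props file, and prints the assembled command 
for review before you run it. The variable names (`BASE_PATH`, `FILE_ID`, 
`PROPS_FILE`, `BUNDLE_JAR`, `CMD`, etc.) are only illustrative and not part of 
the tool; the flags and sample values are the ones from the command above, and 
`PATH_TO_BUNDLE` is still a placeholder you need to fill in. 
   
   ```shell
   #!/usr/bin/env bash
   set -euo pipefail

   # Values to adjust for your table (these are the sample values from above).
   BASE_PATH="/tmp/hudi_trips_mor/"
   PARTITION_PATH="asia/india/chennai"
   FILE_ID="c3ef010f-61ae-4aa3-a033-25b278da17c6-0"
   BASE_INSTANT_TIME="20240302002723362"

   # Illustrative helper variables; PATH_TO_BUNDLE is a placeholder.
   PROPS_FILE="${PROPS_FILE:-/tmp/props.in}"
   BUNDLE_JAR="${BUNDLE_JAR:-PATH_TO_BUNDLE/hudi-utilities-bundle_2.12-0.15.0-SNAPSHOT.jar}"

   # Write the three properties shown above, one per line.
   printf '%s\n' \
     'hoodie.datasource.write.recordkey.field=uuid' \
     'hoodie.datasource.write.partitionpath.field=partitionpath' \
     'hoodie.datasource.write.precombine.field=ts' \
     > "$PROPS_FILE"

   # Assemble the command as an array so each argument stays intact.
   CMD=(
     ./bin/spark-submit
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
     --class org.apache.hudi.utilities.PrintRecordsTool
     "$BUNDLE_JAR"
     --props "$PROPS_FILE"
     --base-path "$BASE_PATH"
     --partition-path "$PARTITION_PATH"
     --file-id "$FILE_ID"
     --base-instant-time "$BASE_INSTANT_TIME"
     --print-log-blocks-info
   )

   # Print the command for review; this sketch does not execute it.
   printf '%q ' "${CMD[@]}"
   echo
   ```
   
   Once the printed command looks right, run it from your Spark home directory 
with `"${CMD[@]}"`. 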


