zhangyue19921010 commented on PR #6600:
URL: https://github.com/apache/hudi/pull/6600#issuecomment-1318166797

   > @zhangyue19921010 thanks for taking this up! Some high level thoughts:
   > 
   > * **hudi commit metadata vs hudi metrics**: if users enable the diagnostic
   >   reporter, should we have a config to include the metrics reporter's data? The
   >   metrics system is good at showing trends, but it is hard to cross-check
   >   against commit metadata. So regardless of whether the metrics reporter is
   >   enabled, the diagnostic reporter can collect metrics and save them to the
   >   report dir, just like a csv/json metrics reporter. We can also refine what
   >   goes to metrics and what goes to commit metadata, to keep the
   >   responsibilities clear and the reporting data organized.
   > * **consolidate with error table**:
   >   [RFC-20](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+20+%3A+handle+failed+records)
   >   is a long-pending feature that also aims to assist investigation. The
   >   diagnostic reporter should be aware of error table settings and zip the
   >   error table if so configured. Size could be a concern, so a config can be
   >   provided to zip the whole table, sample records, or skip the error table
   >   completely. It also requires some config to allow masking any fields.
   >   Taking it a step further, we can also make the error table one of the
   >   diagnostic reporting features. They have similar storage structures: each
   >   can be local to the hudi table or global to the whole platform.
   > * **work with metadata table**: you've already mentioned collecting stats
   >   by listing the file system. The diagnostic reporter should also be aware of
   >   the presence of the metadata table and zip the table or extract relevant
   >   data, falling back to file system listing if it is not present.
   
   Thanks @xushiyan for your advice!
   I will take a closer look and expand this RFC asap!

