codope commented on issue #4891:
URL: https://github.com/apache/hudi/issues/4891#issuecomment-1246296509

   We had benchmarked this. With multiple Spark jobs, we cannot avoid `union`. 
The good news is that clustering does not call `rdd.union`; it uses 
`context.union` instead, which is slightly better. The benchmark revealed that 
writing a Parquet column as a whole incurs high overhead. Another thing that 
hogs memory is the bytes-to-Avro conversion. More details are in HUDI-2949. The 
fix entailed changes in 
[parquet-mr](https://github.com/apache/parquet-mr/commit/06bb358bcf8a0855c54f20122a57a88d9fde16c1).
 That fix has been merged, but we have not yet upgraded the Parquet version in 
Hudi. Created HUDI-4840 to track the upgrade. 
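   To illustrate why `context.union` is slightly better than chained `rdd.union`: each pairwise `rdd.union` call wraps the previous result in a new parent, building a deep lineage, while `SparkContext.union` creates a single flat parent over all inputs. Here is a toy model of that difference (plain Python, not Spark itself; the `Node` class and helper names are made up for illustration):

```python
# Toy lineage model contrasting chained pairwise unions with one flat n-ary union.
class Node:
    """A made-up stand-in for an RDD node in the lineage graph."""
    def __init__(self, children):
        self.children = children

    def depth(self):
        # Lineage depth: a leaf has depth 1.
        return 1 + max((c.depth() for c in self.children), default=0)

def chained_union(leaves):
    # Models rdd1.union(rdd2).union(rdd3)...: every call nests the accumulator.
    acc = leaves[0]
    for leaf in leaves[1:]:
        acc = Node([acc, leaf])
    return acc

def flat_union(leaves):
    # Models sparkContext.union([rdd1, rdd2, ...]): one parent over all inputs.
    return Node(list(leaves))

leaves = [Node([]) for _ in range(100)]
print(chained_union(leaves).depth())  # 100 -- lineage grows with input count
print(flat_union(leaves).depth())     # 2 -- lineage stays flat
```

The deep lineage of the chained form costs extra driver-side bookkeeping and closure/graph overhead per level, which is why the flat variant is preferred when unioning many inputs at once.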
