codope commented on issue #4891: URL: https://github.com/apache/hudi/issues/4891#issuecomment-1246296509
We had benchmarked this. With multiple Spark jobs, we cannot avoid `union`. The good news is that clustering does not call `rdd.union`; it uses `context.union`, which is slightly better. The benchmark revealed that writing a parquet column as a whole incurs high overhead. Another thing that hogs memory is the bytes-to-Avro conversion. More details in HUDI-2949.

The fix entailed making changes in [parquet-mr](https://github.com/apache/parquet-mr/commit/06bb358bcf8a0855c54f20122a57a88d9fde16c1). That fix has been merged, but we have not yet upgraded the parquet version in Hudi. Created HUDI-4840 to track the upgrade.
