mzheng-plaid commented on issue #12434:
URL: https://github.com/apache/hudi/issues/12434#issuecomment-2529458882
This is problematic even on the read-optimized table (i.e. just the base
parquet files), which is really surprising.
I tried:
1. A read-optimized query on the Hudi table
2. Calling `spark.read.format("parquet").load({s3_path})`
Just reading the parquet files directly was _much_ less memory intensive
and faster (i.e. not spilling to disk) once I tuned
`spark.sql.files.maxPartitionBytes`. I understand the direct parquet read
will pick up multiple versions of the file groups, but it's surprising how
much worse read performance is with Hudi.
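
For reference, the comparison described above can be sketched roughly as
follows. This is a minimal sketch, not the reporter's exact code: the S3
path, app name, and the 64 MB split size are hypothetical, and it assumes a
Spark session with the Hudi bundle on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-vs-parquet-read").getOrCreate()

# Cap each input split at 64 MB so tasks stay small enough to avoid
# spilling (Spark's default is 128 MB; the right value depends on
# executor memory and file sizes -- 64 MB here is an illustrative guess).
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)

hudi_path = "s3://bucket/path/to/hudi_table"  # hypothetical path

# 1. Read-optimized query through Hudi (base parquet files only)
ro_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(hudi_path)
)

# 2. Plain parquet read of the same location; note this also picks up
#    older file-group versions that have not yet been cleaned.
raw_df = spark.read.format("parquet").load(hudi_path)
```

Since the plain parquet read bypasses Hudi's file-group filtering, its row
counts can differ whenever uncleaned older file versions are present; the
comparison is about memory and scan behavior, not result equivalence.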
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]