mzheng-plaid commented on issue #12434:
URL: https://github.com/apache/hudi/issues/12434#issuecomment-2529458882
This is problematic even on the read-optimized table (i.e. just the base
parquet files), which is really surprising.
I tried:
1. A read-optimized query on the Hudi table
2. Calling `spark.read.format("parquet").load({s3_path})`
Just reading the parquet files directly was _much_ less memory intensive
and faster (i.e. not spilling to disk) once I tuned
`spark.sql.files.maxPartitionBytes`. I understand the direct parquet read
will pick up multiple versions of the file groups, but it's surprising how
much worse read performance is with Hudi.
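
For reference, the comparison described above can be sketched roughly as
follows. This is a minimal sketch, not the reporter's exact code: the S3
path, app name, and the 64 MB split size are hypothetical, and it assumes a
Spark session with the Hudi bundle on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-vs-parquet-read").getOrCreate()

# Cap each input split at 64 MB so tasks stay small enough to avoid
# spilling (Spark's default is 128 MB; the right value depends on
# executor memory and file sizes -- 64 MB here is an illustrative guess).
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)

hudi_path = "s3://bucket/path/to/hudi_table"  # hypothetical path

# 1. Read-optimized query through Hudi (base parquet files only)
ro_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(hudi_path)
)

# 2. Plain parquet read of the same location; note this also picks up
#    older file-group versions that have not yet been cleaned.
raw_df = spark.read.format("parquet").load(hudi_path)
```

Since the plain parquet read bypasses Hudi's file-group filtering, its row
counts can differ whenever uncleaned older file versions are present; the
comparison is about memory and scan behavior, not result equivalence.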
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]