neerajpadarthi opened a new issue, #11583:
URL: https://github.com/apache/hudi/issues/11583
**Describe the problem you faced**
Hi team, when loading a partitioned dataset I am seeing slow read performance, even before any Spark action is executed. Could you review the configurations/details below and let me know whether this delay is expected even with the metadata table enabled for reads? Thanks.
I am using EMR 6.7 with Hudi Version 0.11.0.
Spark Submit -
```
spark-submit --master yarn --deploy-mode client \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.hadoop.fs.s3.maxRetries=50 \
  --conf spark.shuffle.blockTransferService=nio \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
```
Dataset - contains 5,864 partitions

**Metadata disabled**
```python
df = spark.sql("SELECT * FROM tst_db.tst_tb_partitioned_tst")  # time taken: 226 seconds
df.count()                                                     # time taken: 24 seconds
```

**Metadata enabled**
```python
spark.conf.set("hoodie.metadata.enable", "true")
df = spark.sql("SELECT * FROM tst_db.tst_tb_partitioned_tst")  # time taken: 58 seconds
df.count()                                                     # time taken: 34 seconds
```