neerajpadarthi opened a new issue, #11583:
URL: https://github.com/apache/hudi/issues/11583

   **Describe the problem you faced**
   
Hi team, when loading a partitioned dataset I am seeing slow read performance, even before executing any Spark action. Could you review the configurations and details below and let me know whether this delay is expected, even with the metadata table enabled during reads? Thanks.
   
   I am using EMR 6.7 with Hudi version 0.11.0.
   
   Spark Submit:
   
   ```
   spark-submit --master yarn --deploy-mode client \
     --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
     --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.hive.convertMetastoreParquet=false \
     --conf spark.hadoop.fs.s3.maxRetries=50 \
     --conf spark.shuffle.blockTransferService=nio \
     --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
   ```
   
   Dataset - Contains 5,864 partitions
   
   **Metadata disabled:**
   
   ```python
   df = spark.sql("SELECT * FROM tst_db.tst_tb_partitioned_tst")  # took 226 seconds
   df.count()                                                     # took 24 seconds
   ```
   
   **Metadata enabled:**
   
   ```python
   spark.conf.set("hoodie.metadata.enable", "true")
   df = spark.sql("SELECT * FROM tst_db.tst_tb_partitioned_tst")  # took 58 seconds
   df.count()                                                     # took 34 seconds
   ```
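   For reference, the same `hoodie.metadata.enable` setting can also be passed as a per-read datasource option rather than a session-wide conf, which makes it easier to compare the two listing paths side by side. This is only a sketch; the S3 table path below is a placeholder, not the real location:
   
   ```python
   # Sketch: per-read Hudi datasource option (table path is hypothetical).
   # "hoodie.metadata.enable" makes the read use the metadata table for file
   # listing instead of recursively listing all 5,864 partitions on S3 at
   # query-planning time.
   df = (spark.read.format("hudi")
         .option("hoodie.metadata.enable", "true")               # metadata-table listing
         .load("s3://your-bucket/path/tst_tb_partitioned_tst"))  # placeholder path
   df.count()
   ```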
   
   

