rubenssoto edited a comment on issue #1829:
URL: https://github.com/apache/hudi/issues/1829#issuecomment-766821182
@vinothchandar
Thank you so much for your answer.
When do you plan to release this version? I will try some workarounds until
then.
Is this configuration correct?
```
{
  "conf": {
    "spark.jars.packages": "org.apache.spark:spark-avro_2.12:2.4.4",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.jars": "s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar",
    "spark.sql.hive.convertMetastoreParquet": "false",
    "spark.hadoop.hoodie.metadata.enable": "true"
  }
}
```
I ran these two queries:
```
spark.read.format('hudi').load('s3://ze-data-lake/temp/order_test').count()
```
```
%%sql
select count(*) from raw_courier_api.order_test
```
For the PySpark query, Spark creates a job with 143 tasks and, after about 10
seconds of file listing, the count is fast. For the Spark SQL query, Spark
creates a job with 2000 tasks and it is very slow. Is this a Hudi or a Spark
issue?
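A possible workaround I am considering (just a sketch, not verified against this Hudi snapshot build): reading the table through the Hudi datasource and registering it as a temporary view, so the SQL count goes through the same Hudi relation and file listing as the PySpark query instead of the Hive metastore parquet path. The path comes from the query above; `spark` is assumed to be an active SparkSession, and the view name `order_test_hudi` is my own choice.

```python
# Sketch: route the SQL query through the Hudi datasource instead of
# the Hive metastore table. Assumes an active SparkSession `spark`
# and the same table path used in the queries above.
df = spark.read.format("hudi").load("s3://ze-data-lake/temp/order_test")

# Register a temporary view so Spark SQL uses the same Hudi relation
# (and the same listing behavior) as the DataFrame read.
df.createOrReplaceTempView("order_test_hudi")

# This count should then match df.count() from the PySpark query.
spark.sql("select count(*) from order_test_hudi").show()
```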
SPARK SQL
<img width="1680" alt="Captura de Tela 2021-01-25 às 10 45 16"
src="https://user-images.githubusercontent.com/36298331/105713972-83bd7a80-5efa-11eb-91e0-b17ca1a3a394.png">
PYSPARK
<img width="1680" alt="Captura de Tela 2021-01-25 às 10 47 13"
src="https://user-images.githubusercontent.com/36298331/105714171-ca12d980-5efa-11eb-8a68-97dc880b2671.png">
Another problem I ran into: my table has 36 million rows, but with this
config the count shows only 4 million.
Thank you so much!
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]