rubenssoto opened a new issue #1981:
URL: https://github.com/apache/hudi/issues/1981


   Hi, How are you?
   
   I have two tables in my data lake. The bigger one is 300GB of regular 
parquet, and a simple count on it in Athena takes 8 seconds:
   select count(1) from table
   
   The other table is a smaller Hudi dataset of 47GB, and the same simple count 
takes 1 minute and 37 seconds in Athena. Both tables are partitioned by date. 
The first table has a lot of small files and the second has one file per 
partition; the biggest file is 600MB.
   
   I really don't understand why Athena performance is so different between 
these tables.
   
   The first table was created by a Glue Crawler and the second by Apache Hudi. 
I saw some differences in the Glue Catalog:
   
   These are screenshots of the table crawled by Glue; you can see some 
properties such as the row count.
   <img width="1324" alt="Captura de Tela 2020-08-18 às 19 35 40" 
src="https://user-images.githubusercontent.com/36298331/90572360-3807e780-e18a-11ea-9ae4-47f51fa28eb0.png">
   <img width="1439" alt="Captura de Tela 2020-08-18 às 19 36 12" 
src="https://user-images.githubusercontent.com/36298331/90572366-3b02d800-e18a-11ea-84d6-55621cff6f9b.png">
   
   These are screenshots of the Hudi table, which does not show the same properties:
   <img width="1412" alt="Captura de Tela 2020-08-18 às 19 37 57" 
src="https://user-images.githubusercontent.com/36298331/90572450-6dacd080-e18a-11ea-9800-001871ee7d4f.png">
   <img width="1371" alt="Captura de Tela 2020-08-18 às 19 37 34" 
src="https://user-images.githubusercontent.com/36298331/90572455-71405780-e18a-11ea-9a89-c3e0afc206fa.png">
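
The catalog difference visible in the screenshots could also be listed in text form. The snippet below is only a minimal sketch: `diff_table_parameters` is a hypothetical helper, and the two input dictionaries are made-up placeholders rather than the real parameters of either table. It shows which parameter keys exist on one table but not the other; it does not show whether that difference explains the query time.

```python
def diff_table_parameters(crawled, hudi):
    """Compare two Glue table parameter dicts (hypothetical helper).

    Returns the keys present on only one side and the shared keys
    whose values differ, each as a sorted list.
    """
    crawled_keys, hudi_keys = set(crawled), set(hudi)
    shared = crawled_keys & hudi_keys
    return {
        "only_in_crawled": sorted(crawled_keys - hudi_keys),
        "only_in_hudi": sorted(hudi_keys - crawled_keys),
        "different": sorted(k for k in shared if crawled[k] != hudi[k]),
    }


# Placeholder inputs for illustration only; the keys and values are
# assumptions, not copied from the tables in this issue.
crawled_params = {"classification": "parquet", "recordCount": "<n>"}
hudi_params = {"classification": "parquet", "last_commit_time_sync": "<ts>"}

report = diff_table_parameters(crawled_params, hudi_params)
for section, keys in report.items():
    print(section, keys)
```

With boto3, the real inputs could be read from the `Parameters` field of the `Table` returned by the Glue client's `get_table(DatabaseName=..., Name=...)` call, one call per table.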
   
   
   Could this be the reason for the performance difference? If so, how can I solve it?
   
   Thank you so much!

