bvaradar commented on issue #2229:
URL: https://github.com/apache/hudi/issues/2229#issuecomment-721198458


   "Small files" is Hudi's internal name for files that are small enough that 
new records (inserts) can still be written to them. When that happens, a new 
version of the "small" file is created.
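
   To make the idea concrete, here is a minimal sketch of how such files could 
be picked out. It is not Hudi's actual code: the function name, the file names, 
the sizes and the strict "less than" comparison are assumptions made for 
illustration only.

```python
# Hypothetical sketch: pick the files that still have room for new inserts.
# Nothing here is a Hudi API; names and sizes are made up for illustration.

MB = 1024 * 1024


def small_file_candidates(file_sizes, small_file_limit):
    """Return the names of files whose size in bytes is below the limit."""
    return [name for name, size in file_sizes.items() if size < small_file_limit]


# Two made-up base files in one partition.
sizes = {"f1.parquet": 20 * MB, "f2.parquet": 110 * MB}

# With an assumed 100 MB limit, only the 20 MB file can take new inserts.
candidates = small_file_candidates(sizes, 100 * MB)

# With a limit of 0, no file qualifies, so inserts always go to new files.
no_candidates = small_file_candidates(sizes, 0)
```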
   
   Spark UI retains the name given to previous job groups, so the job group 
name can be misleading. In this case, the job that took the longest time is not 
'Getting small files from partitions'. I think it is the final write to the 
parquet files that is taking the time.
   
   If the workload is mostly updates, consider a MOR (merge-on-read) table. If 
it is predominantly inserts (with some updates) and you want the inserts to be 
faster without worrying about smaller files getting created, consider turning 
off small file handling by setting hoodie.parquet.small.file.limit=0.
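
   As a rough sketch of the two suggestions, the helper below builds the write 
options for each case. The function name and its boolean argument are 
hypothetical. The table type key hoodie.datasource.write.table.type is my 
assumption of the relevant datasource option, so check it against the 
documentation for your Hudi version. Other options a real write needs (table 
name, record key, precombine field and so on) are left out.

```python
# Hypothetical helper: choose Hudi write options based on the workload.
# Only hoodie.parquet.small.file.limit comes from the comment above; the
# table type key is an assumption and may differ between Hudi versions.


def hudi_write_options(mostly_updates: bool) -> dict:
    """Return a partial option map for a Hudi datasource write."""
    if mostly_updates:
        # Mostly updates: use a merge-on-read (MOR) table.
        return {"hoodie.datasource.write.table.type": "MERGE_ON_READ"}
    # Mostly inserts: turn off small file handling so inserts are not
    # routed into existing small files.
    return {"hoodie.parquet.small.file.limit": "0"}


insert_heavy_opts = hudi_write_options(mostly_updates=False)
update_heavy_opts = hudi_write_options(mostly_updates=True)
```

   The resulting map would then be merged with the rest of the write options 
and passed to the DataFrame writer, for example through .options(**opts) in 
PySpark.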
   
   


