bvaradar commented on issue #2229: URL: https://github.com/apache/hudi/issues/2229#issuecomment-721198458
"Small files" is an internal name for files that are small enough that new records (inserts) can still be written to them. In that case, a new version of the "small" file is created.

The Spark UI retains the name given to a previous job group, so the job group name can be misleading. In this case, the job that took the longest is not 'Getting small files from partitions'. I think it is the final write to the parquet files that is taking the time.

If the writes are mostly updates, consider a MOR (merge-on-read) table. If they are predominantly inserts (with some updates) and you want the inserts to be faster without worrying about smaller files getting created, consider turning off small file handling by setting hoodie.parquet.small.file.limit=0.
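As a rough sketch of the two suggestions above, the write options could be assembled like this in PySpark. The helper name `hudi_write_options` and its `mostly_updates` flag are hypothetical and not part of Hudi; only the `hoodie.*` keys are actual Hudi configs. A real write would also need the record key and precombine field settings, which are omitted here.

```python
def hudi_write_options(table_name, mostly_updates):
    """Build Hudi write options (hypothetical helper, not a Hudi API)."""
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
    }
    if mostly_updates:
        # Mostly updates: use a merge-on-read (MOR) table.
        options["hoodie.datasource.write.table.type"] = "MERGE_ON_READ"
    else:
        # Mostly inserts: stay on copy-on-write and turn off small file
        # handling, accepting that smaller files may get created.
        options["hoodie.datasource.write.table.type"] = "COPY_ON_WRITE"
        options["hoodie.parquet.small.file.limit"] = "0"
    return options


# Assumed usage, where `df` and `base_path` come from your own job:
# df.write.format("hudi") \
#     .options(**hudi_write_options("my_table", mostly_updates=False)) \
#     .mode("append").save(base_path)
```

The table type is normally fixed when the table is first created, so switching to MOR would apply to a new table rather than an existing one.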
