ssomuah commented on issue #1852: URL: https://github.com/apache/hudi/issues/1852#issuecomment-663646201
Hi Balaji, I think I've narrowed down my issue somewhat for my MOR table. I started over with a fresh table and the initial commits look reasonable, but after a while I notice it's consistently trying to write 300+ files.

<img width="964" alt="Screen Shot 2020-07-24 at 1 15 17 PM" src="https://user-images.githubusercontent.com/2061955/88417393-da14f980-cdaf-11ea-87ab-63f3aafade83.png"> <img width="1398" alt="Screen Shot 2020-07-24 at 1 15 36 PM" src="https://user-images.githubusercontent.com/2061955/88417402-de411700-cdaf-11ea-85dd-c10c405851d3.png"> <img width="1411" alt="Screen Shot 2020-07-24 at 1 15 52 PM" src="https://user-images.githubusercontent.com/2061955/88417424-e5682500-cdaf-11ea-9c4b-534e27d80c45.png">

The individual tasks don't take very long, so I think reducing the number of files it's trying to write would help.

<img width="1409" alt="Screen Shot 2020-07-24 at 1 16 03 PM" src="https://user-images.githubusercontent.com/2061955/88417487-fca71280-cdaf-11ea-9fc0-10a8a074501c.png">

I can also see from the CLI that whether it's running a compaction or a delta commit, it still seems to write the same number of files for a fraction of the data.

<img width="1307" alt="Screen Shot 2020-07-24 at 1 21 36 PM" src="https://user-images.githubusercontent.com/2061955/88417841-aa1a2600-cdb0-11ea-808f-d66595af91ea.png">

Is there something I can tune to reduce the number of files it breaks the data into? My current settings:

- `hoodie.logfile.max.size` is 256 MB
- `hoodie.parquet.max.file.size` is 256 MB
- `hoodie.parquet.compression.ratio` is the default 0.35
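For reference, here is a minimal sketch of how these sizing options would be passed to a PySpark Hudi writer. The table name, path, and key fields below are placeholders (the issue doesn't show them); the `hoodie.*` keys are standard Hudi configs, and the commented-out `df.write` line shows where they would be applied:

```python
# Sketch of the writer configuration under discussion.
# Table name, record/precombine fields, and path are hypothetical placeholders.
MB = 1024 * 1024

hudi_options = {
    "hoodie.table.name": "my_mor_table",                      # placeholder
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "id",          # placeholder
    "hoodie.datasource.write.precombine.field": "ts",         # placeholder
    # File-sizing knobs mentioned above:
    "hoodie.logfile.max.size": str(256 * MB),
    "hoodie.parquet.max.file.size": str(256 * MB),
    "hoodie.parquet.compression.ratio": "0.35",
}

# In a live Spark session the write would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/path/to/table")

print(hudi_options["hoodie.parquet.max.file.size"])  # → 268435456
```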
