garyli1019 commented on pull request #1602: URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628934924
> commit1 only wrote 1 record but the parquet file is 20MB

@vinothchandar Sorry, that example was bad... Let's say an 8MB (2M entries) bloom filter plus 200 records produces a 10MB parquet file. If those 200 records are assigned to an existing partition, they will very likely be inserted into an existing file, so there is no problem. But if they go to a new partition, this small file is inevitable.

In the next run, we treat this 10MB file as a small file and calculate `averageRecordSize = 10MB / 200 = 50KB`. If we set the max parquet file size to 100MB, we assign `(100MB - 10MB) / 50KB = 1800` records to fill this small file. Every other file gets assigned 2000 records.

Yes, we can never get a pure record size. Even if we deduct the bloom filter size in this case, the remaining metadata still gives `(2MB / 200) = 10KB` per record, which will produce somewhat larger small files...

Another idea in my mind:

- if the `totalBytesWritten` is less than `DEFAULT_PARQUET_SMALL_FILE_LIMIT_BYTES`, then skip calculating size from this commit.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
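The sizing arithmetic from the comment above can be sketched as follows. This is a minimal illustration only; the class and method names are hypothetical and are not Hudi's actual API.

```java
// Hypothetical sketch of the small-file sizing math discussed above.
// Names are illustrative; this is not Hudi's actual implementation.
public class SmallFileSizingSketch {

    // averageRecordSize = totalBytesWritten / recordsWritten.
    // A large bloom filter in totalBytesWritten inflates this value.
    static long averageRecordSize(long totalBytesWritten, long recordsWritten) {
        return totalBytesWritten / recordsWritten;
    }

    // Records assigned to top up a small file to the max parquet size.
    static long recordsToFill(long maxFileSizeBytes, long currentFileSizeBytes,
                              long avgRecordSizeBytes) {
        return (maxFileSizeBytes - currentFileSizeBytes) / avgRecordSizeBytes;
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        // 10MB file holding only 200 records: the 8MB bloom filter
        // dominates, so the "average record" looks ~50KB large.
        long avg = averageRecordSize(10 * MB, 200);
        long fill = recordsToFill(100 * MB, 10 * MB, avg);
        System.out.println("avg=" + avg + "B, recordsToFill=" + fill);
    }
}
```

With these numbers the inflated average (~50KB) means only ~1800 records are routed to the small file, even though far more true records would fit.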
