garyli1019 commented on pull request #1602: URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628934924
> commit1 only wrote 1 record but the parquet file is 20MB

@vinothchandar Sorry, that example was bad... Let's say an 8MB (2M entries) bloom filter plus 200 records produces a 10MB parquet file. If those 200 records are assigned to an existing partition, they will very likely be inserted into an existing file, so there is no problem. But if they go to a new partition, this small file is inevitable.

In the next run, we treat this 10MB file as a small file and calculate `averageRecordSize = 10MB / 200 = 50KB`. If we set the max parquet file size to 100MB, we assign `(100MB - 10MB) / 50KB = 1800` records to fill this small file. Every other file gets assigned 2000 records.

Yes, we can never get a pure record size. Even if we deduct the bloom filter size in this case, the remaining metadata still gives `(2MB / 200) = 10KB` per record, which will produce somewhat larger small files...

Another idea in my mind:

- if the `totalBytesWritten` is less than `DEFAULT_PARQUET_SMALL_FILE_LIMIT_BYTES`, then skip calculating size from this commit.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
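The sizing arithmetic from the comment above can be sketched as follows. This is a minimal illustration only; the class and method names are hypothetical and are not Hudi's actual API.

```java
// Hypothetical sketch of the small-file sizing math discussed above.
// Names are illustrative; this is not Hudi's actual implementation.
public class SmallFileSizingSketch {

    // averageRecordSize = totalBytesWritten / recordsWritten.
    // A large bloom filter in totalBytesWritten inflates this value.
    static long averageRecordSize(long totalBytesWritten, long recordsWritten) {
        return totalBytesWritten / recordsWritten;
    }

    // Records assigned to top up a small file to the max parquet size.
    static long recordsToFill(long maxFileSizeBytes, long currentFileSizeBytes,
                              long avgRecordSizeBytes) {
        return (maxFileSizeBytes - currentFileSizeBytes) / avgRecordSizeBytes;
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        // 10MB file holding only 200 records: the 8MB bloom filter
        // dominates, so the "average record" looks ~50KB large.
        long avg = averageRecordSize(10 * MB, 200);
        long fill = recordsToFill(100 * MB, 10 * MB, avg);
        System.out.println("avg=" + avg + "B, recordsToFill=" + fill);
    }
}
```

With these numbers the inflated average (~50KB) means only ~1800 records are routed to the small file, even though far more true records would fit.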
