garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-631067632


   @nsivabalan There are a few concerns about using the bloom filter size:
   
   - Even if we deduct the size of the bloom filter, other metadata remains, so 
`totalWriteBytes` still does not represent the total record size. When the 
situation we discussed above happens, small files can still be produced.
   - This adds complexity when we handle other index types such as 
HbaseIndexing.
   
   I think this estimation works if we have either enough samples or an 
accurate total record size. At least one file larger than the small file size 
limit should give us enough samples. With enough samples, the metadata size 
becomes negligible.
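
   To illustrate the point, here is a rough sketch in Python (not the actual 
Hudi code). The function name, the tuple layout, the fallback value, and the 
numbers are all made up for illustration; only `totalWriteBytes` comes from the 
discussion, and using a commit's total bytes as a stand-in for "a file larger 
than the small file limit" is my assumption. It shows how a fixed metadata 
overhead skews the estimate for a tiny write but barely matters for a large one:

```python
def estimate_avg_record_size(commits, small_file_limit_bytes, default_size=1024):
    """Estimate bytes per record from past commit stats.

    commits: iterable of (total_write_bytes, total_records_written) tuples,
             newest first (hypothetical layout, not the Hudi data model).
    Commits at or below the small file limit are skipped, because fixed
    metadata (bloom filter, footers, ...) dominates their size.
    """
    for total_write_bytes, total_records in commits:
        if total_records > 0 and total_write_bytes > small_file_limit_bytes:
            return total_write_bytes / total_records
    # No commit is large enough to trust: fall back to a configured default.
    return default_size


# Illustrative numbers: 100-byte records plus 1,000,000 bytes of fixed metadata.
METADATA_BYTES = 1_000_000
small_commit = (1_000 * 100 + METADATA_BYTES, 1_000)          # naive: 1100 B/record
large_commit = (2_000_000 * 100 + METADATA_BYTES, 2_000_000)  # naive: 100.5 B/record

limit = 100 * 1024 * 1024  # assumed 100 MB small file limit
print(estimate_avg_record_size([small_commit, large_commit], limit))
```

   The small commit alone would overestimate the record size by about 11x, 
while the large commit is within 1% of the real value without subtracting any 
metadata.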



