weimingdiit commented on PR #7362:
URL: https://github.com/apache/hudi/pull/7362#issuecomment-1502785581

   > > > @weimingdiit thanks for making the patch. The main problem I see is 
that it uses in-memory size to estimate what is actually storage size, which 
may not be accurate. I have a different approach that estimates the size 
using a sample write in [this 
PR](https://github.com/apache/hudi/pull/8390). Please take a look.
   > > 
   > > 
   > > This is a good approach, and the estimated size may well be more 
accurate, but I think I need to test with a data set to determine how much 
error the in-memory estimate has and whether that error is acceptable. In 
addition, I think my implementation may be simpler: it does not need to use 
the file system, and it saves the overhead of creating and reading files.
   > 
   > @weimingdiit the problem is that if the estimate based on in-memory size 
is too far off, we can't use it to improve file sizing. Have you done any 
experiments to measure the accuracy? Based on my experience, the gap will be 
big, and that's why we're looking to perform sample writes. Making use of the 
file system is not that bad: we use `.hoodie/.aux/`, which is designed for 
auxiliary purposes.
   
   @xushiyan thanks Xu, I think your method is indeed more accurate. When I 
tested with 100,000 records locally, there was a gap of 20% to 30%. Maybe I 
should close this PR.
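
   For illustration, here is a minimal sketch of how the gap between an 
in-memory estimate and a sample write could be measured. It is not Hudi's 
implementation: the function names are hypothetical, the in-memory figure is a 
shallow `sys.getsizeof` sum, and gzip-compressed JSON lines in a temporary file 
stand in for a real base file format.

```python
import gzip
import json
import os
import sys
import tempfile


def in_memory_estimate(records):
    """Average shallow in-memory size per record (dict plus its keys and values)."""
    total = 0
    for rec in records:
        total += sys.getsizeof(rec)
        for key, value in rec.items():
            total += sys.getsizeof(key) + sys.getsizeof(value)
    return total / len(records)


def sample_write_estimate(records):
    """Average on-disk size per record, measured by writing the sample to a temp file."""
    # Assumption: gzip-compressed JSON lines stand in for the real storage format.
    fd, path = tempfile.mkstemp(suffix=".json.gz")
    os.close(fd)
    try:
        with gzip.open(path, "wt", encoding="utf-8") as out:
            for rec in records:
                out.write(json.dumps(rec) + "\n")
        return os.path.getsize(path) / len(records)
    finally:
        os.remove(path)


def relative_gap(estimate, actual):
    """Relative error of an estimate against the measured size."""
    return abs(estimate - actual) / actual


if __name__ == "__main__":
    # Hypothetical sample data; a real test would use records from the table.
    sample = [{"id": i, "name": f"user-{i}", "score": i * 0.5} for i in range(100_000)]
    mem = in_memory_estimate(sample)
    disk = sample_write_estimate(sample)
    print(f"in-memory: {mem:.1f} B/record, sample write: {disk:.1f} B/record")
    print(f"relative gap: {relative_gap(mem, disk):.0%}")
```

   The numbers this prints depend entirely on the data and the format chosen, 
so they say nothing about the 20% to 30% figure above; the sketch only shows 
the shape of such a comparison.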


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

