[GitHub] [hudi] xushiyan commented on pull request #7362: [HUDI-5315] The record size is dynamically estimated when the table i…

via GitHub Mon, 10 Apr 2023 21:09:09 -0700


xushiyan commented on PR #7362:
URL: https://github.com/apache/hudi/pull/7362#issuecomment-1502660163


   > > @weimingdiit thanks for making the patch. I see the main problem here is
   > > that it uses in-memory size for an estimate that is actually intended to
   > > reflect storage size, so it may not be accurate. I have a different approach
   > > that estimates the size using a sample write in [this
   > > PR](https://github.com/apache/hudi/pull/8390). Please take a look.
   >
   > This is a good way, and the estimated size may be more accurate, but I
   > think I need to test with a data set to determine how much error the memory
   > estimate has, and whether this error is acceptable. In addition, I think
   > my implementation may be simpler: it does not need to use the file system,
   > and it saves some overhead of creating/reading files.
   
   @weimingdiit the problem is that if the estimation based on in-memory size is
too far off, then we can't use it to improve the file sizing. Have you done any
experiments to measure the accuracy? Based on my experience, the gap will be big,
and that's why we're looking to perform sample writes. Making use of the file
system is not that bad: we make use of `.hoodie/.aux/`, which is designed for
auxiliary purposes.
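
To illustrate the kind of gap being discussed, here is a minimal sketch that compares an in-memory size estimate with a sample-write estimate. It is not the code from either PR and uses no Hudi APIs. The function names `estimate_in_memory_size` and `estimate_by_sample_write` are hypothetical, gzip-compressed JSON lines stand in for the real storage format, and a temporary directory stands in for an auxiliary folder such as `.hoodie/.aux/`.

```python
import gzip
import json
import os
import sys
import tempfile


def estimate_in_memory_size(records):
    """Average shallow in-memory footprint per record (dict plus its keys and values)."""
    total = 0
    for rec in records:
        total += sys.getsizeof(rec)
        total += sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in rec.items())
    return total / len(records)


def estimate_by_sample_write(records, aux_dir):
    """Write the sample to a compressed file in aux_dir and return average bytes per record.

    Assumption: gzip-compressed JSON lines are only a stand-in for the real
    columnar, compressed storage format.
    """
    fd, path = tempfile.mkstemp(suffix=".jsonl.gz", dir=aux_dir)
    os.close(fd)
    try:
        with gzip.open(path, "wt", encoding="utf-8") as out:
            for rec in records:
                out.write(json.dumps(rec) + "\n")
        return os.path.getsize(path) / len(records)
    finally:
        # Clean up the sample file so the auxiliary directory does not grow.
        os.remove(path)


# Hypothetical sample of incoming records.
sample = [{"id": i, "name": f"user_{i}", "city": "Singapore"} for i in range(1000)]

with tempfile.TemporaryDirectory() as aux_dir:
    mem_avg = estimate_in_memory_size(sample)
    disk_avg = estimate_by_sample_write(sample, aux_dir)

print(f"in-memory estimate:    {mem_avg:.1f} bytes/record")
print(f"sample-write estimate: {disk_avg:.1f} bytes/record")
```

The absolute numbers depend on the runtime and the data, so they say nothing about Hudi itself. The sketch only shows why an experiment on real data is needed before trusting an in-memory estimate for file sizing.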


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

