xushiyan commented on PR #7362:
URL: https://github.com/apache/hudi/pull/7362#issuecomment-1502660163

> > @weimingdiit thanks for making the patch. I see the main problem here is that it's using in-memory size for an estimation that is actually intended for storage size, so it may not be accurate. I have a different approach that estimates the size using a sample write in [this PR](https://github.com/apache/hudi/pull/8390). Please take a look.
>
> This is a good way, and maybe the estimated size will be more accurate, but I think I need to test with a data set to determine how much error the memory estimate has and whether this error is acceptable. In addition, I think my implementation may be simpler: it does not need to use the file system, and it saves some overhead of creating/reading files.

@weimingdiit the problem is that if the estimation based on in-memory size is too far off, we can't use it to improve the file sizing. Have you done any experiments to measure the accuracy? Based on my experience, the gap will be big, and that's why we're looking to perform sample writes. Making use of the file system is not that bad: we use `.hoodie/.aux/`, which is designed for auxiliary purposes.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
