vburenin commented on a change in pull request #2440:
URL: https://github.com/apache/hudi/pull/2440#discussion_r559284563
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
##########
@@ -274,19 +275,27 @@ private boolean isBlockCorrupt(int blocksize) throws IOException {
}
private long scanForNextAvailableBlockOffset() throws IOException {
+ // Make the buffer large enough to scan through the file as quickly as possible, especially on S3/GCS.
+ // Using a smaller buffer incurs many more API calls, drastically increasing storage cost,
+ // and scanning large files could take days to complete.
+ byte[] dataBuf = new byte[1024 * 1024];
Review comment:
All of that simple logic runs for every byte offset, so it adds up no matter what. I also forgot to mention that we advance the position by only 1 byte while copying the next 6 bytes each time, so we can potentially copy 6 times more data than necessary, which is not ideal.
I will add a BufferedInputStream as soon as I get back to work on Tuesday. It is super hard to find even 5 minutes when all the kids are home.
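To make the trade-off concrete, here is a minimal sketch of the kind of buffered scan being discussed: a `BufferedInputStream` with a large buffer turns the byte-by-byte marker search into a handful of big reads against the underlying storage, and a sliding window avoids re-copying the 6-byte magic for each 1-byte advance. This is not Hudi's actual `scanForNextAvailableBlockOffset` implementation; the `MagicScanner` class name and the `MAGIC` bytes are hypothetical placeholders.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MagicScanner {
  // Hypothetical 6-byte block marker; Hudi's real magic bytes differ.
  static final byte[] MAGIC = {'#', 'H', 'U', 'D', 'I', '#'};

  // Returns the offset of the next occurrence of MAGIC in the stream, or -1.
  static long scanForNextBlockOffset(InputStream raw) throws IOException {
    // Large buffer so cloud object stores see few, large reads
    // instead of one API call per small read.
    BufferedInputStream in = new BufferedInputStream(raw, 1024 * 1024);
    byte[] window = new byte[MAGIC.length];
    int filled = in.read(window, 0, window.length);
    if (filled < window.length) {
      return -1;
    }
    long offset = 0;
    while (true) {
      if (Arrays.equals(window, MAGIC)) {
        return offset;
      }
      int next = in.read();
      if (next < 0) {
        return -1;
      }
      // Slide the window by one byte instead of re-reading all 6 bytes.
      System.arraycopy(window, 1, window, 0, window.length - 1);
      window[window.length - 1] = (byte) next;
      offset++;
    }
  }
}
```

The buffered reads mean each `in.read()` call is usually served from memory, so the per-byte loop stays cheap even when the underlying stream is an S3/GCS object.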
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]