[PR] [HUDI-8605] Fix Trino failure when reading corrupted block at end of log file [hudi]

via GitHub Sun, 01 Dec 2024 20:00:48 -0800


usberkeley opened a new pull request, #12393:
URL: https://github.com/apache/hudi/pull/12393


   ### Change Logs
   
   #### Background
   When a corrupted block appears at the end of a log file, the Trino reader 
(LogScanner) fails to read the file. Hudi checks for corruption by calling 
InputStream#seek to jump to the expected end of the LogBlock, and it relies on 
an EOFException being thrown when that position lies beyond the end of the 
file. However, Trino's TrinoInputStream#seek does not guarantee an 
EOFException in that case. Some file system implementations, such as 
AzureInputStream#seek, throw a plain IOException instead.
   
   Ref:
   **trino-filesystem-azure** AzureInputStream#seek
   ```
       @Override
       public void seek(long newPosition)
               throws IOException
       {
           ensureOpen();
           if (newPosition < 0) {
               throw new IOException("Negative seek offset");
           }
           if (newPosition > fileSize) {
               throw new IOException("Cannot seek to %s. File size is %s: %s".formatted(newPosition, fileSize, location));
           }
           nextPosition = newPosition;
       }
   ```
   
   #### Solution
   Since we cannot control how the query side reports a seek beyond the end of 
the file, this change stops relying on the exception. Instead, it compares the 
end position of the LogBlock with the end position of the file to determine 
whether the file contains enough bytes to hold the whole LogBlock.
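
The comparison described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class `LogBlockBoundsCheck`, the method `fitsInFile`, and its parameters are hypothetical names, and it assumes the block start offset, the declared block size, and the file length are already known to the reader.

```java
/**
 * Hypothetical helper illustrating a bounds check that avoids seeking.
 * The names here are assumptions for illustration, not Hudi's actual API.
 */
final class LogBlockBoundsCheck {

    private LogBlockBoundsCheck() {
    }

    /**
     * Returns true when a block that starts at blockStart and declares
     * blockSize bytes lies entirely within a file of fileLength bytes.
     * No seek is issued, so the result does not depend on which exception
     * a particular input stream throws past the end of the file.
     */
    static boolean fitsInFile(long blockStart, long blockSize, long fileLength) {
        if (blockStart < 0 || blockSize < 0 || fileLength < 0) {
            // A negative value can only come from corrupted metadata.
            return false;
        }
        if (blockStart > fileLength) {
            return false;
        }
        // Written as a subtraction so that a huge (corrupted) blockSize
        // cannot overflow the way blockStart + blockSize could.
        return blockSize <= fileLength - blockStart;
    }
}
```

Under these assumptions, a reader would treat the block as corrupted (or truncated) when `fitsInFile` returns false and only then skip the seek-based check. The subtraction form is a design choice: a corrupted size field can hold an arbitrarily large value, and adding it to the start offset could wrap around to a small number.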
   
   ### Impact
   
   Fixes the Trino failure when reading a corrupted block at the end of a log file.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
