usberkeley opened a new pull request, #12393:
URL: https://github.com/apache/hudi/pull/12393
### Change Logs
#### Background
When a corrupted block appears at the end of a log file, the Trino reader
(LogScanner) fails to read the file. To check for corruption, Hudi uses
InputStream#seek to jump to the expected end of the LogBlock, and it
expects an EOFException if that position lies beyond the end of the file.
However, Trino's TrinoInputStream#seek does not necessarily throw an
EOFException in that case. Some file system implementations, such as
AzureInputStream#seek, throw a plain IOException instead.
Ref:
**trino-filesystem-azure** AzureInputStream#seek
```
@Override
public void seek(long newPosition)
        throws IOException
{
    ensureOpen();
    if (newPosition < 0) {
        throw new IOException("Negative seek offset");
    }
    if (newPosition > fileSize) {
        throw new IOException("Cannot seek to %s. File size is %s: %s".formatted(newPosition, fileSize, location));
    }
    nextPosition = newPosition;
}
```
#### Solution
Since we cannot control which exception the query side throws when seeking
beyond the end of the file, the check should not depend on it. Instead,
compare the end position of the LogBlock with the end position of the
file to determine whether the file is long enough to contain the whole
LogBlock.
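
A minimal sketch of this comparison, for illustration only. The class name
`LogBlockBoundsCheck`, the method `extendsPastEndOfFile`, and its parameters
are hypothetical and are not taken from the Hudi code base; the sketch only
assumes that the block start position, the declared block size, and the file
length are available as `long` values.

```java
class LogBlockBoundsCheck {
    /**
     * Returns true when a block that starts at blockStart and declares
     * blockSize bytes would end beyond a file of fileLength bytes.
     * All names here are hypothetical; this is not Hudi's actual API.
     */
    static boolean extendsPastEndOfFile(long blockStart, long blockSize, long fileLength) {
        if (blockStart < 0 || blockSize < 0 || fileLength < 0) {
            throw new IllegalArgumentException("positions and sizes must be non-negative");
        }
        // Compare by subtraction so that a garbage blockSize near
        // Long.MAX_VALUE cannot overflow blockStart + blockSize.
        return blockStart > fileLength || blockSize > fileLength - blockStart;
    }

    public static void main(String[] args) {
        // Block [100, 150) inside a 200-byte file: enough space to read it.
        System.out.println(extendsPastEndOfFile(100, 50, 200));
        // Block [100, 250) in a 200-byte file: truncated, treat as corrupted.
        System.out.println(extendsPastEndOfFile(100, 150, 200));
    }
}
```

Because the decision uses only positions and lengths, it behaves the same
whether the underlying stream throws EOFException, IOException, or nothing
at all when asked to seek past the end of the file.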
### Impact
Fix Trino failure when reading corrupted block at end of log file
### Risk level (write none, low medium or high below)
low
### Documentation Update
none
### Contributor's checklist
- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed