tchivs opened a new issue, #6657:
URL: https://github.com/apache/paimon/issues/6657

   ## Description
   
   `ParquetInputStream` currently doesn't override the `readVectored()` method from its parent class `DelegatingSeekableInputStream`, so calls fall back to the default implementation, which throws `UnsupportedOperationException`.
   
   ## Problem
   
   When `ParquetFileReader` attempts to use vectored reads for improved I/O 
performance, the operation fails because:
   
   1. **ParquetFileReader** calls `readVectored()` on the underlying stream 
(line 667 in ParquetFileReader.java)
   2. The stream is an `org.apache.parquet.io.SeekableInputStream`, whose default `readVectored()` implementation throws `UnsupportedOperationException`
   3. **ParquetInputStream** (Paimon's wrapper) extends 
`DelegatingSeekableInputStream` but doesn't override `readVectored()`
   4. The exception is thrown and vectored reads never take effect (the sketch below illustrates this)
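
   To make the failure concrete, here is a minimal, self-contained illustration (a sketch, not Paimon code): a `DelegatingSeekableInputStream` subclass that, like today's `ParquetInputStream`, adds no `readVectored()` override, so the call lands on the default described above. The `ParquetFileRange(offset, length)` constructor and `HeapByteBufferAllocator` usage are assumptions about parquet-mr's vectored-IO API.

```java
import java.io.ByteArrayInputStream;
import java.util.Collections;

import org.apache.parquet.bytes.HeapByteBufferAllocator;
import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.ParquetFileRange;

public class VectoredReadGapDemo {

    /** Stand-in for a wrapper that, like ParquetInputStream today, adds no readVectored(). */
    static class NoVectoredWrapper extends DelegatingSeekableInputStream {
        private long pos;

        NoVectoredWrapper(byte[] data) {
            super(new ByteArrayInputStream(data));
        }

        @Override
        public long getPos() {
            return pos;
        }

        @Override
        public void seek(long newPos) {
            // Simplified for the demo; a real wrapper repositions the underlying stream.
            this.pos = newPos;
        }
    }

    public static void main(String[] args) throws Exception {
        NoVectoredWrapper stream = new NoVectoredWrapper(new byte[32]);
        try {
            // No override in the wrapper, so this resolves to SeekableInputStream's default.
            stream.readVectored(
                    Collections.singletonList(new ParquetFileRange(0, 8)),
                    new HeapByteBufferAllocator());
        } catch (UnsupportedOperationException e) {
            System.out.println("Vectored read rejected: " + e.getMessage());
        }
    }
}
```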
   
   ## Impact
   
   - **Performance**: Cannot leverage Parquet's vectored read optimization for 
parallel I/O
   - **Efficiency**: Falls back to sequential reads even when the underlying 
FileIO supports vectored reads
   - **Cloud Storage**: Missing optimization opportunities for S3, OSS, and 
other cloud storage systems that benefit from batch reads
   
   ## Root Cause
   
   The gap lies between two interfaces:
   
   **Paimon's Interface**:
   - `VectoredReadable` interface with `readVectored(List<FileRange>)` 
   - `FileRange` uses `CompletableFuture<byte[]>` for async results
   
   **Parquet's Interface**:
   - `SeekableInputStream.readVectored(List<ParquetFileRange>, 
ByteBufferAllocator)`
   - `ParquetFileRange` uses `CompletableFuture<ByteBuffer>` for async results
   
   **ParquetInputStream** needs to bridge these two interfaces.
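
   The crux of that bridge is the asynchronous result type: Paimon completes a `CompletableFuture<byte[]>`, while Parquet expects a `CompletableFuture<ByteBuffer>`. A minimal, dependency-free sketch of that adaptation (plain JDK, no Paimon or Parquet classes involved):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

public class FutureAdaptationDemo {
    public static void main(String[] args) {
        // Paimon side: an asynchronous read that eventually completes with raw bytes.
        CompletableFuture<byte[]> bytesFuture =
                CompletableFuture.supplyAsync(() -> new byte[] {1, 2, 3});

        // Parquet side: adapt to a ByteBuffer future; ByteBuffer.wrap is a zero-copy view.
        CompletableFuture<ByteBuffer> bufferFuture = bytesFuture.thenApply(ByteBuffer::wrap);

        System.out.println("bytes readable = " + bufferFuture.join().remaining()); // prints 3
    }
}
```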
   
   ## Proposed Solution
   
   Implement `readVectored()` in `ParquetInputStream` to (see the sketch after this list):
   
   1. **Check capability**: Detect if underlying stream supports 
`VectoredReadable`
   2. **Convert ranges**: Transform `ParquetFileRange` to `FileRange` 
   3. **Delegate to Paimon**: Use Paimon's `VectoredReadable.readVectored()` 
   4. **Transform data**: Convert `CompletableFuture<byte[]>` to 
`CompletableFuture<ByteBuffer>`
   5. **Fallback**: Use serial reads when vectored reads are unavailable
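
   Below is a sketch of what that override could look like. It is not the final implementation: the constructor shape, the wrapped-stream field, and the Paimon-side names `FileRange.createFileRange(...)`, `FileRange#getData()`, and `ParquetFileRange#setDataReadFuture(...)` are assumptions based on the interfaces described in the Root Cause section; the Parquet-side signature follows the `readVectored(List<ParquetFileRange>, ByteBufferAllocator)` form quoted there.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

import org.apache.paimon.fs.FileRange;
import org.apache.paimon.fs.SeekableInputStream;
import org.apache.paimon.fs.VectoredReadable;
import org.apache.parquet.bytes.ByteBufferAllocator;
import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.ParquetFileRange;

public class ParquetInputStream extends DelegatingSeekableInputStream {

    // The wrapped Paimon stream; field name and constructor are assumptions for this sketch.
    private final SeekableInputStream in;

    public ParquetInputStream(SeekableInputStream in) {
        super(in);
        this.in = in;
    }

    @Override
    public long getPos() throws IOException {
        return in.getPos();
    }

    @Override
    public void seek(long newPos) throws IOException {
        in.seek(newPos);
    }

    @Override
    public void readVectored(List<ParquetFileRange> ranges, ByteBufferAllocator allocator)
            throws IOException {
        if (in instanceof VectoredReadable) {
            // Steps 1-3: capability check passed; convert ranges and delegate to Paimon.
            List<FileRange> paimonRanges = new ArrayList<>(ranges.size());
            for (ParquetFileRange range : ranges) {
                paimonRanges.add(FileRange.createFileRange(range.getOffset(), range.getLength()));
            }
            ((VectoredReadable) in).readVectored(paimonRanges);

            // Step 4: adapt each CompletableFuture<byte[]> to the ByteBuffer future Parquet
            // expects. ByteBuffer.wrap keeps this zero-copy; a fuller version could honor
            // the allocator by copying into allocator.allocate(length).
            for (int i = 0; i < ranges.size(); i++) {
                CompletableFuture<byte[]> data = paimonRanges.get(i).getData();
                ranges.get(i).setDataReadFuture(data.thenApply(ByteBuffer::wrap));
            }
        } else {
            // Step 5: fallback; serve each range with a plain positioned read.
            for (ParquetFileRange range : ranges) {
                byte[] buffer = new byte[range.getLength()];
                seek(range.getOffset());
                readFully(buffer);
                range.setDataReadFuture(
                        CompletableFuture.completedFuture(ByteBuffer.wrap(buffer)));
            }
        }
    }
}
```

   The fallback branch still completes every range's future, so callers see the same contract whether or not the wrapped FileIO supports vectored reads; that is what keeps the change backward compatible.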
   
   ## Benefits
   
   ✅ **Performance**: Enable vectored reads for Parquet files in Paimon
   ✅ **Compatibility**: Work with both vectored and non-vectored FileIO 
implementations  
   ✅ **Cloud Optimization**: Better I/O performance on S3, OSS, Azure, GCS
   ✅ **Backward Compatible**: Graceful fallback for older FileIO implementations
   
   ## Testing
   
   Comprehensive test coverage will include:
   - Vectored reads with `VectoredReadable` support
   - Fallback to serial reads without vectored support (a test sketch follows this list)
   - Empty ranges handling
   - End-to-end testing with real Parquet files
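
   As a starting point for that fallback case, a hypothetical JUnit 5 sketch. It assumes the constructor shape from the sketch above, plus a `ParquetFileRange(offset, length)` constructor and a `getDataReadFuture()` accessor on the Parquet side; the in-memory stream is a test stub written for this sketch, not a real Paimon class.

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.paimon.fs.SeekableInputStream;
import org.apache.parquet.bytes.HeapByteBufferAllocator;
import org.apache.parquet.io.ParquetFileRange;
import org.junit.jupiter.api.Test;

class ParquetInputStreamReadVectoredTest {

    /** In-memory Paimon stream that deliberately does NOT implement VectoredReadable. */
    private static class InMemoryStream extends SeekableInputStream {
        private final byte[] data;
        private int pos;

        InMemoryStream(byte[] data) {
            this.data = data;
        }

        @Override
        public void seek(long desired) {
            this.pos = (int) desired;
        }

        @Override
        public long getPos() {
            return pos;
        }

        @Override
        public int read() {
            return pos < data.length ? (data[pos++] & 0xFF) : -1;
        }
    }

    @Test
    void fallsBackToSerialReadsWithoutVectoredSupport() throws IOException {
        byte[] data = new byte[64];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        ParquetInputStream stream = new ParquetInputStream(new InMemoryStream(data));

        List<ParquetFileRange> ranges =
                Arrays.asList(new ParquetFileRange(0, 8), new ParquetFileRange(16, 4));
        stream.readVectored(ranges, new HeapByteBufferAllocator());

        // Every requested range must be completed, even without vectored support.
        assertThat(ranges.get(0).getDataReadFuture().join().remaining()).isEqualTo(8);
        assertThat(ranges.get(1).getDataReadFuture().join().get(0)).isEqualTo((byte) 16);
    }
}
```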
   
   ## Related Files
   
   - `paimon-format/src/main/java/org/apache/paimon/format/parquet/ParquetInputStream.java`
   - `paimon-format/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java`
   - `paimon-common/src/main/java/org/apache/paimon/fs/VectoredReadable.java`
   - `paimon-common/src/main/java/org/apache/paimon/fs/FileRange.java`

