tchivs opened a new issue, #6657: URL: https://github.com/apache/paimon/issues/6657
## Description

`ParquetInputStream` currently doesn't override the `readVectored()` method from its parent class `DelegatingSeekableInputStream`, causing it to fall back to the default implementation that throws `UnsupportedOperationException`.

## Problem

When `ParquetFileReader` attempts to use vectored reads for improved I/O performance, the operation fails because:

1. **ParquetFileReader** calls `readVectored()` on the underlying stream (line 667 in ParquetFileReader.java)
2. The stream is `org.apache.parquet.io.SeekableInputStream` which has a default implementation throwing `UnsupportedOperationException`
3. **ParquetInputStream** (Paimon's wrapper) extends `DelegatingSeekableInputStream` but doesn't override `readVectored()`
4. This causes the exception to be thrown, preventing vectored reads from working

## Impact

- **Performance**: Cannot leverage Parquet's vectored read optimization for parallel I/O
- **Efficiency**: Falls back to sequential reads even when the underlying FileIO supports vectored reads
- **Cloud Storage**: Missing optimization opportunities for S3, OSS, and other cloud storage systems that benefit from batch reads

## Root Cause

The gap exists between two interface systems:

**Paimon's Interface**:
- `VectoredReadable` interface with `readVectored(List<FileRange>)`
- `FileRange` uses `CompletableFuture<byte[]>` for async results

**Parquet's Interface**:
- `SeekableInputStream.readVectored(List<ParquetFileRange>, ByteBufferAllocator)`
- `ParquetFileRange` uses `CompletableFuture<ByteBuffer>` for async results

**ParquetInputStream** needs to bridge these two interfaces.

## Proposed Solution

Implement `readVectored()` in `ParquetInputStream` to:

1. **Check capability**: Detect if underlying stream supports `VectoredReadable`
2. **Convert ranges**: Transform `ParquetFileRange` to `FileRange`
3. **Delegate to Paimon**: Use Paimon's `VectoredReadable.readVectored()`
4. **Transform data**: Convert `CompletableFuture<byte[]>` to `CompletableFuture<ByteBuffer>`
5. **Fallback**: Use serial reads when vectored reads are unavailable

## Benefits

- ✅ **Performance**: Enable vectored reads for Parquet files in Paimon
- ✅ **Compatibility**: Work with both vectored and non-vectored FileIO implementations
- ✅ **Cloud Optimization**: Better I/O performance on S3, OSS, Azure, GCS
- ✅ **Backward Compatible**: Graceful fallback for older FileIO implementations

## Testing

Comprehensive test coverage will include:

- Vectored reads with `VectoredReadable` support
- Fallback to serial reads without vectored support
- Empty ranges handling
- End-to-end testing with real Parquet files

## Related Files

- `paimon-format/src/main/java/org/apache/paimon/format/parquet/ParquetInputStream.java`
- `paimon-format/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java`
- `paimon-common/src/main/java/org/apache/paimon/fs/VectoredReadable.java`
- `paimon-common/src/main/java/org/apache/paimon/fs/FileRange.java`
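
## Illustrative Sketch

The five steps of the proposed solution (capability check, range conversion, delegation, `byte[]` to `ByteBuffer` transformation, serial fallback) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: `ByteRange`, `BufferRange`, `PositionedSource`, `VectoredSource`, and the in-memory sources are hypothetical stand-ins for Paimon's `FileRange`, Parquet's `ParquetFileRange`, the wrapped stream, and `VectoredReadable`. The real Parquet method also takes a `ByteBufferAllocator`, which the sketch leaves out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class VectoredBridgeSketch {

    /** Hypothetical stand-in for Paimon's FileRange: the result arrives as byte[]. */
    static final class ByteRange {
        final long offset;
        final int length;
        final CompletableFuture<byte[]> data = new CompletableFuture<>();
        ByteRange(long offset, int length) { this.offset = offset; this.length = length; }
    }

    /** Hypothetical stand-in for ParquetFileRange: the caller expects a ByteBuffer. */
    static final class BufferRange {
        final long offset;
        final int length;
        CompletableFuture<ByteBuffer> data;
        BufferRange(long offset, int length) { this.offset = offset; this.length = length; }
    }

    /** Hypothetical minimal stream that supports positioned reads only. */
    interface PositionedSource {
        void readFully(long position, byte[] dst);
    }

    /** Hypothetical stand-in for Paimon's VectoredReadable. */
    interface VectoredSource extends PositionedSource {
        void readVectored(List<ByteRange> ranges);
    }

    /** Bridges BufferRange requests onto whatever the underlying source supports. */
    static void readVectored(PositionedSource in, List<BufferRange> ranges) {
        if (ranges.isEmpty()) {
            return; // nothing to read
        }
        if (in instanceof VectoredSource) { // step 1: check capability
            List<ByteRange> converted = new ArrayList<>(ranges.size());
            for (BufferRange r : ranges) {
                ByteRange b = new ByteRange(r.offset, r.length); // step 2: convert ranges
                r.data = b.data.thenApply(ByteBuffer::wrap);     // step 4: byte[] -> ByteBuffer
                converted.add(b);
            }
            ((VectoredSource) in).readVectored(converted);       // step 3: delegate
        } else {
            for (BufferRange r : ranges) {                       // step 5: serial fallback
                byte[] buf = new byte[r.length];
                in.readFully(r.offset, buf);
                r.data = CompletableFuture.completedFuture(ByteBuffer.wrap(buf));
            }
        }
    }

    /** In-memory source without vectored support (demo only). */
    static class InMemorySource implements PositionedSource {
        private final byte[] bytes;
        InMemorySource(byte[] bytes) { this.bytes = bytes; }
        @Override
        public void readFully(long position, byte[] dst) {
            System.arraycopy(bytes, (int) position, dst, 0, dst.length);
        }
    }

    /** In-memory source with vectored support (demo only, completes futures synchronously). */
    static final class InMemoryVectoredSource extends InMemorySource implements VectoredSource {
        InMemoryVectoredSource(byte[] bytes) { super(bytes); }
        @Override
        public void readVectored(List<ByteRange> ranges) {
            for (ByteRange r : ranges) {
                byte[] buf = new byte[r.length];
                readFully(r.offset, buf);
                r.data.complete(buf);
            }
        }
    }

    /** Decodes a completed range as ASCII text without consuming the buffer. */
    static String text(BufferRange r) {
        return StandardCharsets.US_ASCII.decode(r.data.join().duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] file = "0123456789".getBytes(StandardCharsets.US_ASCII);
        List<BufferRange> ranges = Arrays.asList(new BufferRange(2, 3), new BufferRange(7, 2));
        readVectored(new InMemoryVectoredSource(file), ranges);
        for (BufferRange r : ranges) {
            System.out.println(r.offset + "+" + r.length + " -> " + text(r));
        }
    }
}
```

Chaining `thenApply` on the `byte[]` future keeps the conversion asynchronous: the `ByteBuffer` future completes whenever the underlying read does, so the bridge itself never blocks. A real implementation would presumably also decide how to honor the allocator argument, which is out of scope here.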
