Shekharrajak opened a new pull request, #6917: URL: https://github.com/apache/paimon/pull/6917
[core] Support generating splits with finer granularity than file level

### Purpose

Linked issue: close #5012

This PR implements support for generating splits at finer granularity than file level (e.g., row groups for Parquet, stripes for ORC) to significantly enhance concurrency when reading large files. This follows the proven pattern used by Spark and Flink, where files are split at natural boundaries (row groups/stripes) rather than file boundaries. The implementation leverages the existing `RawFile` infrastructure with its `offset` and `length` fields, ensuring backward compatibility while enabling improved parallelism for large file reads.

### Tests

Unit tests and integration tests are needed for:

- `ParquetMetadataReader`: Verify row group boundary extraction
- `OrcMetadataReader`: Verify stripe boundary extraction
- `FineGrainedSplitGenerator`: Verify split generation logic
- `ParquetReaderFactory.createReader(offset, length)`: Verify range-based reading
- `OrcReaderFactory.createReader(offset, length)`: Verify range-based reading

### API and Format

**New Configuration Options:**

- `source.split.file-enabled`: Enable finer-grained file splitting (default: `false`)
- `source.split.file-threshold`: Minimum file size to consider splitting (default: `128MB`)
- `source.split.file-max-splits`: Maximum splits per file (default: `100`)

**New Interfaces/Classes:**

- `FormatMetadataReader`: Interface for reading format-specific metadata (sketched below)
- `FileSplitBoundary`: Represents split boundaries (offset, length, rowCount)
- `ParquetMetadataReader`: Extracts row group boundaries from Parquet files (see the boundary-extraction sketch below)
- `OrcMetadataReader`: Extracts stripe boundaries from ORC files
- `FineGrainedSplitGenerator`: Decorator for `SplitGenerator` that enables fine-grained splitting (see the splitting sketch below)

**Extended Interfaces:**

- `FormatReaderFactory.createReader(Context, offset, length)`: Now implemented for Parquet and ORC
- `DataSplit`: Added transient `fileSplitBoundaries` field (not serialized, for backward compatibility)

**Storage Format:**

No changes to the storage format. This is a read-time optimization that does not affect how data is written or stored.

### Documentation

This change introduces a new feature that should be documented:

1. **Configuration Guide**: Document the new `source.split.file-enabled` and related options
2. **Performance Tuning Guide**: Explain when and how to use fine-grained splitting for optimal performance
3. **API Documentation**: Document the new `FormatMetadataReader` interface and implementations

The feature is disabled by default to maintain backward compatibility. Users can enable it by setting `source.split.file-enabled=true` in their table options.
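
To make the new abstractions concrete, here is a minimal sketch of the shapes described above for `FileSplitBoundary` and `FormatMetadataReader`. The type names follow the classes listed in this PR, but the method names, the `FileIO`/`Path` parameter types, and the exact signatures are assumptions for illustration, not the code in the change itself.

```java
// Illustrative sketch only; in the PR these are separate classes with whatever
// signatures the change actually defines.
import java.io.IOException;
import java.util.List;

import org.apache.paimon.fs.FileIO; // assumed parameter types, not confirmed by the PR text
import org.apache.paimon.fs.Path;

/** Reads format-specific metadata (Parquet footer, ORC tail) to find natural split points. */
interface FormatMetadataReader {
    List<FileSplitBoundary> readBoundaries(FileIO fileIO, Path file) throws IOException;
}

/** One candidate split inside a data file: a Parquet row group or an ORC stripe. */
class FileSplitBoundary {
    private final long offset;   // byte offset where the row group / stripe starts
    private final long length;   // byte length of the row group / stripe
    private final long rowCount; // number of rows contained in the row group / stripe

    FileSplitBoundary(long offset, long length, long rowCount) {
        this.offset = offset;
        this.length = length;
        this.rowCount = rowCount;
    }

    long offset() {
        return offset;
    }

    long length() {
        return length;
    }

    long rowCount() {
        return rowCount;
    }
}
```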
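
For the Parquet side, the boundaries come straight from the file footer. The sketch below shows how parquet-mr exposes row group offsets, compressed sizes, and row counts; the actual `ParquetMetadataReader` in this PR presumably goes through Paimon's own file IO abstraction rather than Hadoop's, so treat the `Configuration`/`HadoopInputFile` wiring here as an assumption. It reuses the `FileSplitBoundary` sketch above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

/** Hypothetical sketch: extracting row group boundaries from a Parquet footer with parquet-mr. */
class ParquetBoundarySketch {

    static List<FileSplitBoundary> readBoundaries(Configuration conf, Path file) throws IOException {
        List<FileSplitBoundary> boundaries = new ArrayList<>();
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                // Each block is one row group: its starting byte position, compressed size,
                // and row count give exactly the (offset, length, rowCount) triple above.
                boundaries.add(
                        new FileSplitBoundary(
                                block.getStartingPos(),
                                block.getCompressedSize(),
                                block.getRowCount()));
            }
        }
        return boundaries;
    }
}
```

The ORC side is analogous: the ORC reader exposes per-stripe offset, length, and row count through its stripe information, which maps onto the same triple.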
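
The three configuration options interact roughly as follows. This is a hypothetical sketch of the kind of logic a decorator like `FineGrainedSplitGenerator` could apply; the method name, the `long[] {offset, length}` result shape, and the merge-adjacent-units strategy are illustrative, not taken from the PR. It reuses the sketched `FileSplitBoundary` from above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of how the split options could bound per-file splitting. */
class FineGrainedSplitSketch {

    /**
     * Turns one file into at most maxSplits (offset, length) ranges aligned to row group /
     * stripe boundaries, or keeps the whole file when it is below the size threshold.
     * Assumes maxSplits >= 1.
     */
    static List<long[]> splitFile(
            long fileSize, List<FileSplitBoundary> boundaries, long thresholdBytes, int maxSplits) {
        if (fileSize < thresholdBytes || boundaries.isEmpty()) {
            // Small file (or no usable metadata): fall back to a single file-level split.
            return Collections.singletonList(new long[] {0L, fileSize});
        }
        // Merge adjacent row groups / stripes so the file yields at most maxSplits ranges.
        int unitsPerSplit = (int) Math.ceil((double) boundaries.size() / maxSplits);
        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < boundaries.size(); i += unitsPerSplit) {
            FileSplitBoundary first = boundaries.get(i);
            FileSplitBoundary last =
                    boundaries.get(Math.min(i + unitsPerSplit, boundaries.size()) - 1);
            long end = last.offset() + last.length();
            splits.add(new long[] {first.offset(), end - first.offset()});
        }
        return splits;
    }
}
```

In this sketch, with the defaults listed above, a 1 GB Parquet file with 8 row groups would yield 8 ranges (well under the 100-split cap), while a 64 MB file would remain a single split because it is below the 128 MB threshold.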
