Shekharrajak opened a new pull request, #6917:
URL: https://github.com/apache/paimon/pull/6917

   [core] Support generating splits with finer granularity than file level
   
   ### Purpose
   
   Linked issue: close #5012
   
   This PR adds support for generating splits at finer granularity than the 
file level (row groups for Parquet, stripes for ORC), significantly improving 
read concurrency for large files. It follows the pattern proven in Spark and 
Flink, where files are split at these natural format boundaries rather than at 
file boundaries.
   
   The implementation leverages existing `RawFile` infrastructure with `offset` 
and `length` fields, ensuring backward compatibility while enabling improved 
parallelism for large file reads.
   
   ### Tests
   
   Unit tests and integration tests are needed for:
   - `ParquetMetadataReader`: Verify row group boundary extraction
   - `OrcMetadataReader`: Verify stripe boundary extraction
   - `FineGrainedSplitGenerator`: Verify split generation logic
   - `ParquetReaderFactory.createReader(offset, length)`: Verify range-based 
reading
   - `OrcReaderFactory.createReader(offset, length)`: Verify range-based reading
   
   
   ### API and Format
   
   **New Configuration Options:**
   - `source.split.file-enabled`: Enable finer-grained file splitting (default: 
`false`)
   - `source.split.file-threshold`: Minimum file size to consider splitting 
(default: `128MB`)
   - `source.split.file-max-splits`: Maximum splits per file (default: `100`)
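   How these options might interact is sketched below (an assumption based on 
the stated semantics, not the PR's code: whether sizing uses a target split 
size is a guess; only the option names and defaults come from this PR). A file 
is only considered for splitting when the feature is on and the file exceeds 
the threshold, and the number of splits per file is capped.

```java
public class SplitEligibility {

    // Defaults mirror the options listed above:
    // source.split.file-threshold = 128MB, source.split.file-max-splits = 100.
    static final long DEFAULT_THRESHOLD = 128L * 1024 * 1024;
    static final int DEFAULT_MAX_SPLITS = 100;

    /** A file is split only when the feature is enabled and the file exceeds the threshold. */
    public static boolean shouldSplit(boolean enabled, long fileSize, long threshold) {
        return enabled && fileSize >= threshold;
    }

    /** Caps the per-file split count at maxSplits (hypothetical sizing rule). */
    public static int splitCount(long fileSize, long targetSplitSize, int maxSplits) {
        long raw = (fileSize + targetSplitSize - 1) / targetSplitSize; // ceiling division
        return (int) Math.min(Math.max(raw, 1), maxSplits);
    }
}
```

   Under this rule a 1 GB file with a 128 MB target yields 8 splits, while a 
very large file is capped at 100 splits rather than overwhelming the scheduler.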
   
   **New Interfaces/Classes:**
   - `FormatMetadataReader`: Interface for reading format-specific metadata
   - `FileSplitBoundary`: Represents split boundaries (offset, length, rowCount)
   - `ParquetMetadataReader`: Extracts row group boundaries from Parquet files
   - `OrcMetadataReader`: Extracts stripe boundaries from ORC files
   - `FineGrainedSplitGenerator`: Decorator for `SplitGenerator` that enables 
fine-grained splitting
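   One plausible shape for these types, inferred from the names and fields 
listed above (actual signatures in the PR may differ; `FixedBoundaryReader` and 
the enclosing `SplitMetadata` class are invented for illustration):

```java
import java.io.IOException;
import java.util.List;

public class SplitMetadata {

    /** Split boundary within one file: offset, length, rowCount (as listed above). */
    public record FileSplitBoundary(long offset, long length, long rowCount) {}

    /** Format-specific metadata reader; Parquet and ORC would each supply one. */
    public interface FormatMetadataReader {
        /** Returns the file's natural boundaries (row groups or stripes). */
        List<FileSplitBoundary> readBoundaries(String path) throws IOException;
    }

    /** Trivial in-memory implementation, for illustration only. */
    public static class FixedBoundaryReader implements FormatMetadataReader {
        private final List<FileSplitBoundary> boundaries;

        public FixedBoundaryReader(List<FileSplitBoundary> boundaries) {
            this.boundaries = boundaries;
        }

        @Override
        public List<FileSplitBoundary> readBoundaries(String path) {
            return boundaries;
        }
    }
}
```

   Keeping the reader behind an interface lets `FineGrainedSplitGenerator` stay 
format-agnostic: it asks for boundaries and never touches Parquet or ORC APIs 
directly.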
   
   **Extended Interfaces:**
   - `FormatReaderFactory.createReader(Context, offset, length)`: Now 
implemented for Parquet and ORC
   - `DataSplit`: Added transient `fileSplitBoundaries` field (not serialized 
for backward compatibility)
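   For range-based reading, a common convention (used by Parquet's own reader, 
which filters row groups by midpoint) is that a block belongs to the split 
whose byte range contains the block's midpoint, so every block is read exactly 
once even when a split boundary falls inside it. Whether this PR adopts the 
midpoint rule is an assumption; the sketch below only illustrates the idea, and 
`RangeSelection`/`blocksFor` are invented names.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSelection {

    /** (offset, length) of one row group or stripe. */
    public record Block(long offset, long length) {}

    /**
     * Selects the blocks a reader opened with (offset, length) should process.
     * The midpoint rule guarantees each block is claimed by exactly one split,
     * even when split boundaries fall inside a block.
     */
    public static List<Block> blocksFor(List<Block> blocks, long offset, long length) {
        List<Block> selected = new ArrayList<>();
        for (Block b : blocks) {
            long mid = b.offset() + b.length() / 2;
            if (mid >= offset && mid < offset + length) {
                selected.add(b);
            }
        }
        return selected;
    }
}
```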
   
   **Storage Format:** No changes to storage format. This is a read-time 
optimization that doesn't affect how data is written or stored.
   
   ### Documentation
   
   This change introduces a new feature that should be documented:
   
   1. **Configuration Guide**: Document the new `source.split.file-enabled` and 
related options
   2. **Performance Tuning Guide**: Explain when and how to use fine-grained 
splitting for optimal performance
   3. **API Documentation**: Document the new `FormatMetadataReader` interface 
and implementations
   
   The feature is disabled by default to maintain backward compatibility. Users 
can enable it by setting `source.split.file-enabled=true` in their table 
options.
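   Assembled as a plain options map, enabling the feature could look like this 
(the keys and default values are the ones documented above; how the map is 
passed to a table depends on the engine and is not shown):

```java
import java.util.HashMap;
import java.util.Map;

public class EnableFineGrainedSplits {

    /** Table options that turn on fine-grained splitting. */
    public static Map<String, String> options() {
        Map<String, String> options = new HashMap<>();
        options.put("source.split.file-enabled", "true");
        // Optional tuning knobs; the values below are the documented defaults.
        options.put("source.split.file-threshold", "128MB");
        options.put("source.split.file-max-splits", "100");
        return options;
    }
}
```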


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
