joe-ucp opened a new pull request, #8894:
URL: https://github.com/apache/arrow-rs/pull/8894

   Goal
   Introduce `Vec<Vec<Option<...>>>` for column and offset page indexes and 
update the parquet crate so it compiles and existing behavior continues to 
work, while keeping semantics changes as small and explicit as possible.
   
   Which issue does this PR close?
   Part of #8818.
   
   Rationale for this change
   The Parquet format allows page indexes (column index and offset index) to be 
optional per column chunk. The existing Rust types:
   
   - `pub type ParquetColumnIndex = Vec<Vec<ColumnIndexMetaData>>`
   - `pub type ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>`
   
   implicitly assume that once page indexes are loaded, every column chunk has 
an index. This makes it awkward (and sometimes impossible) to represent 
partially missing indexes and encourages `unwrap()`-based code paths.
   
   This PR introduces an `Option` at the leaf of these containers so we can 
explicitly represent “no index for this (row_group, column)” and prepare the 
codebase for safer semantics in follow-up PRs.
   
   What changes are included in this PR?
   - Type alias changes
   
     - Change the page index aliases to:
   
       ```rust
       pub type ParquetColumnIndex = Vec<Vec<Option<ColumnIndexMetaData>>>;
       pub type ParquetOffsetIndex = Vec<Vec<Option<OffsetIndexMetaData>>>;
       ```
   
     - Clarify the documentation comments for these aliases to mention `None` 
at the leaf as “no index for this column chunk”.
   
   - Minimal plumbing to keep existing code compiling
   
     - Update internal metadata builders and writers to construct 
`ParquetColumnIndex` and `ParquetOffsetIndex` with `Some(...)` where indexes 
are currently generated.
     - Update call sites that indexed into `[row_group][column]` and previously 
saw a bare `ColumnIndexMetaData` or `OffsetIndexMetaData` to accept an 
`Option<...>`. In this PR we mostly:
       - Wrap existing values in `Some(...)`, and
       - Where behavior previously assumed an index was always present, use 
`as_ref().expect(...)` to preserve the invariant and make the assumptions 
explicit in error messages.
   
   - No behavioral logic changes yet
   
     - This PR intentionally avoids changing how missing indices are 
*interpreted* (for example, `ColumnIndexMetaData::NONE` is preserved where it 
was used before).
     - More nuanced handling of `None` vs `ColumnIndexMetaData::NONE`, and 
graceful behavior for partially missing indices, is left for later PRs.
   
   Are these changes tested?
   Yes.
   
   - Existing parquet tests pass with the new types and plumbing in place:
   
   
     cargo test --package parquet --lib
   
   No new behavior-focused tests are introduced in this PR; the intent is to 
introduce the new in-memory representation with minimal semantic changes and 
rely on existing tests to confirm that behavior is preserved.
   
   Are there any user-facing changes?
   Yes, at the type level:
   
   Public aliases:
   
   ParquetColumnIndex is now Vec<Vec<Option<ColumnIndexMetaData>>>
   
   ParquetOffsetIndex is now Vec<Vec<Option<OffsetIndexMetaData>>>
   
   Downstream code that accessed these aliases as indexes[row_group][column] 
will need to handle Option, e.g.:
   
   
   if let Some(idx) = indexes[rg][col].as_ref() {
       // use idx
   }
   The Parquet on-disk format and footer encoding are unchanged.
   
   Semantically, the goal of this PR is “same behavior, new types”; more 
meaningful changes in how missing indexes are represented and handled are done 
in follow-up PRs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to