Re: [PR] [thrift-remodel] PoC new form for column index [arrow-rs]

via GitHub Wed, 27 Aug 2025 07:14:23 -0700


alamb commented on PR #8191:
URL: https://github.com/apache/arrow-rs/pull/8191#issuecomment-3228387315


   > And while we're breaking things, do we actually like the column and page 
indexes as Vec<Vec<index>>? 
   
   I do not like this structure and find it mind-bending any time I have to 
deal with it.
   
   > I'm wondering if tacking them individually onto the ColumnChunkMetaData as 
Option<index> would be more intuitive?
   
   I do think this would be great (even though the Rust metadata would then no 
longer match the physical representation in the Parquet file). If we could do 
the same with BloomFilters, that would also resolve an outstanding wart with the API.
   
   The trick will be to have reasonable APIs to add/strip these extra index 
structures from the RowGroupMetadata when needed
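
To make the add/strip idea concrete, here is a rough sketch of what per-column
`Option<index>` fields could look like. Every name below (`ColumnChunkSketch`,
`RowGroupSketch`, `attach_column_indexes`, `strip_indexes`) is a hypothetical
stand-in invented for illustration; none of it is the existing arrow-rs API or
the design proposed in this PR.

```rust
// Hypothetical stand-ins for illustration only, NOT the real arrow-rs types.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnIndexSketch {
    pub null_pages: Vec<bool>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct OffsetIndexSketch {
    pub page_offsets: Vec<i64>,
}

/// Per-column metadata that owns its (optional) indexes directly,
/// instead of looking them up in a separate Vec<Vec<index>>.
#[derive(Debug, Clone, Default)]
pub struct ColumnChunkSketch {
    pub column_index: Option<ColumnIndexSketch>,
    pub offset_index: Option<OffsetIndexSketch>,
}

#[derive(Debug, Clone, Default)]
pub struct RowGroupSketch {
    pub columns: Vec<ColumnChunkSketch>,
}

impl RowGroupSketch {
    /// Attach one column index per column chunk.
    /// Fails if the number of indexes does not match the number of columns.
    pub fn attach_column_indexes(
        &mut self,
        indexes: Vec<ColumnIndexSketch>,
    ) -> Result<(), String> {
        if indexes.len() != self.columns.len() {
            return Err(format!(
                "expected {} column indexes, got {}",
                self.columns.len(),
                indexes.len()
            ));
        }
        for (col, idx) in self.columns.iter_mut().zip(indexes) {
            col.column_index = Some(idx);
        }
        Ok(())
    }

    /// Drop all page-level indexes and return how many column chunks had any.
    pub fn strip_indexes(&mut self) -> usize {
        let mut stripped = 0;
        for col in &mut self.columns {
            let had_column_index = col.column_index.take().is_some();
            let had_offset_index = col.offset_index.take().is_some();
            if had_column_index || had_offset_index {
                stripped += 1;
            }
        }
        stripped
    }
}
```

With this shape a caller reads `row_group.columns[i].column_index` directly
rather than indexing by row group and then by column, and stripping the indexes
is a single pass of `Option::take`.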
   
   > This is part something DataFusion could do, part up to Parquet itself. In 
my mind one of the biggest mistakes / bottlenecks is trying to load row 
group stats as part of the thrift data. 
   > IMO the footer / thrift data should be basic metadata (the schema) + data 
offsets + index type and index offsets. Then the index types / offsets are used 
to load page indexes, bloom filters, row group zone maps and future types of 
indexes. But that's a very ambitious goal that I don't have the bandwidth to 
push for atm.
   
   I don't think we can change the location of the row group statistics (as 
that is part of the format)
   
   What we *could* do is change the metadata parser so that it can skip parsing 
some/all of the row group statistics on the initial pass, and then have an 
additional API to parse the statistics when we know they will actually be 
needed. 
   
   This is my understanding of what @adrian-thurston suggested in this 
DataFusion issue:
   - https://github.com/apache/datafusion/issues/16200
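
A minimal sketch of the "skip on the first pass, parse on demand" idea,
assuming the parser can keep the raw statistics bytes around. `LazyStats`,
`StatsSketch`, and `decode_stats` are hypothetical names, and the 16-byte
little-endian min/max layout is a toy encoding chosen so the example runs on
its own; it is not the Thrift encoding Parquet actually uses.

```rust
use std::cell::OnceCell;

// Hypothetical decoded form; real Parquet statistics carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct StatsSketch {
    pub min: i64,
    pub max: i64,
}

/// Holds the undecoded statistics bytes and decodes them at most once,
/// the first time a caller actually asks for them.
pub struct LazyStats {
    raw: Vec<u8>,
    decoded: OnceCell<Option<StatsSketch>>,
}

impl LazyStats {
    /// The initial metadata pass only stores the bytes; nothing is parsed here.
    pub fn new(raw: Vec<u8>) -> Self {
        Self {
            raw,
            decoded: OnceCell::new(),
        }
    }

    /// True once a decode has been attempted (successfully or not).
    pub fn is_decoded(&self) -> bool {
        self.decoded.get().is_some()
    }

    /// Decode on first access and cache the result for later calls.
    pub fn get(&self) -> Option<&StatsSketch> {
        self.decoded
            .get_or_init(|| decode_stats(&self.raw))
            .as_ref()
    }
}

/// Toy decoder: bytes 0..8 are min, bytes 8..16 are max, both little-endian.
/// Returns None if the buffer is too short.
fn decode_stats(raw: &[u8]) -> Option<StatsSketch> {
    let min = i64::from_le_bytes(raw.get(0..8)?.try_into().ok()?);
    let max = i64::from_le_bytes(raw.get(8..16)?.try_into().ok()?);
    Some(StatsSketch { min, max })
}
```

The point of the sketch is only the control flow: a reader that never calls
`get` never pays the decoding cost, which is the saving the comment above is
after.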
   
   
   
   
   


