I would agree that, at least for our use cases, this trade-off would not be favorable, so we would rather always write some metadata for "empty" columns and thereby keep O(1) random I/O into the columns array.
If I understand the use case correctly, though, this is mostly meant for completely empty columns. Thus, storing this information per row group seems unnecessary.

*So what about this alternative proposal that actually combines the advantages of both:*

How about turning things around: instead of having a schema_index in the ColumnMetadata, we could have a column_metadata_index in the schema. If that index is missing/-1, this signifies that the column is empty, so no metadata will be present for it. With this, we would get the best of both worlds: we would always have O(1) random I/O, even in the case of such empty columns (as we would use the column_metadata_index for the lookup), and we would not need to store any ColumnMetadata for empty columns.

Having given this a second thought, it also makes more sense in general. As the navigation direction is usually from schema to metadata (not vice versa!), the schema should point us to the correct metadata instead of the metadata pointing us to the correct schema entry.

(I'll also post this suggestion in the PR for reference.)

Cheers,
Jan

On Tue, 4 Jun 2024 at 14:20, Antoine Pitrou <anto...@python.org> wrote:

> On Tue, 4 Jun 2024 10:52:54 +0200
> Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.INVALID>
> wrote:
> > >
> > > Finally, one point I wanted to highlight here (I also mentioned it in the
> > > PR): If we want random access, we have to abolish the concept that the data
> > > in the columns array is in a different order than in the schema. Your PR
> > > [1] even added a new field schema_index for matching between
> > > ColumnMetaData and schema position, but this kills random access. If I want
> > > to read the third column in the schema, then do an O(1) random access into
> > > the third column chunk only to notice that its schema index is totally
> > > different and therefore I need a full exhaustive search to find the column
> > > that actually belongs to the third column in the schema, then all our
> > > random access efforts are in vain.
> >
> > `schema_index` is useful to implement
> > https://issues.apache.org/jira/browse/PARQUET-183 which is more and more
> > prevalent as schemata become wider.
>
> But this means a scan of all column chunk metadata in a row group is
> required to know if a particular column exists there? Or am I missing
> something?
>
> Regards
>
> Antoine.
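
To make the column_metadata_index idea from the reply above a bit more concrete, here is a minimal Python sketch. All class and function names are made up for illustration and are not the actual Parquet Thrift definitions. It assumes the index is stored once per schema column and is valid for every row group, with None (or -1) standing for "column is empty, no metadata written".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnMetadata:
    # Stand-in for the real per-chunk metadata (hypothetical, heavily reduced).
    num_values: int


@dataclass
class SchemaElement:
    name: str
    # Hypothetical field from the proposal: position of this column in each
    # row group's columns array. None (or -1) marks a completely empty column.
    column_metadata_index: Optional[int] = None


@dataclass
class RowGroup:
    # Only non-empty columns are stored here.
    columns: List[ColumnMetadata]


def lookup(schema: List[SchemaElement], row_group: RowGroup,
           schema_pos: int) -> Optional[ColumnMetadata]:
    """Go from schema position to chunk metadata in O(1), without scanning."""
    idx = schema[schema_pos].column_metadata_index
    if idx is None or idx < 0:
        return None  # empty column: nothing was written for it
    return row_group.columns[idx]


# Three schema columns; "b" is empty, so only two metadata entries exist.
schema = [
    SchemaElement("a", 0),
    SchemaElement("b", None),
    SchemaElement("c", 1),
]
row_group = RowGroup(columns=[ColumnMetadata(10), ColumnMetadata(7)])
```

Here lookup(schema, row_group, 2) indexes directly into the second stored entry, and lookup(schema, row_group, 1) returns None without touching the columns array. With a schema_index stored on the metadata side instead, finding column "c" would require scanning the columns array.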