On Tue, 4 Jun 2024 10:52:54 +0200
Alkis Evlogimenos <alkis.evlogime...@databricks.com.INVALID> wrote:
> >
> > Finally, one point I wanted to highlight here (I also mentioned it in the
> > PR): If we want random access, we have to abolish the concept that the data
> > in the columns array is in a different order than in the schema. Your PR
> > [1] even added a new field schema_index for matching between
> > ColumnMetaData and schema position, but this kills random access. If I want
> > to read the third column in the schema, then do a O(1) random access into
> > the third column chunk only to notice that it's schema index is totally
> > different and therefore I need a full exhaustive search to find the column
> > that actually belongs to the third column in the schema, then all our
> > random access efforts are in vain.
>
> `schema_index` is useful to implement
> https://issues.apache.org/jira/browse/PARQUET-183 which is more and more
> prevalent as schemata become wider.

But this means that a scan of all column chunk metadata in a row group is
required to know whether a particular column exists there? Or am I missing
something?

Regards

Antoine.
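
To make the question concrete, here is a minimal Python sketch of the two lookup strategies under discussion. It is not taken from the PR: `ColumnChunk`, `find_positional` and `find_by_scan` are hypothetical names, only the `schema_index` field is modeled, and the row group is assumed to omit some schema columns, as PARQUET-183 would allow.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for ColumnMetaData; only schema_index is modeled.
    schema_index: int


def find_positional(columns: List[ColumnChunk], schema_pos: int) -> Optional[ColumnChunk]:
    """O(1) lookup. Only correct if every row group stores one chunk per
    schema column, in schema order."""
    return columns[schema_pos] if schema_pos < len(columns) else None


def find_by_scan(columns: List[ColumnChunk], schema_pos: int) -> Optional[ColumnChunk]:
    """O(n) lookup. Needed once chunks may be missing or reordered, because
    array position no longer implies schema position."""
    for chunk in columns:
        if chunk.schema_index == schema_pos:
            return chunk
    return None


# Assumed example: a row group in which schema columns 1 and 3 are absent.
row_group = [ColumnChunk(0), ColumnChunk(2), ColumnChunk(4)]
```

With this row group, `find_positional(row_group, 2)` returns the chunk whose `schema_index` is 4, which is the wrong column, while `find_by_scan(row_group, 1)` has to visit every chunk before it can report that column 1 is absent.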