Re: [DISCUSS] schema_index

Antoine Pitrou Tue, 04 Jun 2024 05:20:02 -0700

On Tue, 4 Jun 2024 10:52:54 +0200
Alkis Evlogimenos
<alkis.evlogime...@databricks.com.INVALID>
wrote:
> >
> > Finally, one point I wanted to highlight here (I also mentioned it in the
> > PR): If we want random access, we have to abolish the concept that the data
> > in the columns array is in a different order than in the schema. Your PR
> > [1] even added a new field schema_index for matching between
> > ColumnMetaData and schema position, but this kills random access. If I want
> > to read the third column in the schema, then do an O(1) random access into
> > the third column chunk only to notice that its schema index is totally
> > different and therefore I need a full exhaustive search to find the column
> > that actually belongs to the third column in the schema, then all our
> > random access efforts are in vain.  
> 
> `schema_index` is useful to implement
> https://issues.apache.org/jira/browse/PARQUET-183 which is more and more
> prevalent as schemata become wider.


But this means a scan of all column chunk metadata in a row group is
required to know if a particular column exists there? Or am I missing
something?
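
To make the concern concrete, here is a minimal sketch in Python. The
ColumnChunk class and both helper functions are hypothetical, not taken
from any Parquet implementation; only the schema_index field name comes
from the PR under discussion. It assumes a row group may omit column
chunks, so list position and schema position no longer coincide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for column chunk metadata; only the
    # schema_index field discussed in this thread is modelled.
    schema_index: int


def find_by_position(row_group: list[ColumnChunk], schema_pos: int) -> ColumnChunk:
    # Dense row group (assumption: one chunk per schema column, in schema
    # order): the schema position is the list index, so lookup is O(1).
    return row_group[schema_pos]


def find_by_scan(row_group: list[ColumnChunk], schema_pos: int) -> Optional[ColumnChunk]:
    # Sparse row group: chunks may be missing, so every chunk's
    # schema_index has to be inspected until a match is found, O(n).
    for chunk in row_group:
        if chunk.schema_index == schema_pos:
            return chunk
    return None  # the column has no chunk in this row group


# A sparse row group holding chunks for schema columns 0, 2 and 5 only.
sparse = [ColumnChunk(0), ColumnChunk(2), ColumnChunk(5)]
present = find_by_scan(sparse, 2)
absent = find_by_scan(sparse, 3)
```

Under these assumptions, answering "does column 3 exist in this row
group?" requires looking at every chunk before the answer can be "no".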

Regards

Antoine.
