> For most serializers (flatbuffers, protobufs) this means zero cost
> on the wire

It is not quite zero size on the wire, but it is worth pointing out
that SizeStatistics [1] contains all the information necessary to
determine whether a column is all nulls. Combined with statistics, if
they are exact, it also allows one to determine whether a column is
entirely a single value. If the size overhead is reasonable for those
two elements, then I think the main consideration is whether we should
change the spec at some point to make writing these columns entirely
optional?

Thanks,
Micah

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L845

On Tue, Jun 4, 2024 at 9:58 AM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

> The drawback with having the reverse mapping is that only columns
> that are empty in all row groups can be elided. Columns that are
> empty in only some row groups can't. I do not have good stats to
> decide either way.
>
> That said, if we assume there is some post-processing done after
> deserializing FileMetaData, one can build the forward indexing from
> schema to column metadata *per rowgroup* with one pass. Such
> processing is very cheap: it shouldn't take more than a couple dozen
> microseconds.
>
> Even better if we leave the representation dense, plus make empty
> columns serialize to all default values of ColumnMetaData. For most
> serializers (flatbuffers, protobufs) this means zero cost on the
> wire. If we want to avoid data pages, smart parquet writers can write
> one page of all nulls and point all empty columns to the same page?
>
> On Tue, Jun 4, 2024 at 3:48 PM Jan Finis <jpfi...@gmail.com> wrote:
>
> > I would agree that at least for our use cases, this trade-off would
> > not be favorable, so we would rather always write some metadata for
> > "empty" columns and therefore get random I/O into the columns array.
> >
> > If I understand the use case correctly though, then this is mostly
> > meant for completely empty columns.
> > Thus, storing this information per row group seems unnecessary.
> >
> > *So what about this alternative proposal that actually combines the
> > advantages of both:*
> >
> > How about just turning things around: Instead of having a
> > schema_index in the ColumnMetadata, we could have a
> > column_metadata_index in the schema. If that index is missing/-1,
> > then this signifies that the column is empty, so no metadata will be
> > present for it. With this, we would get the best of both worlds: We
> > would always have O(1) random I/O even in case of such empty columns
> > (as we would use the column_metadata_index for the lookup) and we
> > would not need to store any ColumnMetadata for empty columns.
> >
> > After giving this a second thought, this also makes more sense in
> > general. As the navigation direction is usually from schema to
> > metadata (not vice versa!), the schema should point us to the
> > correct metadata instead of the metadata pointing us to the correct
> > schema entry.
> >
> > (I'll post this suggestion also into the PR for reference)
> >
> > Cheers,
> > Jan
> >
> > On Tue, Jun 4, 2024 at 2:20 PM Antoine Pitrou <anto...@python.org>
> > wrote:
> >
> > > On Tue, 4 Jun 2024 10:52:54 +0200
> > > Alkis Evlogimenos
> > > <alkis.evlogime...@databricks.com.INVALID>
> > > wrote:
> > > > >
> > > > > Finally, one point I wanted to highlight here (I also
> > > > > mentioned it in the PR): If we want random access, we have to
> > > > > abolish the concept that the data in the columns array is in
> > > > > a different order than in the schema. Your PR [1] even added
> > > > > a new field schema_index for matching between ColumnMetaData
> > > > > and schema position, but this kills random access.
> > > > > If I want to read the third column in the schema, then do an
> > > > > O(1) random access into the third column chunk only to notice
> > > > > that its schema index is totally different and therefore I
> > > > > need a full exhaustive search to find the column that
> > > > > actually belongs to the third column in the schema, then all
> > > > > our random access efforts are in vain.
> > > >
> > > > `schema_index` is useful to implement
> > > > https://issues.apache.org/jira/browse/PARQUET-183 which is more
> > > > and more prevalent as schemata become wider.
> > >
> > > But this means a scan of all column chunk metadata in a row group
> > > is required to know if a particular column exists there? Or am I
> > > missing something?
> > >
> > > Regards
> > >
> > > Antoine.
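
For reference, the two lookup directions discussed in this thread can be sketched roughly as below. This is only an illustration in Python under stated assumptions: `ColumnMeta`, `build_forward_index`, `lookup`, and `lookup_by_schema_pointer` are hypothetical names, not definitions from parquet.thrift or from any Parquet library. The first function is the one-pass post-processing that turns a per-column `schema_index` into a schema-to-metadata index for one row group; the last one shows the reversed pointer (a `column_metadata_index` stored with the schema, with -1 meaning "no metadata written").

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnMeta:
    """Hypothetical stand-in for one column chunk's metadata."""
    schema_index: int  # position of this column in the schema
    num_values: int


def build_forward_index(num_schema_columns: int,
                        columns: List[ColumnMeta]) -> List[Optional[int]]:
    """One pass over a row group's columns array.

    Returns a list mapping schema position -> position in `columns`,
    or None when the row group carries no metadata for that column.
    """
    forward: List[Optional[int]] = [None] * num_schema_columns
    for pos, col in enumerate(columns):
        forward[col.schema_index] = pos
    return forward


def lookup(forward: List[Optional[int]], columns: List[ColumnMeta],
           schema_pos: int) -> Optional[ColumnMeta]:
    """O(1) lookup once the forward index has been built."""
    pos = forward[schema_pos]
    return None if pos is None else columns[pos]


def lookup_by_schema_pointer(column_metadata_index: List[int],
                             columns: List[ColumnMeta],
                             schema_pos: int) -> Optional[ColumnMeta]:
    """Reversed direction: the schema stores the pointer, -1 = empty."""
    pos = column_metadata_index[schema_pos]
    return None if pos < 0 else columns[pos]


# Four schema columns; column 2 is empty and has no metadata written.
row_group = [ColumnMeta(3, 10), ColumnMeta(0, 10), ColumnMeta(1, 7)]
forward = build_forward_index(4, row_group)
```

With the forward index, the exhaustive search per lookup is replaced by one linear pass per row group after deserialization. With the reversed pointer no pass is needed, but the same pointer is shared by every row group, which is why only columns that are empty in all row groups can be elided.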