Re: [DISCUSS] schema_index

Micah Kornfield Wed, 05 Jun 2024 10:04:12 -0700

>
> For most
> serializers (flatbuffers, protobufs) this means zero cost on the wire



It is not quite zero size on the wire, but it is worth pointing out the
SizeStatistics [1] contains all the information necessary to determine if a
column is all nulls.  Combined with statistics if they are exact, it also
allows one to determine if a column is entirely a single value.  If the
size overhead is reasonable for those two elements, then I think the main
consideration is whether we should be changing the spec at some point to
make writing these columns entirely optional?
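
As a rough illustration of the two checks above, here is a minimal sketch in Python. It assumes the metadata has already been decoded into plain values; the function names and the way the fields are passed in are hypothetical, and only `definition_level_histogram`, `null_count`, `min_value`, `max_value`, `is_min_value_exact` and `is_max_value_exact` correspond to fields in parquet.thrift.

```python
from typing import Optional, Sequence


def is_all_null(definition_level_histogram: Optional[Sequence[int]],
                max_definition_level: int) -> Optional[bool]:
    """Return True/False when the histogram decides it, None when unknown.

    Assumption: a leaf value is non-null only when its definition level
    equals max_definition_level, so a zero count in the last bucket means
    the column chunk holds no non-null values.
    """
    if max_definition_level == 0:
        return False  # nothing in the path is optional, so nulls are impossible
    if not definition_level_histogram:
        return None  # the histogram is optional and may be absent
    if len(definition_level_histogram) != max_definition_level + 1:
        return None  # unexpected shape, refuse to guess
    return definition_level_histogram[max_definition_level] == 0


def is_single_value(min_value: Optional[bytes], max_value: Optional[bytes],
                    min_exact: bool, max_exact: bool,
                    null_count: Optional[int]) -> Optional[bool]:
    """Return True when exact min/max coincide and there are no nulls."""
    if min_value is None or max_value is None or null_count is None:
        return None  # statistics missing
    if not (min_exact and max_exact):
        return None  # truncated bounds cannot prove equality
    return null_count == 0 and min_value == max_value
```

A reader could call `is_all_null(hist, max_def)` per column chunk and skip the data pages when it returns `True`; a `None` result means falling back to reading the pages.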

Thanks,
Micah

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L845

On Tue, Jun 4, 2024 at 9:58 AM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

> The drawback with having the reverse mapping is that only columns that are
> empty in all row groups can be elided. Columns that are empty in only some
> row groups can't. I do not have good stats to decide either way.
>
> That said, if we assume there is some post-processing done after
> deserializing FileMetaData, one can build the forward indexing from schema
> to column metadata *per rowgroup* with one pass. Such processing is very
> cheap: it shouldn't take more than a couple dozen micros.
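
The one-pass post-processing described above could look roughly like the following Python sketch. It assumes each column chunk in a row group carries the proposed `schema_index`; `build_forward_index`, `MISSING` and the flat list input are hypothetical names chosen for illustration.

```python
from typing import List, Sequence

MISSING = -1  # hypothetical sentinel: no column chunk for this schema leaf


def build_forward_index(schema_indices: Sequence[int],
                        num_leaf_columns: int) -> List[int]:
    """Map schema leaf position -> position in the row group's columns array.

    schema_indices[i] is the schema_index stored on the i-th column chunk
    of one row group. A single pass fills the forward table.
    """
    forward = [MISSING] * num_leaf_columns
    for position, schema_index in enumerate(schema_indices):
        forward[schema_index] = position
    return forward
```

After this pass, finding the chunk for a schema leaf is a single list lookup, and `MISSING` marks a leaf that has no chunk in that row group.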
>
>
> Even better if we leave the representation dense, plus make empty
> columns serialize to all default values of ColumnMetaData. For most
> serializers (flatbuffers, protobufs) this means zero cost on the wire. If
> we want to avoid data pages, smart parquet writers can write one page of
> all nulls and point all empty columns to the same page?
>
> On Tue, Jun 4, 2024 at 3:48 PM Jan Finis <jpfi...@gmail.com> wrote:
>
> > I would agree that at least for our use cases, this trade off would not be
> > favorable, so we would rather always write some metadata for "empty"
> > columns and therefore get random I/O into the columns array.
> >
> > If I understand the use case correctly though, then this is mostly meant
> > for completely empty columns. Thus, storing this information per row group
> > seems unnecessary.
> >
> > *So what about this alternative proposal that actually combines the
> > advantages of both:*
> >
> > How about just turning things around: Instead of having a schema_index in
> > the ColumnMetadata, we could have a column_metadata_index in the schema. If
> > that index is missing/-1, then this signifies that the column is empty, so
> > no metadata will be present for it. With this, we would get the best of
> > both worlds: We would always have O(1) random I/O even in case of such
> > empty columns (as we would use the column_metadata_index for the lookup)
> > and we would not need to store any ColumnMetadata for empty columns.
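
A minimal Python sketch of the lookup this proposal implies is shown below. `SchemaLeaf`, `lookup_column_metadata` and the use of plain strings for metadata entries are hypothetical stand-ins; only the `column_metadata_index` field and its missing/-1 convention come from the proposal.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SchemaLeaf:
    name: str
    # Missing (None) or -1 means the column is empty and has no metadata.
    column_metadata_index: Optional[int] = None


def lookup_column_metadata(leaf: SchemaLeaf,
                           columns: Sequence[str]) -> Optional[str]:
    """O(1) lookup from a schema leaf to its entry in the columns array."""
    idx = leaf.column_metadata_index
    if idx is None or idx < 0:
        return None  # empty column: nothing was written for it
    return columns[idx]
```

The columns array stays dense because empty leaves simply do not point into it.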
> >
> > After giving this a second thought, this also makes more sense in general.
> > As the navigation direction is usually from schema to metadata (not
> > vice versa!), the schema should point us to the correct metadata instead of
> > the metadata pointing us to the correct schema entry.
> >
> > (I'll post this suggestion also into the PR for reference)
> >
> > Cheers,
> > Jan
> >
> >
> >
> >
> > On Tue, 4 Jun 2024 at 14:20, Antoine Pitrou <anto...@python.org> wrote:
> >
> > > On Tue, 4 Jun 2024 10:52:54 +0200
> > > Alkis Evlogimenos
> > > <alkis.evlogime...@databricks.com.INVALID>
> > > wrote:
> > > > >
> > > > > > Finally, one point I wanted to highlight here (I also mentioned it
> > > > > > in the PR): If we want random access, we have to abolish the
> > > > > > concept that the data in the columns array is in a different order
> > > > > > than in the schema. Your PR [1] even added a new field schema_index
> > > > > > for matching between ColumnMetaData and schema position, but this
> > > > > > kills random access. If I want to read the third column in the
> > > > > > schema, then do an O(1) random access into the third column chunk
> > > > > > only to notice that its schema index is totally different and
> > > > > > therefore I need a full exhaustive search to find the column that
> > > > > > actually belongs to the third column in the schema, then all our
> > > > > > random access efforts are in vain.
> > > >
> > > > > `schema_index` is useful to implement
> > > > > https://issues.apache.org/jira/browse/PARQUET-183 which is more and more
> > > > > prevalent as schemata become wider.
> > >
> > > > But this means a scan of all column chunk metadata in a row group is
> > > > required to know if a particular column exists there? Or am I missing
> > > > something?
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
>
