This is similar to what we do internally to provide non-standard encodings
by duplicating data in customized index pages. It is the vendor's choice
whether to pay the extra storage cost for better encoding support, so I like
this idea of supporting encoding extensions.
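
As a rough illustration of the dual-encoding idea, here is a minimal sketch.
All names in it (EncodedPages, ColumnChunk.extensions, select_pages, the
"acme" vendor id) are hypothetical and are not taken from the PR or from the
parquet-format Thrift definitions:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EncodedPages:
    """Where one encoded copy of a column chunk lives in the file."""
    encoding: str
    offset: int  # byte offset of the copy within the file
    length: int  # size of the copy in bytes


@dataclass
class ColumnChunk:
    # Copy in a standard encoding; every reader can decode this one.
    standard: EncodedPages
    # Hypothetical extension area: vendor id -> alternative copy.
    extensions: Dict[str, EncodedPages] = field(default_factory=dict)


def select_pages(chunk: ColumnChunk, vendor_id: Optional[str] = None) -> EncodedPages:
    """Return the vendor copy if this reader understands it, else the standard copy."""
    if vendor_id is not None and vendor_id in chunk.extensions:
        return chunk.extensions[vendor_id]
    return chunk.standard


def storage_cost(chunk: ColumnChunk) -> int:
    """Total bytes stored for the chunk, counting every duplicated copy."""
    return chunk.standard.length + sum(p.length for p in chunk.extensions.values())


chunk = ColumnChunk(
    standard=EncodedPages("PLAIN", offset=100, length=400),
    extensions={"acme": EncodedPages("ACME_INT", offset=500, length=150)},
)
vendor_view = select_pages(chunk, "acme")  # vendor reader gets the ACME_INT copy
default_view = select_pages(chunk)         # any other reader gets the PLAIN copy
```

In this sketch the extra storage cost is simply the length of the additional
copy, which is the trade-off the vendor chooses to accept.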

Best,
Gang

On Thu, May 30, 2024 at 8:09 PM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

> With the extension point described here:
> https://github.com/apache/parquet-format/pull/254
>
> We can have vendor encodings without drawbacks.
>
> For example, suppose a vendor wants to add another encoding for integers. It
> extends ColumnChunk and embeds an additional location in the file where the
> alternative representation lives. The old encoding is preserved. The
> vendor's reader will read the new encoding from a different location in the
> file, while other readers will read the old one. If and when this new
> encoding is accepted as standard, the dual encoding of the column chunk can
> stop.
>
> On Thu, May 30, 2024 at 10:28 AM Antoine Pitrou <anto...@python.org>
> wrote:
>
> > On Thu, 30 May 2024 00:07:35 -0700
> > Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > > > A "vendor" encoding would also allow candidate encodings to be shared
> > > > across the ecosystem before they are eventually christened as regular
> > > > encodings in the Thrift metadata.
> > >
> > >
> > > I'm not a huge fan of this for two reasons:
> > > 1.  I think it makes it much more complicated for end-users to get
> > > support if they happen to have a file with a custom encoding.  There are
> > > already enough rough edges in compatibility between implementations that
> > > this gives another degree of freedom where things could break.
> >
> > Agreed, but how is this not a problem for "pluggable" encodings as well?
> >
> > > 2.  From a software supply chain perspective I think this makes
> > > Parquet a lot riskier if it is going to arbitrarily load/invoke code
> > > from potentially unknown sources.
> >
> > I'm not sure where that idea comes from. I did *not* suggest that
> > implementations load arbitrary code from third-party GitHub repositories
> > :-)
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>
