This is similar to what we do internally to provide non-standard encodings by duplicating data in customized index pages. Vendors are free to pay the extra storage cost in exchange for better encoding support. So I like this idea of supporting encoding extensions.
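To make the dual-encoding idea concrete, here is a minimal sketch of how a reader might choose between the standard copy and a vendor copy of the same column chunk. It is only an illustration: `Location`, `extensions`, `pick_location`, and the "acme" vendor id are hypothetical names, and this is not the actual Thrift schema or the layout proposed in the PR.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Location:
    """Where one representation of a column chunk lives (hypothetical)."""
    offset: int
    length: int
    encoding: str


@dataclass
class ColumnChunk:
    """Simplified stand-in for column chunk metadata, not the real schema."""
    standard: Location
    # Hypothetical extension slot: vendor id -> alternative representation.
    extensions: Dict[str, Location] = field(default_factory=dict)


def pick_location(chunk: ColumnChunk, vendor: Optional[str] = None) -> Location:
    """Return the vendor copy if this reader understands it, else the standard one."""
    if vendor is not None and vendor in chunk.extensions:
        return chunk.extensions[vendor]
    return chunk.standard


def read_raw(file_bytes: bytes, chunk: ColumnChunk, vendor: Optional[str] = None) -> bytes:
    """Slice the still-encoded bytes of the chosen representation out of the file."""
    loc = pick_location(chunk, vendor)
    return file_bytes[loc.offset:loc.offset + loc.length]


# The same data is stored twice: once in a standard encoding, once in a vendor one.
file_bytes = b"STANDARDDATA" + b"VENDOR"
chunk = ColumnChunk(
    standard=Location(offset=0, length=12, encoding="PLAIN"),
    extensions={"acme": Location(offset=12, length=6, encoding="ACME_INT")},
)

generic_view = read_raw(file_bytes, chunk)               # reader unaware of the extension
vendor_view = read_raw(file_bytes, chunk, vendor="acme")  # vendor's own reader
```

A reader that does not know the vendor id never looks at the extension, so the file stays readable everywhere; the cost is the duplicated storage.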
Best,
Gang

On Thu, May 30, 2024 at 8:09 PM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:

> With the extension point described here:
> https://github.com/apache/parquet-format/pull/254
>
> We can have vendor encodings without drawbacks.
>
> For example a vendor wants to add another encoding for integers. It extends
> ColumnChunk, and embeds an additional location in the file where the
> alternative representation lives. The old encoding is preserved. The
> vendor's reader will read the new encoding from a different location in the
> file, while other readers will read the old. If and when this new encoding
> is accepted as standard, the dual encoding of the column chunk can stop.
>
> On Thu, May 30, 2024 at 10:28 AM Antoine Pitrou <anto...@python.org>
> wrote:
>
> > On Thu, 30 May 2024 00:07:35 -0700
> > Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > > > A "vendor" encoding would also allow candidate encodings to be shared
> > > > across the ecosystem before they are eventually enchristened as
> > > > regular encodings in the Thrift metadata.
> > >
> > > I'm not a huge fan of this for two reasons:
> > > 1. I think it makes it much more complicated for end-users to get
> > > support if they happen to have a file with a custom encoding. There are
> > > already enough rough edges in compatibility between implementations that
> > > this gives another degree of freedom where things could break.
> >
> > Agreed, but how is this not a problem for "pluggable" encodings as well?
> >
> > > 2. From a software supply chain perspective I think this makes Parquet a
> > > lot riskier if it is going to arbitrarily load/invoke code from
> > > potentially unknown sources.
> >
> > I'm not sure where that idea comes from. I did *not* suggest that
> > implementations load arbitrary code from third-party Github repositories
> > :-)
> >
> > Regards
> >
> > Antoine.