I think it would be a good idea to have an extension mechanism that allows embedding extra information in the format, something akin to the reserved extension point Alkis is suggesting:

- The file can still be read by a standard Parquet implementation without extra libraries.
- Vendors can embed custom indices, duplicate data in a proprietary encoding, or add extra metadata while remaining compatible.

There are probably already a few implementations that add metadata this way today, by squatting on unused Thrift ids (and hoping they won't be used).
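As a rough illustration only (a sketch with invented names and field ids, not the wording of the actual proposal), such a reserved extension point could look something like this in parquet.thrift:

/**
 * Hypothetical sketch of a reserved extension point: a keyed list of opaque
 * payloads that standard readers skip entirely, while a vendor's reader
 * looks up the entries it understands.
 */
struct ExtensionEntry {
  /** Reverse-domain identifier of the vendor/feature, e.g. "com.example.secondary-index" */
  1: required string name

  /** Opaque, vendor-defined payload (a serialized index, extra metadata, ...) */
  2: required binary payload
}

FileMetaData and ColumnChunk would then carry something like an optional list<ExtensionEntry> extensions field at a reserved id, instead of each vendor picking an unused Thrift id and hoping for the best.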
It respects the "fully specified" nature of Parquet, and you won't have weird files you can't read without an opaque library. At the same time, it codifies how you add extra information in place.

On Thu, May 30, 2024 at 7:21 AM Gang Wu <ust...@gmail.com> wrote:

> This is similar to what we do internally to provide non-standard encoding
> by duplicating data in the customized index pages. It is the vendor's
> choice to pay extra storage cost for better encoding support. So I like
> this idea to support encoding extensions.
>
> Best,
> Gang
>
> On Thu, May 30, 2024 at 8:09 PM Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.invalid> wrote:
>
> > With the extension point described here:
> > https://github.com/apache/parquet-format/pull/254
> >
> > We can have vendor encodings without drawbacks.
> >
> > For example, a vendor wants to add another encoding for integers. It
> > extends ColumnChunk and embeds an additional location in the file where
> > the alternative representation lives. The old encoding is preserved. The
> > vendor's reader will read the new encoding from a different location in
> > the file, while other readers will read the old. If and when this new
> > encoding is accepted as standard, the dual encoding of the column chunk
> > can stop.
> >
> > On Thu, May 30, 2024 at 10:28 AM Antoine Pitrou <anto...@python.org>
> > wrote:
> >
> > > On Thu, 30 May 2024 00:07:35 -0700
> > > Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > A "vendor" encoding would also allow candidate encodings to be
> > > > > shared across the ecosystem before they are eventually enchristened
> > > > > as regular encodings in the Thrift metadata.
> > > >
> > > > I'm not a huge fan of this for two reasons:
> > > > 1. I think it makes it much more complicated for end-users to get
> > > > support if they happen to have a file with a custom encoding. There
> > > > are already enough rough edges in compatibility between
> > > > implementations that this gives another degree of freedom where
> > > > things could break.
> > >
> > > Agreed, but how is this not a problem for "pluggable" encodings as well?
> > >
> > > > 2. From a software supply chain perspective I think this makes
> > > > Parquet a lot riskier if it is going to arbitrarily load/invoke code
> > > > from potentially unknown sources.
> > >
> > > I'm not sure where that idea comes from. I did *not* suggest that
> > > implementations load arbitrary code from third-party Github
> > > repositories :-)
> > >
> > > Regards
> > >
> > > Antoine.
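To make the dual-encoding idea from Alkis' mail concrete, here is a purely hypothetical Thrift sketch (invented names and field ids, not the contents of PR #254) of the kind of extension a vendor could attach to a ColumnChunk:

/**
 * Hypothetical extension payload: points at an alternative, vendor-encoded
 * copy of the column chunk stored elsewhere in the same file.
 */
struct AlternativeEncodingRef {
  /** Identifier of the non-standard encoding, e.g. "com.example.int-codec-v1" */
  1: required string encoding_name

  /** Absolute offset in the file where the alternatively encoded pages start */
  2: required i64 file_offset

  /** Total size in bytes of the alternative representation */
  3: required i64 total_byte_size
}

A reader that recognizes encoding_name seeks to file_offset and decodes the alternative pages; every other reader ignores the extension and decodes the standard pages, so the file stays readable by stock implementations unless and until the encoding is standardized.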