I reached out and got some feedback [1][2].  My conclusion is that
metadata is a schema/control concern and compute is a data concern.
With that in mind, I would argue the compute layer can (perhaps
should?) always discard metadata.  If a user runs a query like
"SELECT a/b AS c FROM table" and wants the resulting column to carry
some kind of metadata (e.g. explaining that c is a derived column
based on a and b), then generating that combined metadata is the
responsibility of either the user or the layer that converts the query
into an execution plan.  It is not a responsibility of the compute
layer.
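
To make the split concrete, here is a minimal sketch in plain Python.
It is not Arrow code: Field, Column, compute_divide, plan_divide and
the "derived_from" key are all hypothetical names made up for
illustration, and the "PARQUET:field_id" key is only assumed as an
example of what field metadata might look like.  The compute function
only ever sees values, so metadata is dropped by construction, and the
planning step decides what the output column carries.

```python
from dataclasses import dataclass, field as dc_field
from typing import Dict, List


@dataclass
class Field:
    """Hypothetical schema entry: a column name plus free-form metadata."""
    name: str
    metadata: Dict[str, str] = dc_field(default_factory=dict)


@dataclass
class Column:
    """Hypothetical column: schema information plus the values."""
    field: Field
    values: List[float]


def compute_divide(left: List[float], right: List[float]) -> List[float]:
    """Compute layer: receives only values, so input metadata never reaches it."""
    return [x / y for x, y in zip(left, right)]


def plan_divide(a: Column, b: Column, out_name: str) -> Column:
    """Planning layer: runs the kernel and chooses the output metadata."""
    values = compute_divide(a.values, b.values)
    # Assumption: the planner wants to record how the column was derived.
    derived = {"derived_from": f"{a.field.name}/{b.field.name}"}
    return Column(Field(out_name, derived), values)


# Roughly "SELECT a/b AS c FROM table"
a = Column(Field("a", {"PARQUET:field_id": "1"}), [6.0, 9.0])
b = Column(Field("b", {"PARQUET:field_id": "2"}), [3.0, 3.0])
c = plan_divide(a, b, "c")
print(c.field.metadata)
print(c.values)
```

In this sketch the field ids of a and b never appear on c unless the
planner explicitly copies them, which is the behaviour I am arguing
for.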

[1] https://lists.apache.org/x/thread.html/r3396d802cb1b59c4f650f427f93f58290c5039995eac58f0a5459260@%3Cdev.iceberg.apache.org%3E
[2] https://lists.apache.org/x/thread.html/rb053bbc19e8a75802a9fe3efd2905df725df7cb7a76968ae81bd6903@%3Cdev.parquet.apache.org%3E

On Thu, May 13, 2021 at 5:52 AM Wes McKinney <[email protected]> wrote:
>
> Since the projects this is relevant for include things like Iceberg,
> which utilize the Parquet field ids, can we reach out to those
> communities (dev@parquet and dev@iceberg) to solicit their feedback?
>
> On Wed, May 12, 2021 at 2:21 PM Antoine Pitrou <[email protected]> wrote:
> >
> >
> > On 12/05/2021 at 21:19, Weston Pace wrote:
> > > The parquet format has a "field id" concept (unique integer identifier
> > > for a column) that gets promoted in the C++ implementation to a
> > > key/value pair in the field's metadata.
> >
> > I don't think anything says the "field id" should be unique. It's just
> > an opaque application-specific identifier.
> >
> > Regards
> >
> > Antoine.
> >
> > > This has led me to a few
> > > questions around how this field (or metadata in general) interacts
> > > with higher level APIs.
> > >
> > > 1)
> > >
> > > At the moment it appears that metadata survives a simple scan which
> > > seems correct.  It also seems pretty correct that the metadata should
> > > be lost on a complex transformation (e.g. projecting columns 'a' and
> > > 'b' into column 'c' = a/b, c should not have any of a or b's
> > > metadata?)
> > >
> > > That leaves a large amount of "in between".  Should the metadata be
> > > preserved on a cast?  What about a reordering operation?  What if a
> > > projection leaves the data unchanged but changes the field name?
> > >
> > > Is there a good simple rule for this?
> > >
> > > 2) Do we need to account for the case where a dataset contains
> > > multiple fragments where the fields are in a different order but the
> > > field IDs are consistent?  For example, the first fragment has columns
> > > [a/str, b/int] with field ids [1, 2] and the second fragment has
> > > columns [b/int, a/str] with field ids [2, 1].  Today I'm pretty sure
> > > we would fail to read this dataset.
> > >
> > > 3) A similar question is what happens if the column types are
> > > consistent but the field IDs are not (e.g. [a/int, b/str] and [a/int,
> > > b/str] with field ids [1, 2] and [2, 1]).  That's probably more
> > > generally tied to schema evolution and I don't think we need to do
> > > anything special there.
> > >
