The parquet format has a "field id" concept (unique integer identifier
for a column) that gets promoted in the C++ implementation to a
key/value pair in the field's metadata.  This has led me to a few
questions about how this field (or metadata in general) interacts
with higher-level APIs.
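
To make the setup concrete, here is a minimal pyarrow sketch of the
roundtrip I have in mind (the "PARQUET:field_id" metadata key is the
one the C++ implementation uses; that the writer also honors it on
the way out, and the file name, are just assumptions for the example):

import pyarrow as pa
import pyarrow.parquet as pq

# Fields carry their Parquet field ids as field-level metadata.
schema = pa.schema([
    pa.field("a", pa.string(), metadata={"PARQUET:field_id": "1"}),
    pa.field("b", pa.int64(), metadata={"PARQUET:field_id": "2"}),
])
table = pa.table({"a": ["x", "y"], "b": [1, 2]}, schema=schema)
pq.write_table(table, "example.parquet")

# After a plain read the id shows up again as field metadata,
# e.g. {b'PARQUET:field_id': b'1'}.
print(pq.read_table("example.parquet").schema.field("a").metadata)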

1) At the moment it appears that metadata survives a simple scan,
which seems correct.  It also seems right that the metadata should be
lost on a complex transformation (e.g. projecting columns 'a' and 'b'
into column 'c' = a/b; presumably 'c' should not carry any of a's or
b's metadata?)
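
In pyarrow terms, the kind of operations I mean look roughly like this
(a sketch only; the derived-column projection assumes a pyarrow recent
enough that compute functions accept dataset expressions, and the file
name is made up):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field("a", pa.float64(), metadata={"PARQUET:field_id": "1"}),
    pa.field("b", pa.float64(), metadata={"PARQUET:field_id": "2"}),
])
table = pa.table({"a": [2.0, 6.0], "b": [1.0, 3.0]}, schema=schema)
pq.write_table(table, "numeric.parquet")
dataset = ds.dataset("numeric.parquet", format="parquet")

# "Simple scan": the field-level metadata (and so the field ids)
# comes through.
scanned = dataset.to_table()
print(scanned.schema.field("a").metadata)

# "Complex transformation": project a and b into c = a / b;
# presumably c should not carry a's or b's metadata.
derived = dataset.to_table(
    columns={"c": pc.divide(ds.field("a"), ds.field("b"))}
)
print(derived.schema.field("c").metadata)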

That leaves a large amount of "in between".  Should the metadata be
preserved on a cast?  What about a reordering operation?  What if a
projection leaves the data unchanged but changes the field name?
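
Concretely, the in-between cases look something like the following
(plain Table operations, just to make the question tangible; whether
the metadata survives each step is exactly what I'm unsure about, so
the comments are questions rather than statements of behavior):

import pyarrow as pa

schema = pa.schema([
    pa.field("a", pa.int64(), metadata={"PARQUET:field_id": "1"}),
    pa.field("b", pa.int64(), metadata={"PARQUET:field_id": "2"}),
])
table = pa.table({"a": [1, 2], "b": [3, 4]}, schema=schema)

# Cast: should 'a' keep its field id metadata after a type change?
cast_table = table.cast(pa.schema([("a", pa.int32()), ("b", pa.int64())]))

# Reorder: should the metadata follow the columns to their new positions?
reordered = table.select(["b", "a"])

# Rename-only projection: the data is unchanged but the field name is not.
renamed = table.rename_columns(["x", "y"])

for t in (cast_table, reordered, renamed):
    print([f.metadata for f in t.schema])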

Is there a good simple rule for this?

2) Do we need to account for the case where a dataset contains
multiple fragments where the fields are in a different order but the
field IDs are consistent?  For example, the first fragment has columns
[a/str, b/int] with field ids [1, 2] and the second fragment has
columns [b/int, a/str] with field ids [2, 1].  Today I'm pretty sure
we would fail to read this dataset.
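
In case it is useful for reproducing, the scenario can be constructed
along these lines (a sketch; the helper name is mine, the writer
honoring "PARQUET:field_id" is again an assumption, and whether the
final read succeeds is exactly the question):

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def field(name, typ, fid):
    # Attach a Parquet field id to a field via its metadata.
    return pa.field(name, typ, metadata={"PARQUET:field_id": str(fid)})

# Fragment 1: columns [a/str, b/int] with field ids [1, 2].
schema1 = pa.schema([field("a", pa.string(), 1), field("b", pa.int64(), 2)])
pq.write_table(pa.table({"a": ["x"], "b": [1]}, schema=schema1),
               "frag1.parquet")

# Fragment 2: columns [b/int, a/str] with field ids [2, 1].
schema2 = pa.schema([field("b", pa.int64(), 2), field("a", pa.string(), 1)])
pq.write_table(pa.table({"b": [2], "a": ["y"]}, schema=schema2),
               "frag2.parquet")

# Does this dataset read successfully today?
dataset = ds.dataset(["frag1.parquet", "frag2.parquet"], format="parquet")
print(dataset.to_table())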

3) A similar question is what happens if the column types are
consistent but the field IDs are not (e.g. [a/int, b/str] and [a/int,
b/str] with field ids [1, 2] and [2, 1]).  That's probably tied more
generally to schema evolution, and I don't think we need to do
anything special there.
