The Parquet format has a "field id" concept (a unique integer identifier for a column) that the C++ implementation promotes to a key/value pair in the field's metadata. This has led me to a few questions about how this field (or metadata in general) interacts with higher-level APIs.
1) At the moment it appears that metadata survives a simple scan, which seems correct. It also seems correct that the metadata should be lost on a complex transformation (e.g. when projecting columns 'a' and 'b' into column 'c' = a/b, c should not have any of a's or b's metadata). That leaves a large amount of "in between". Should the metadata be preserved on a cast? What about a reordering operation? What if a projection leaves the data unchanged but changes the field name? Is there a good simple rule for this?

2) Do we need to account for the case where a dataset contains multiple fragments whose fields are in a different order but whose field IDs are consistent? For example, the first fragment has columns [a/str, b/int] with field ids [1, 2] and the second fragment has columns [b/int, a/str] with field ids [2, 1]. Today I'm pretty sure we would fail to read this dataset.

3) A similar question is what happens if the column names and types are consistent but the field IDs are not (e.g. [a/int, b/str] and [a/int, b/str] with field ids [1, 2] and [2, 1]). That's probably more generally tied to schema evolution, and I don't think we need to do anything special there.
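
To make questions 2 and 3 concrete, here is a rough sketch in plain Python of what resolving columns by field ID (rather than by position) could look like. This is not how the datasets API works today. `Field` and `align_by_field_id` are hypothetical names made up for illustration, and types are modeled as simple strings.

```python
from typing import NamedTuple


class Field(NamedTuple):
    # Hypothetical stand-in for a schema field carrying a Parquet field id.
    name: str
    type: str
    field_id: int


def align_by_field_id(reference, fragment_fields, fragment_columns):
    """Reorder a fragment's columns into the reference schema's order,
    matching on field id instead of position or name."""
    position = {f.field_id: i for i, f in enumerate(fragment_fields)}
    aligned = []
    for ref in reference:
        if ref.field_id not in position:
            raise KeyError(f"fragment has no field with id {ref.field_id}")
        i = position[ref.field_id]
        if fragment_fields[i].type != ref.type:
            # Same id but different type: the ids are not consistent
            # across fragments (the situation in question 3).
            raise TypeError(
                f"field id {ref.field_id}: expected {ref.type}, "
                f"got {fragment_fields[i].type}"
            )
        aligned.append(fragment_columns[i])
    return aligned


# Question 2: same field ids, different column order.
reference = [Field("a", "str", 1), Field("b", "int", 2)]
frag2_fields = [Field("b", "int", 2), Field("a", "str", 1)]
frag2_columns = [[10, 20], ["x", "y"]]
aligned = align_by_field_id(reference, frag2_fields, frag2_columns)

# Question 3: same names and types, but the field ids are swapped.
ref3 = [Field("a", "int", 1), Field("b", "str", 2)]
frag3_fields = [Field("a", "int", 2), Field("b", "str", 1)]
frag3_columns = [[1, 2], ["x", "y"]]
try:
    align_by_field_id(ref3, frag3_fields, frag3_columns)
    q3_error = None
except TypeError as exc:
    q3_error = str(exc)
```

Under these assumptions, the reordered fragment from question 2 lines up with the reference schema, while the swapped IDs from question 3 show up as a type mismatch rather than being silently accepted. Whether we would want an error there, or a fallback to name-based matching, is part of what I'm asking.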