I’ve recently been working on updating the spec for new types and type promotion cases in v3.
I was talking to Micah and he pointed out an issue with type promotion: the upper and lower bounds for data file columns that are kept in Avro manifests don’t have any information about the type that was used to encode the bounds.

For example, when writing to a table with a float column, 4: f, the manifest’s lower_bounds and upper_bounds maps will have an entry with the field ID (4) as the key and a 4-byte encoded float for the value. If column f is later promoted to double, those maps aren’t changed. The way we currently detect that the type was promoted is to check the binary value and read it as a float if there are 4 bytes instead of 8. This prevents us from adding int to double type promotion because when there are 4 bytes we would not know whether the value was originally an int or a float.

Several of the type promotion cases from my previous email hit this problem. Date/time types to string, int and long to string, and long to timestamp are all affected.

I think the best path forward is to add fewer type promotion cases to v3 and support only these new cases:

- int and long to string
- date to timestamp
- null/unknown to any
- any to variant (if supported by the Variant spec)

That list would allow us to keep using the current strategy and not add new metadata to our manifests to track the type.

My rationale for not adding new information to track the bound types at the time that the data file metadata is created is that it would inflate the size of manifests and push out the timeline for getting v3 done. Many of us would like to get v3 released to get the timestamp_ns and variant types out. And if we can get at least some of the promotion cases out, that’s better.

To address type promotion in the long term, I think that we should consider moving to Parquet manifests. This has been suggested a few times so that we can project just the lower and upper bounds that are needed for scan planning.
That would also fix type promotion because the manifest file schema would include full type information for the stats columns.

Given the complexity of releasing Parquet manifests, I think it makes more sense to get a few promotion cases done now in v3 and follow up with the rest in v4.

Ryan

--
Ryan Blue
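
P.S. For anyone who wants to see the ambiguity concretely, here is a minimal sketch of the length-based detection described above. It is not Iceberg code: decode_double_bound is a hypothetical helper, and the only thing assumed from the spec is that int and float bounds are stored as 4-byte little-endian values and double bounds as 8-byte little-endian values.

```python
import struct


def decode_double_bound(raw: bytes) -> float:
    """Decode a stored bound for a column whose current type is double.

    Hypothetical helper for illustration only. The manifest does not record
    the type the bound was written with, so the byte length is the only hint.
    """
    if len(raw) == 8:
        # Written after the column was already a double.
        return struct.unpack("<d", raw)[0]
    if len(raw) == 4:
        # Assumed to be a float, because float -> double is the only
        # promotion to double that is allowed today.
        return struct.unpack("<f", raw)[0]
    raise ValueError(f"unexpected bound length: {len(raw)}")


# A float bound of 1.5 and an int bound of 0x3FC00000 have identical bytes.
float_bound = struct.pack("<f", 1.5)
int_bound = struct.pack("<i", 0x3FC00000)

# If int -> double were allowed, the int bound would be silently read as 1.5.
misread = decode_double_bound(int_bound)
```

With only the raw bytes available, the reader has no way to tell the two cases apart, which is why the 4-byte case has to mean exactly one source type.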