I’ve recently been working on updating the spec for new types and type
promotion cases in v3.

I was talking to Micah and he pointed out an issue with type promotion: the
upper and lower bounds for data file columns that are kept in Avro
manifests don’t have any information about the type that was used to encode
the bounds.

For example, when writing to a table with a float column, 4: f, the
manifest’s lower_bounds and upper_bounds maps will have an entry with the
field ID (4) as the key and a 4-byte encoded float as the value. If column f
were later promoted to double, those maps aren’t rewritten. The way we
currently detect that the type was promoted is to check the length of the
binary value and read it as a float if there are 4 bytes instead of 8. This prevents us
from adding int to double type promotion because when there are 4 bytes we
would not know whether the value was originally an int or a float.
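
To make the ambiguity concrete, here is a minimal Python sketch of the
length-based check. It assumes bounds are stored as little-endian values
(4 bytes for int and float, 8 bytes for double); decode_double_bound is a
hypothetical helper for illustration, not code from any Iceberg
implementation.

```python
import struct


def decode_double_bound(raw: bytes) -> float:
    """Decode a bound for a column whose current type is double.

    Hypothetical helper: the stored bytes carry no type tag, so the only
    signal available is the length of the value.
    """
    if len(raw) == 8:
        # Written after the promotion: an 8-byte little-endian double.
        return struct.unpack("<d", raw)[0]
    if len(raw) == 4:
        # Assumed to be written before the promotion, as a 4-byte float.
        return struct.unpack("<f", raw)[0]
    raise ValueError(f"unexpected bound length: {len(raw)}")


# A float bound and an int bound can be byte-for-byte identical.
float_bound = struct.pack("<f", 1.5)
int_bound = struct.pack("<i", 1069547520)  # 0x3FC00000, the bits of float 1.5

# If int to double were allowed, this int bound would be misread as 1.5.
misread = decode_double_bound(int_bound)
```

With only the length to go on, the reader cannot tell int_bound from
float_bound, which is why int to double can’t be added under the current
strategy.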

Several of the type promotion cases from my previous email hit this
problem. Date/time types to string, int and long to string, and long to
timestamp are all affected. I think the best path forward is to add fewer
type promotion cases to v3 and support only these new cases:

   - int and long to string
   - date to timestamp
   - null/unknown to any
   - any to variant (if supported by the Variant spec)

That list would allow us to keep using the current strategy without adding
new metadata to our manifests to track the bound types. My rationale for not
recording the bound types when the data file metadata is created is that it
would inflate the size of manifests and push out the timeline for getting v3
done. Many of us would like to get v3 released to get the timestamp_ns and
variant types out, and shipping at least some of the promotion cases with it
is better than shipping none.

To address type promotion in the long term, I think that we should consider
moving to Parquet manifests. This has been suggested a few times so that we
can project just the lower and upper bounds that are needed for scan
planning. That would also fix type promotion because the manifest file
schema would include full type information for the stats columns. Given the
complexity of releasing Parquet manifests, I think it makes more sense to
get a few promotion cases done now in v3 and follow up with the rest in v4.

Ryan

-- 
Ryan Blue
