Thanks Ryan for bringing this up.

> int and long to string


Could you elaborate a bit on how we can support type promotion for `int` and 
`long` to `string` if the upper and lower bounds are already encoded in 4/8 
bytes binary? It seems that we cannot add promotions to string as Piotr pointed 
out?


> My rationale for not adding new information to track the bound types at the 
> time that the data file metadata is created is that it would inflate the size 
> of manifests and push out the timeline for getting v3 done.

There might be an easy/light way to add this new metadata: we can persist 
schema_id in the DataFile. It still adds some extra size to the manifest file 
but should be negligible? 

And I think there’s also another aspect to consider: whether the new type 
promotion is compatible with partition transforms. Currently all the partition 
transforms produce the same result for promoted types: int -> long, float -> 
double. If we are adding new type promotion, current partition transforms will 
produce different result for type promotion such as int/long -> string, so the 
partition() of DataFile will not hold for promoted types. One possible way to 
fix that would be evolving the PartitionSpec with a new one?


> On Aug 17, 2024, at 07:00, Ryan Blue <b...@apache.org> wrote:
> 
> I’ve recently been working on updating the spec for new types and type 
> promotion cases in v3.
> I was talking to Micah and he pointed out an issue with type promotion: the 
> upper and lower bounds for data file columns that are kept in Avro manifests 
> don’t have any information about the type that was used to encode the bounds.
> For example, when writing to a table with a float column, 4: f, the 
> manifest’s lower_bounds and upper_bounds maps will have an entry with the 
> type ID (4) as the key and a 4-byte encoded float for the value. If column f 
> were later promoted to double, those maps aren’t changed. The way we 
> currently detect that the type was promoted is to check the binary value and 
> read it as a float if there are 4 bytes instead of 8. This prevents us from 
> adding int to double type promotion because when there are 4 bytes we would 
> not know whether the value was originally an int or a float.
> Several of the type promotion cases from my previous email hit this problem. 
> Date/time types to string, int and long to string, and long to timestamp are 
> all affected. I think the best path forward is to add fewer type promotion 
> cases to v3 and support only these new cases:
>     • int and long to string
>     • date to timestamp
>     • null/unknown to any
>     • any to variant (if supported by the Variant spec)
> That list would allow us to keep using the current strategy and not add new 
> metadata to track the type to our manifests. My rationale for not adding new 
> information to track the bound types at the time that the data file metadata 
> is created is that it would inflate the size of manifests and push out the 
> timeline for getting v3 done. Many of us would like to get v3 released to get 
> the timestamp_ns and variant types out. And if we can get at least some of 
> the promotion cases out that’s better.
> To address type promotion in the long term, I think that we should consider 
> moving to Parquet manifests. This has been suggested a few times so that we 
> can project just the lower and upper bounds that are needed for scan 
> planning. That would also fix type promotion because the manifest file schema 
> would include full type information for the stats columns. Given the 
> complexity of releasing Parquet manifests, I think it makes more sense to get 
> a few promotion cases done now in v3 and follow up with the rest in v4.
> Ryan
> 
> -- 
> Ryan Blue

Reply via email to