Just caught up on this discussion, thanks all for the insights!

I largely agree with Ryan's point that a read-time transformation is
different from a normal type promotion, even though the two may seem
logically equivalent. The compelling part of leveraging structs for the
upper/lower bounds is that it makes the int/long to string promotion a good
overall solution that should be addressable with minimal work. At the same
time, I also agree with Micah's point that we should make sure we analyze
the implications of this particular int/long -> string promotion on sort
orders, Puffin files, and default values, and any special casing this may
introduce for those capabilities.

There are compelling arguments for not supporting long to timestamp
promotion at the moment. There's the bounds issue, and then, looking at
Micah's analysis of other systems, they don't really support long to
timestamp either (except for Postgres, where, if I understand right, users
have to specify exactly how to perform the conversion in the USING clause).
For now, Iceberg users who really want this behavior could create a view
until we have the prerequisite metadata and can reach a consensus on how to
handle the precision.

Thanks,
Amogh Jahagirdar


On Thu, Aug 22, 2024 at 3:14 PM Ryan Blue <b...@databricks.com.invalid>
wrote:

> Thanks for the discussion, everyone. I think the back and forth between
> Fokko and Micah helped me understand Micah's position more clearly. I can see
> how some of the challenges that I raised would be solved, like moving the
> previous field into the metadata of the transformation.
>
> I agree with a lot of what Dan said. I think we're really talking about
> different use cases. Type promotion allows you to change the type of a
> column, typically to accommodate more variety, like numbers larger than
> `INT_MAX`. In Iceberg, the model is that the current schema's type is
> compared with the type in a data file and a well-defined transformation
> occurs to convert the value. Iceberg type promotion should be a metadata
> operation and should not require rewriting data files (or, ideally,
> manifests).
>
> We _could_ solve the same problem by adding expressions and allowing those
> expressions in default values. Basically, we could support arbitrary
> read-time transformations and use them for this purpose. But those are
> different features. I think that approach would add a level of complexity
> that isn't needed right now and may or may not be needed long term.
>
> The main reason why I don't think that this is needed is that I think
> using structs for lower and upper bounds in manifests is a good solution.
> We will need to implement the columnar approach for Parquet manifests (or
> else you can't use columnar projection) so we will very likely do this work
> soon after v3. We could also do it in v3 if we want _more_ type promotion
> cases. That means introducing read-time transformations and using them for
> default values won't be necessary unless we want the transformation feature
> specifically.
>
> The last email also raises good questions about which type promotion cases
> to support. I agree that we should be cautious here. I think the
> controversial ones are long to timestamptz and long to string.
>
> For long to timestamptz, the use case is that many people wrote
> millisecond timestamps into long columns before timestamp was a broadly
> supported type. These were almost all millisecond precision because people
> used Java's `System.currentTimeMillis()`. The problem with long to
> timestamptz promotion is knowing the unit that was stored, and rather than
> introducing a way to specify the default unit somewhere, we decided the
> simplest solution was to assume milliseconds. But now this is moot because
> this promotion would break lower/upper bounds, so I'm proposing we drop this
> type promotion case.
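>
> To make the unit ambiguity concrete, here is a rough Java sketch (purely
> illustrative, not code from Iceberg): the same stored long decodes to very
> different instants depending on the assumed unit.
>
> import java.time.Instant;
> import java.time.temporal.ChronoUnit;
>
> public class EpochUnitAmbiguity {
>   public static void main(String[] args) {
>     long stored = 1_724_172_711_000L; // e.g. written via System.currentTimeMillis()
>     // Interpreted as milliseconds (the assumed default): 2024-08-20T16:51:51Z
>     System.out.println(Instant.ofEpochMilli(stored));
>     // Interpreted as microseconds instead: 1970-01-20T22:56:12.711Z
>     System.out.println(Instant.EPOCH.plus(stored, ChronoUnit.MICROS));
>   }
> }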
>
> For int/long to string, the primary use case is ID columns that change in
> an upstream database that is being mirrored into analytic storage. Since
> there is an issue with this case, I'm okay leaving it out if that's what we
> want to do as a community. We can add it later more safely.
>
> On Tue, Aug 20, 2024 at 10:14 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Thanks Dan,
>>
>> I'm pretty strongly opposed to the idea of assigning new field ids as
>>> part of type promotion.  I understand what we're trying to accomplish, but
>>> I just don't think that's the right mechanism to achieve it.  The field id
>>> specifically identifies the field and shouldn't change as
>>> attributes change (name, type, comment, etc.).
>>
>>
>> This seems to be the consensus. I'm OK with this, but I really think we
>> should be restricting type promotions to a smaller set (more on this
>> below). It still feels like it might be useful to explicitly track type
>> promotions in the current schema rather than having to refer to and parse
>> historical schemas at scan time.
>>
>> Historically, we've only allowed obvious and safe promotion (e.g. int to
>>> long).  I think some of the more "practical" cases like long to timestamp
>>> have made this more confusing because they seem arbitrary in terms of how
>>> we define the promotion (e.g. always interpret as ms) and may be confusing
>>> the situation because we want to accommodate typical use cases where people
>>> are migrating data without rewriting.
>>
>>
>> ms to timestamp does seem arbitrary, and as far as I can tell few if any
>> systems support this as a type conversion.
>>
>> FWIW, I did a quick survey of other technologies and I found the
>> following (I might have missed something or maybe there are more docs):
>>
>> Redshift
>> <https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html> [1]
>> - appears to only support widening varchar columns (and as far as I can
>> tell does not support any other type promotions).
>>
>> Snowflake
>> <https://docs.snowflake.com/en/sql-reference/sql/alter-table-column> [2]
>> - Only supports changes that are metadata only or strict type widening (no
>> conversion to string), with the exception that precision can be narrowed if
>> all data currently fits into the column.
>>
>> BigQuery
>> <https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_column_set_data_type_statement>
>>  [3] -
>> Only supports some numeric conversions (mostly widening) and also widening
>> of types (no conversion to string).
>>
>> Hive
>> <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-AllowedImplicitConversions>
>>  (by
>> default) [4] - Supports anything to string, and widening of numeric types.
>>
>> Delta Lake
>> <https://learn.microsoft.com/en-us/azure/databricks/delta/type-widening> [5]
>> - Numeric type widening (and date to timestampNTZ).  (no string conversion)
>>
>> Postgres - Implicitly supports any conversion supported by "assignment
>> casts" (I couldn't find a canonical list of built-in assignment casts, but
>> integer to string seems to be included here) or arbitrary conversion via a
>> "USING expression".  As a side note, IIUC, any column type change rewrites
>> the entire table eagerly in Postgres.
>>
>> I'm not convinced we want to make this more flexible or configurable in
>>> terms of how you can promote or what the representation is (e.g. different
>>> formats of date to string) as that introduces a lot of complexity and in
>>> most cases it is safer to require a rewrite when making complex
>>> transformations on the table.  I feel like those situations are better use
>>> cases for views and we shouldn't try to accommodate everything at the table
>>> level.
>>
>>
>> Where do you see the line between views and the table level?  Views can be
>> used to accomplish all type promotions, but certainly there is value in
>> having some built in, and supporting data migrations directly in the table
>> format.  What is the rubric we will use to decide when to add a new
>> promotion rule?
>>
>> IMO, based on the analysis above (and also for simplicity), I think
>> built-in promotions should be limited to either metadata-only operations
>> or widening that is a padding-only operation.
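>>
>> As a rough illustration of that line (a sketch only, not Iceberg code):
>> int -> long is a pure widening that never changes the value, while
>> int -> string re-encodes the value entirely, which is exactly what makes
>> the existing binary bounds ambiguous.
>>
>> public class WideningVsReencoding {
>>   public static void main(String[] args) {
>>     int intLowerBound = 42;
>>     long widened = intLowerBound;                        // int -> long: same value, nothing to rewrite
>>     String reencoded = Integer.toString(intLowerBound);  // int -> string: "42", a new representation
>>     System.out.println(widened + " vs " + reencoded);
>>   }
>> }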
>>
>> Therefore, FWIW, I'm -0.0 on supporting promotion to string.  I'd like to
>> validate that the restrictions we are putting in place actually make it
>> usable for its intended purpose and that we can actually identify all the
>> ancillary features that need to be special cased. So far we have discussed
>> stats and partition transforms.  It seems sort order, puffin files and
>> default values also need to be examined (and possibly other features?).  I
>> also think that we still run into problems (perhaps minor) that "parquet as
>> manifests" don't solve here if we ever need to account for transitive
>> promotions (e.g. Integer->Decimal->String).
>>
>> Also, FWIW, I'm -1.0 on being able to promote long as milliseconds to
>> timestamps. This seems like an esoteric and arbitrary decision for
>> something core to the table format.
>>
>> If we want to make backfilling easier for more drastic type changes, then
>> I think treating them as explicitly adding a new column with defaults based
>> on an old column makes more sense to me (I imagine this wouldn't be
>> supported with SET COLUMN TYPE DDL, but rather ADD COLUMN given the
>> consensus that implicitly changing the field ID is not a good idea).
>>
>> Thanks,
>> Micah
>>
>> [1] https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html
>> [2] https://docs.snowflake.com/en/sql-reference/sql/alter-table-column
>> [3]
>> https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_column_set_data_type_statement
>> [4]
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-AllowedImplicitConversions
>> [5]
>> https://learn.microsoft.com/en-us/azure/databricks/delta/type-widening
>>
>>
>> On Tue, Aug 20, 2024 at 12:59 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>>
>>> I'm pretty strongly opposed to the idea of assigning new field ids as
>>> part of type promotion.  I understand what we're trying to accomplish, but
>>> I just don't think that's the right mechanism to achieve it.  The field id
>>> specifically identifies the field and shouldn't change as
>>> attributes change (name, type, comment, etc.).
>>>
>>> Isolating this issue to the most basic level, we always have a file with
>>> a source physical/logical representation and a target representation
>>> defined by the current table schema.  If we enumerate the specific
>>> transforms between the source and the target in the spec, there is no need
>>> to have a way to represent those at the metadata level.  Historically,
>>> we've only allowed obvious and safe promotion (e.g. int to long).  I think
>>> some of the more "practical" cases like long to timestamp have made this
>>> more confusing because they seem arbitrary in terms of how we define
>>> the promotion (e.g. always interpret as ms) and may be confusing the
>>> situation because we want to accommodate typical use cases where people are
>>> migrating data without rewriting.
>>>
>>> I'm not convinced we want to make this more flexible or configurable in
>>> terms of how you can promote or what the representation is (e.g. different
>>> formats of date to string) as that introduces a lot of complexity and in
>>> most cases it is safer to require a rewrite when making complex
>>> transformations on the table.  I feel like those situations are better use
>>> cases for views and we shouldn't try to accommodate everything at the table
>>> level.
>>>
>>> The issue here is that we don't have appropriate fidelity in the
>>> metadata to map the stats to the current schema, so we can either introduce
>>> more metadata to close the gap or make partial progress until there's a
>>> better option (like moving to Parquet metadata with explicit type
>>> information at the stats level).
>>>
>>> In terms of keeping complexity low, I'd lean more towards restricting
>>> evolution (like with incompatible transforms) than trying to accommodate
>>> those use cases with a more complicated representation.
>>>
>>> Best,
>>> -Dan
>>>
>>>
>>> On Tue, Aug 20, 2024 at 10:14 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> Hi Fokko,
>>>>
>>>>
>>>>> In this case, we still need to keep the schemas. As an example:
>>>>
>>>> The example you gave is close to what I was imagining (if we get to
>>>> details I might have a slightly different organization).  This might be
>>>> semantic, but I don't see this as keeping "schemas", since all data is
>>>> present in the current schema, and not in historical schemas (i.e. schemas
>>>> with different IDs).
>>>>
>>>> My main concern is that this can explode in complexity, especially when
>>>>> we end up with cycles: date → timestamp → string → datetime. With custom
>>>>> formatting it is also easy to go: timestamp → string → timestamptz.
>>>>
>>>>
>>>> To be clear, I'm not advocating for introducing all of these elements,
>>>> but I think leaving the door open for them is useful.  On the topic of
>>>> complexity, here is an example of where having explicit lineage modelled is
>>>> useful:
>>>>
>>>> long->timestamp->string (timestamp to string is not proposed yet but
>>>> has come up in private conversations).
>>>>
>>>> I assume we want timestamp->string to produce a human-readable
>>>> timestamp?  If we don't store explicit lineage, what will the process be to
>>>> ensure we get "2024-08-20T09:51:51-07:00" instead of "1724172711000" if we
>>>> encounter a file written when the data was a "long"?
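>>>>
>>>> For instance (a rough sketch, not actual Iceberg code), the same stored
>>>> long renders as two very different strings depending on whether the
>>>> intermediate timestamp step is known:
>>>>
>>>> import java.time.Instant;
>>>>
>>>> public class LineageMatters {
>>>>   public static void main(String[] args) {
>>>>     long raw = 1_724_172_711_000L; // from a file written when the column was long
>>>>     // Without lineage, all we know is that the current type is string:
>>>>     System.out.println(Long.toString(raw));        // "1724172711000"
>>>>     // With long -> timestamp -> string lineage recorded, we could render:
>>>>     System.out.println(Instant.ofEpochMilli(raw)); // "2024-08-20T16:51:51Z"
>>>>   }
>>>> }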
>>>>
>>>>
>>>> I agree, and this is also mentioned in the proposal. But each
>>>>> granularity has different types, eg. timestamp, timestamp_ns. So this
>>>>> is not really a problem.
>>>>
>>>>
>>>> Sorry I don't think I've seen the proposal (
>>>> https://github.com/apache/iceberg/issues/9065 is light on details and
>>>> maybe my mailing list searches are failing to turn up anything on dev@).
>>>> Given that TIMESTAMP_NS didn't exist prior to V3, I'd expect at least some
>>>> people have been storing this data as long and might want to convert it to
>>>> a proper timestamp column (assuming the long is in MILLISECONDs appears to
>>>> preclude this)?  Similarly, it seems like we would be ruling out casting
>>>> "seconds".
>>>>
>>>> I think that's fair <https://github.com/apache/iceberg/pull/5151>, but
>>>>> believe we should be opinionated about how we store data. Would we be able
>>>>> to write hex-encoded strings?
>>>>
>>>>
>>>> The issue here is that binary->string (UTF-8) can fail and should
>>>> ideally be handled; hex-encoded strings cannot fail.  Right now I believe
>>>> there is again ambiguity about whether we would apply bytes->string for
>>>> default values, since we don't have enough metadata at the moment to
>>>> distinguish between these two interpretations in the JSON-serialized
>>>> initial value (currently bytes are written as hex-encoded values, so if we
>>>> read a file that did not have the column present, it would surface the
>>>> hex-encoded bytes and not Avro's/an in-place cast interpretation of the
>>>> data).
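>>>>
>>>> A small sketch of that ambiguity (illustrative only, not how Iceberg
>>>> serializes anything today): the same bytes have two reasonable string
>>>> interpretations, and only the UTF-8 one can run into invalid input.
>>>>
>>>> import java.nio.charset.StandardCharsets;
>>>>
>>>> public class BytesToStringChoices {
>>>>   public static void main(String[] args) {
>>>>     byte[] raw = {0x69, 0x64, 0x31};  // happens to be valid UTF-8
>>>>     // In-place cast: "id1" here, but arbitrary bytes may not be valid UTF-8
>>>>     // (this constructor silently replaces invalid sequences).
>>>>     System.out.println(new String(raw, StandardCharsets.UTF_8));
>>>>     // Hex encoding: always succeeds, but yields "696431" instead.
>>>>     StringBuilder hex = new StringBuilder();
>>>>     for (byte b : raw) {
>>>>       hex.append(String.format("%02x", b));
>>>>     }
>>>>     System.out.println(hex);
>>>>   }
>>>> }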
>>>>
>>>> I'm hesitant to introduce this complexity since I think for example the
>>>>> formatting or parsing of custom formats should be done in the engine and
>>>>> not the format.
>>>>
>>>>
>>>> Again, I'm not suggesting we need to introduce complexity.  Given the
>>>> examples raised above I think there is already an issue with transforms in
>>>> scope.  The proposed solution of new field IDs and explicit lineage solves
>>>> short term problems, and if we wish gives more flexibility in the long run.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 20, 2024 at 4:32 AM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>>
>>>>> Yes, I was thinking it would be a recursive structure that tracked
>>>>>> each change.  Cleanup could be achieved by also tracking schema ID of the
>>>>>> last time the field was present along with the schema ID of the written
>>>>>> data files in manifests (as discussed on the other threads), and cleaning
>>>>>> up lineage when no data files were written in the schema.
>>>>>>
>>>>>
>>>>> In this case, we still need to keep the schemas. As an example:
>>>>>
>>>>> {"name": "col_1", "field-id": 1, "type": int}
>>>>>
>>>>> Schema after type alteration ("alter col_1 set type long"):
>>>>>
>>>>> {"name": "col_1", "field-id": 2, "initial-default": {
>>>>>    "function_name": to_long?
>>>>>    "input_argument": {
>>>>>        "column_id": 1,
>>>>>        "column_type": int
>>>>>    }
>>>>> }
>>>>> }
>>>>>
>>>>> Schema after type alteration ("alter col_1 set type string"):
>>>>>
>>>>> {"name": "col_1", "field-id": 3, "initial-default": {
>>>>>    "function_name": to_string
>>>>>    "input_argument": {
>>>>>        "column_id": 2,
>>>>>        "column_type": long
>>>>>    }
>>>>> }
>>>>> }
>>>>>
>>>>> If I understand correctly, you're suggesting:
>>>>>
>>>>> {"name": "col_1", "field-id": 3, "initial-default": {
>>>>>    "function_name": to_string
>>>>>    "input_argument": {
>>>>>        "column_id": 2,
>>>>>        "column_type": {
>>>>>            "function_name": to_long?
>>>>>            "input_argument": {
>>>>>                "column_id": 1,
>>>>>                "column_type": int
>>>>>            }
>>>>>        }
>>>>>    }
>>>>> }
>>>>> }
>>>>>
>>>>> My main concern is that this can explode in complexity, especially
>>>>> when we end up with cycles: date → timestamp → string → datetime. With
>>>>> custom formatting it is also easy to go: timestamp → string → timestamptz.
>>>>>
>>>>> *  There is already an existing proposal
>>>>>> <https://github.com/apache/iceberg/issues/9065> to promote from
>>>>>> Long->Timestamp [2] which assumes milliseconds.  This seems like an
>>>>>> arbitrary choice, where one could reasonably have other time 
>>>>>> granularities
>>>>>> stored in Long values.
>>>>>
>>>>>
>>>>> I agree, and this is also mentioned in the proposal. But each
>>>>> granularity has different types, eg. timestamp, timestamp_ns. So this
>>>>> is not really a problem.
>>>>>
>>>>> *  Will "bytes to string" be a candidate? There are two reasonable
>>>>>> approaches for underlying data either using hex-encoded value or doing an
>>>>>> in-place conversion assuming all data is UTF-8?
>>>>>
>>>>>
>>>>> I think that's fair <https://github.com/apache/iceberg/pull/5151>,
>>>>> but believe we should be opinionated about how we store data. Would we be
>>>>> able to write hex-encoded strings?
>>>>>
>>>>> I'd argue that once a schema is going from "any type"->"string",
>>>>>> something  was fairly wrong with data modelling initially, providing more
>>>>>> tools to help users fix these types of issues seems beneficial in the 
>>>>>> long
>>>>>> run (again not something that needs to be done now but laying the
>>>>>> ground-work is useful).
>>>>>
>>>>>
>>>>> Similar to the point above we should be opinionated about this. For
>>>>> example, historically we've been parsing dates strictly, as an example, 
>>>>> see
>>>>> DateTimeUtil
>>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java>.
>>>>>
>>>>>
>>>>> I'm hesitant to introduce this complexity since I think for example
>>>>> the formatting or parsing of custom formats should be done in the engine
>>>>> and not the format.
>>>>>
>>>>> Kind regards,
>>>>> Fokko Driesprong
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 20, 2024 at 05:58, Micah Kornfield <
>>>>> emkornfi...@gmail.com> wrote:
>>>>>
>>>>>> Hi Xiangjin,
>>>>>>
>>>>>>  Could you elaborate a bit more on how the Parquet manifest would fix
>>>>>>> the type promotion problem? If the lower and upper bounds are still a 
>>>>>>> map
>>>>>>> of <int, binary>, I don't think we can perform column pruning on that, 
>>>>>>> and
>>>>>>> the type information of the stat column is still missing.
>>>>>>
>>>>>>
>>>>>> I think the idea with Parquet files is one would no longer use a map
>>>>>> to track these statistics but instead have a column per
>>>>>> field-id/statistics pair.  So for each column in the original schema you
>>>>>> would have either:
>>>>>>
>>>>>> struct col1_stats {
>>>>>>    int min_value;
>>>>>>    int max_value;
>>>>>>    long null_count,
>>>>>>    etc.
>>>>>> }
>>>>>>
>>>>>> or a struct per statistic with each column:
>>>>>>
>>>>>> struct min_stats {
>>>>>>    int col1;
>>>>>>    string col2;
>>>>>>   etc
>>>>>> }
>>>>>>
>>>>>> This is similar to how partition values are stored today in Avro.
>>>>>> And I don't think there is anything stopping us from doing this in Avro
>>>>>> either, except it is potentially less useful because you can't save much
>>>>>> by selecting only columns in Avro.
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 19, 2024 at 7:50 PM xianjin <xian...@apache.org> wrote:
>>>>>>
>>>>>>> Hey Ryan,
>>>>>>>
>>>>>>> Thanks for the reply, it clears most things up. Some responses
>>>>>>> inline:
>>>>>>>
>>>>>>> > This ends up being a little different because we can detect more
>>>>>>> cases when the bounds must have been strings — any time when the length
>>>>>>> of the upper and lower bound is different. Because strings tend to have
>>>>>>> longer values (we truncate to 16 chars by default) and are not typically
>>>>>>> a fixed number of bytes, we don’t expect very many cases in which you’d
>>>>>>> have to discard the string bounds.
>>>>>>>
>>>>>>> > That said, if we want to be safe we could also wait to introduce
>>>>>>> int and long to string promotion, or put it behind a flag to opt in.
>>>>>>>
>>>>>>> Thanks for the clarification. I would prefer to put it behind a flag
>>>>>>> to opt in, considering the complexity it involves and the potential
>>>>>>> interest from users.
>>>>>>>
>>>>>>> > We’ve discussed this topic before and the consensus has been to
>>>>>>> allow type promotion if there is no partition field that would be 
>>>>>>> affected.
>>>>>>> Here’s the wording I’ve added in the PR:
>>>>>>>
>>>>>>> Type promotion is not allowed for a field that is referenced by
>>>>>>> source-id or source-ids of a partition field if the partition
>>>>>>> transform would produce a different value after promoting the type.
>>>>>>>
>>>>>>> > I think that covers the cases.
>>>>>>>
>>>>>>> Yes, I think it's indeed a good solution.
>>>>>>>
>>>>>>>> > There might be an easy/light way to add this new metadata: we
>>>>>>>> can persist schema_id in the DataFile.
>>>>>>>
>>>>>>> > Fokko already covered some of the challenges here and I agree with
>>>>>>>> those. I think that we want to be able to discard older schemas or not 
>>>>>>>> send
>>>>>>>> them to clients. And I think that since we can fix this in the next 
>>>>>>>> rev of
>>>>>>>> the spec, adding metadata just for this purpose is not a good 
>>>>>>>> strategy. I’d
>>>>>>>> much rather work on the long-term fix than add metadata and complicated
>>>>>>>> logic to pass schemas through and detect the type.
>>>>>>>
>>>>>>> > I think it might be worthy as we can always rewrite old data
>>>>>>>> files to use the latest schema.
>>>>>>>
>>>>>>> > I agree, but this doesn’t solve the problem with the current
>>>>>>>> manifest files. This is something I would introduce when we add Parquet
>>>>>>>> manifest files.
>>>>>>>
>>>>>>>
>>>>>>> Could you elaborate a bit more on how the Parquet manifest would
>>>>>>> fix the type promotion problem? If the lower and upper bounds are still
>>>>>>> a map of <int, binary>, I don't think we can perform column pruning on
>>>>>>> that, and the type information of the stat column is still missing.
>>>>>>>
>>>>>>> On Tue, Aug 20, 2024 at 1:02 AM Ryan Blue
>>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>
>>>>>>>> If the reader logic depends on the length of data in bytes, will
>>>>>>>> this prevent us from adding any type promotions to string?
>>>>>>>>
>>>>>>>> This ends up being a little different because we can detect more
>>>>>>>> cases when the bounds must have been strings — any time when the length
>>>>>>>> of the upper and lower bound is different. Because strings tend to have
>>>>>>>> longer values (we truncate to 16 chars by default) and are not typically
>>>>>>>> a fixed number of bytes, we don’t expect very many cases in which you’d
>>>>>>>> have to discard the string bounds.
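>>>>>>>>
>>>>>>>> As a rough sketch of that detection (a hypothetical helper, not
>>>>>>>> Iceberg's actual reader code): bounds written for a long (or int) column
>>>>>>>> always have a fixed length, so differing lengths prove the bounds were
>>>>>>>> written as strings, while equal lengths stay ambiguous and would be
>>>>>>>> discarded.
>>>>>>>>
>>>>>>>> class BoundTypeCheck {
>>>>>>>>   static boolean boundsMustBeStrings(byte[] lowerBound, byte[] upperBound) {
>>>>>>>>     // Fixed-width numeric bounds always serialize to the same length,
>>>>>>>>     // so a length difference can only come from string bounds.
>>>>>>>>     return lowerBound.length != upperBound.length;
>>>>>>>>   }
>>>>>>>> }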
>>>>>>>>
>>>>>>>> That said, if we want to be safe we could also wait to introduce
>>>>>>>> int and long to string promotion, or put it behind a flag to opt in.
>>>>>>>>
>>>>>>>> There might be an easy/light way to add this new metadata: we can
>>>>>>>> persist schema_id in the DataFile.
>>>>>>>>
>>>>>>>> Fokko already covered some of the challenges here and I agree with
>>>>>>>> those. I think that we want to be able to discard older schemas or not 
>>>>>>>> send
>>>>>>>> them to clients. And I think that since we can fix this in the next 
>>>>>>>> rev of
>>>>>>>> the spec, adding metadata just for this purpose is not a good 
>>>>>>>> strategy. I’d
>>>>>>>> much rather work on the long-term fix than add metadata and complicated
>>>>>>>> logic to pass schemas through and detect the type.
>>>>>>>>
>>>>>>>> I think there’s also another aspect to consider: whether the new
>>>>>>>> type promotion is compatible with partition transforms
>>>>>>>>
>>>>>>>> We’ve discussed this topic before and the consensus has been to
>>>>>>>> allow type promotion if there is no partition field that would be 
>>>>>>>> affected.
>>>>>>>> Here’s the wording I’ve added in the PR:
>>>>>>>>
>>>>>>>> Type promotion is not allowed for a field that is referenced by
>>>>>>>> source-id or source-ids of a partition field if the partition
>>>>>>>> transform would produce a different value after promoting the type.
>>>>>>>>
>>>>>>>> I think that covers the cases.
>>>>>>>>
>>>>>>>> I think it might be worthy as we can always rewrite old data files
>>>>>>>> to use the latest schema.
>>>>>>>>
>>>>>>>> I agree, but this doesn’t solve the problem with the current
>>>>>>>> manifest files. This is something I would introduce when we add Parquet
>>>>>>>> manifest files.
>>>>>>>>
>>>>>>>> On Mon, Aug 19, 2024 at 9:26 AM Amogh Jahagirdar <2am...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hey all,
>>>>>>>>>
>>>>>>>>> > There might be an easy/light way to add this new metadata: we
>>>>>>>>> can persist schema_id in the DataFile. It still adds some extra size 
>>>>>>>>> to the
>>>>>>>>> manifest file but should be negligible?
>>>>>>>>>
>>>>>>>>> I do think it's probably negligible in terms of size (at least relative
>>>>>>>>> to the value that we get out of having that field), but I think the main
>>>>>>>>> downside of that approach is that it's still an additional spec change;
>>>>>>>>> if we go down the route of Parquet manifests, that field largely becomes
>>>>>>>>> moot/throw-away work at that point since the type information would
>>>>>>>>> exist in Parquet.
>>>>>>>>>
>>>>>>>>> > And I think there’s also another aspect to consider: whether the
>>>>>>>>> new type promotion is compatible with partition transforms. Currently 
>>>>>>>>> all
>>>>>>>>> the partition transforms produce the same result for promoted types: 
>>>>>>>>> int ->
>>>>>>>>> long, float -> double. If we are adding a new type promotion, current
>>>>>>>>> partition transforms will produce different results for type 
>>>>>>>>> promotion such
>>>>>>>>> as int/long -> string, so the partition() of DataFile will not hold 
>>>>>>>>> for
>>>>>>>>> promoted types. One possible way to fix that would be evolving the
>>>>>>>>> PartitionSpec with a new one?
>>>>>>>>>
>>>>>>>>> Yes, an important part of type promotion is validation that
>>>>>>>>> whatever evolution is being attempted can actually happen if the 
>>>>>>>>> column
>>>>>>>>> being evolved is part of a partition transform! I was working on an
>>>>>>>>> implementation for this and so far it's just a strict check that the 
>>>>>>>>> field
>>>>>>>>> being promoted is not part of the partition spec. The proposed way of
>>>>>>>>> evolving the spec sounds plausible specifically for int/long -> 
>>>>>>>>> string,
>>>>>>>>> although I'd need to double check/think through implications.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Aug 19, 2024 at 8:43 AM Xianjin YE <xian...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Fokko,
>>>>>>>>>>
>>>>>>>>>> > Distribute all the schemas to the executors, and we have to do
>>>>>>>>>> the lookup and comparison there.
>>>>>>>>>>
>>>>>>>>>> I don’t think this would be a problem: the schema id in the DataFile
>>>>>>>>>> should only be used in the driver’s planning phase to determine the
>>>>>>>>>> lower/upper bounds, so no extra schema except the current one should be
>>>>>>>>>> distributed to the executor. Even if all schemas are required, they can
>>>>>>>>>> be retrieved from SerializableTable’s lazyTable() method.
>>>>>>>>>>
>>>>>>>>>> > Not being able to prune old schema until they are not used
>>>>>>>>>> anymore (including all historical snapshots).
>>>>>>>>>>
>>>>>>>>>> That’s indeed a problem. However, we haven’t added the ability to
>>>>>>>>>> prune unused schemas yet (which I would like to add too, after
>>>>>>>>>> RemoveUnusedSpecs); we can consider that when implementing. BTW, I think
>>>>>>>>>> it might be worthy as we can always rewrite old data files to use the
>>>>>>>>>> latest schema.
>>>>>>>>>>
>>>>>>>>>> On Aug 19, 2024, at 22:19, Fokko Driesprong <fo...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks Ryan for bringing this up, that's an
>>>>>>>>>> interesting problem, let me think about this.
>>>>>>>>>>
>>>>>>>>>> we can persist schema_id in the DataFile
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This was also my first thought. The two drawbacks are:
>>>>>>>>>>
>>>>>>>>>>    - Distribute all the schemas to the executors, and we have to
>>>>>>>>>>    do the lookup and comparison there.
>>>>>>>>>>    - Not being able to prune old schema until they are not used
>>>>>>>>>>    anymore (including all historical snapshots).
>>>>>>>>>>
>>>>>>>>>> If we are adding new type promotion, current partition transforms
>>>>>>>>>>> will produce different result for type promotion such as int/long ->
>>>>>>>>>>> string, so the partition() of DataFile will not hold for promoted 
>>>>>>>>>>> types.
>>>>>>>>>>> One possible way to fix that would be evolving the PartitionSpec 
>>>>>>>>>>> with a new
>>>>>>>>>>> one?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That's a good call! Currently, we create partition spec
>>>>>>>>>> evaluators based on the partition-spec-id. Evolving the partition 
>>>>>>>>>> spec
>>>>>>>>>> would fix it. When we decide to include the schema-id, we would be 
>>>>>>>>>> able to
>>>>>>>>>> create the evaluator based on the (partition-spec-id, schema-id) 
>>>>>>>>>> tuple when
>>>>>>>>>> evaluating the partitions.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Fokko
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 19, 2024 at 15:59, Xianjin YE <xian...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Ryan for bringing this up.
>>>>>>>>>>>
>>>>>>>>>>> > int and long to string
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Could you elaborate a bit on how we can support type promotion
>>>>>>>>>>> for `int` and `long` to `string` if the upper and lower bounds are 
>>>>>>>>>>> already
>>>>>>>>>>> encoded in 4/8 bytes binary? It seems that we cannot add promotions 
>>>>>>>>>>> to
>>>>>>>>>>> string as Piotr pointed out?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> > My rationale for not adding new information to track the
>>>>>>>>>>> bound types at the time that the data file metadata is created is 
>>>>>>>>>>> that it
>>>>>>>>>>> would inflate the size of manifests and push out the timeline for 
>>>>>>>>>>> getting
>>>>>>>>>>> v3 done.
>>>>>>>>>>>
>>>>>>>>>>> There might be an easy/light way to add this new metadata: we
>>>>>>>>>>> can persist schema_id in the DataFile. It still adds some extra 
>>>>>>>>>>> size to the
>>>>>>>>>>> manifest file but should be negligible?
>>>>>>>>>>>
>>>>>>>>>>> And I think there’s also another aspect to consider: whether the
>>>>>>>>>>> new type promotion is compatible with partition transforms. 
>>>>>>>>>>> Currently all
>>>>>>>>>>> the partition transforms produce the same result for promoted 
>>>>>>>>>>> types: int ->
>>>>>>>>>>> long, float -> double. If we are adding a new type promotion, current
>>>>>>>>>>> partition transforms will produce different results for type promotion
>>>>>>>>>>> such as int/long -> string, so the partition() of a DataFile will not
>>>>>>>>>>> hold for promoted types. One possible way to fix that would be evolving
>>>>>>>>>>> the PartitionSpec with a new one?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Aug 17, 2024, at 07:00, Ryan Blue <b...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I’ve recently been working on updating the spec for new types
>>>>>>>>>>> and type promotion cases in v3.
>>>>>>>>>>> I was talking to Micah and he pointed out an issue with type
>>>>>>>>>>> promotion: the upper and lower bounds for data file columns that 
>>>>>>>>>>> are kept
>>>>>>>>>>> in Avro manifests don’t have any information about the type that 
>>>>>>>>>>> was used
>>>>>>>>>>> to encode the bounds.
>>>>>>>>>>> For example, when writing to a table with a float column, 4: f,
>>>>>>>>>>> the manifest’s lower_bounds and upper_bounds maps will have an 
>>>>>>>>>>> entry with
>>>>>>>>>>> the type ID (4) as the key and a 4-byte encoded float for the 
>>>>>>>>>>> value. If
>>>>>>>>>>> column f were later promoted to double, those maps aren’t changed. 
>>>>>>>>>>> The way
>>>>>>>>>>> we currently detect that the type was promoted is to check the 
>>>>>>>>>>> binary value
>>>>>>>>>>> and read it as a float if there are 4 bytes instead of 8. This 
>>>>>>>>>>> prevents us
>>>>>>>>>>> from adding int to double type promotion because when there are 4 
>>>>>>>>>>> bytes we
>>>>>>>>>>> would not know whether the value was originally an int or a float.
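>>>>>>>>>>>
>>>>>>>>>>> For example, a reader handling a double column that may have been
>>>>>>>>>>> promoted from float effectively does something like the following (an
>>>>>>>>>>> illustrative sketch only, assuming the spec's little-endian single-value
>>>>>>>>>>> encoding; not the actual implementation):
>>>>>>>>>>>
>>>>>>>>>>> import java.nio.ByteBuffer;
>>>>>>>>>>> import java.nio.ByteOrder;
>>>>>>>>>>>
>>>>>>>>>>> class PromotedBoundReader {
>>>>>>>>>>>   static double readDoubleBound(byte[] bound) {
>>>>>>>>>>>     ByteBuffer buf = ByteBuffer.wrap(bound).order(ByteOrder.LITTLE_ENDIAN);
>>>>>>>>>>>     // 4 bytes can only be a float written before promotion; 8 bytes is a double.
>>>>>>>>>>>     // int -> double would break this: a 4-byte bound could be an int or a float.
>>>>>>>>>>>     return bound.length == 4 ? buf.getFloat() : buf.getDouble();
>>>>>>>>>>>   }
>>>>>>>>>>> }
>>>>>>>>>>>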
>>>>>>>>>>> Several of the type promotion cases from my previous email hit
>>>>>>>>>>> this problem. Date/time types to string, int and long to string, 
>>>>>>>>>>> and long
>>>>>>>>>>> to timestamp are all affected. I think the best path forward is to 
>>>>>>>>>>> add
>>>>>>>>>>> fewer type promotion cases to v3 and support only these new cases:
>>>>>>>>>>> • int and long to string
>>>>>>>>>>> • date to timestamp
>>>>>>>>>>> • null/unknown to any
>>>>>>>>>>> • any to variant (if supported by the Variant spec)
>>>>>>>>>>> That list would allow us to keep using the current strategy and
>>>>>>>>>>> not add new metadata to track the type to our manifests. My 
>>>>>>>>>>> rationale for
>>>>>>>>>>> not adding new information to track the bound types at the time 
>>>>>>>>>>> that the
>>>>>>>>>>> data file metadata is created is that it would inflate the size of
>>>>>>>>>>> manifests and push out the timeline for getting v3 done. Many of us 
>>>>>>>>>>> would
>>>>>>>>>>> like to get v3 released to get the timestamp_ns and variant types 
>>>>>>>>>>> out. And
>>>>>>>>>>> if we can get at least some of the promotion cases out that’s 
>>>>>>>>>>> better.
>>>>>>>>>>> To address type promotion in the long term, I think that we
>>>>>>>>>>> should consider moving to Parquet manifests. This has been 
>>>>>>>>>>> suggested a few
>>>>>>>>>>> times so that we can project just the lower and upper bounds that 
>>>>>>>>>>> are
>>>>>>>>>>> needed for scan planning. That would also fix type promotion 
>>>>>>>>>>> because the
>>>>>>>>>>> manifest file schema would include full type information for the 
>>>>>>>>>>> stats
>>>>>>>>>>> columns. Given the complexity of releasing Parquet manifests, I 
>>>>>>>>>>> think it
>>>>>>>>>>> makes more sense to get a few promotion cases done now in v3 and 
>>>>>>>>>>> follow up
>>>>>>>>>>> with the rest in v4.
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Databricks
>>>>>>>>
>>>>>>>
>
> --
> Ryan Blue
> Databricks
>
