For TINYINT and SMALLINT, I don't think there is any advantage at the
storage layer. Avro uses variable-length ints and the columnar formats,
Parquet and ORC, will do efficient encodings for multiple values in a
column. I don't see much value in these types, besides compatibility with
existing SQL.

Thanks for working on this in the Trino community as well!

On Fri, Dec 10, 2021 at 10:58 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Just to update this thread, we have agreed internally to use INT in the
> struct schema corresponding to union types. The reasons are two-fold:
> (1) Uncertainty around whether TINYINT will make it to Iceberg while we
> wanted to stick to the spec.
> (2) Since Avro does not support TINYINT either, this issue can be faced
> independently of whether Iceberg supports TINYINT or not.
>
> That said, we had preliminary discussions with Dain Sundstrom and David
> Phillips on the Trino representation of the schema, and we agree it is
> directionally okay to change TINYINT to INT in the Trino version of the
> conversion [1].
>
> I will start another thread to check people's opinion on converting union
> types to structs of that spec, and if we agree, we can start another
> discussion with the Spark community.
>
> Thanks,
> Walaa.
>
> [1]  https://github.com/trinodb/trino/pull/3483
>
> On Fri, Dec 3, 2021 at 11:49 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Also, wanted to add another observation that is on the flip side of the
>> initial argument. Some data formats like Avro do not support TINYINT
>> either. So even if Trino uses TINYINT in the struct schema, when the table
>> is written back to Avro, TINYINT will not be used. This supports the side
>> of the argument that Trino should use INT in
>> https://github.com/trinodb/trino/pull/3483 to minimize incompatibility
>> issues. Thoughts?
>>
>> On Thu, Dec 2, 2021 at 5:28 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I wanted to bring up the topic of supporting TINYINT and SMALLINT data
>>> types in Iceberg to the dev mailing list. We have started some discussion
>>> on the ASF Iceberg channel [1], and agreed to bring more visibility to the
>>> topic here.
>>>
>>> This came up while we were implementing the equivalent of Trino's union
>>> to struct conversion [2] in our internal fork of Iceberg, but we are facing
>>> the issue that the output struct schema contains a field that is a TINYINT.
>>> I am starting this thread to get people's opinion on adding TINYINT and
>>> SMALLINT types to Iceberg. The reasons for the change are:
>>>
>>> * TINYINT (and SMALLINT) are supported by the Hive table format, and
>>> consequently by engines like Trino and Spark. So adding them to Iceberg can
>>> help with Iceberg adoption, and makes it easy for people to convert Hive
>>> tables to Iceberg tables smoothly.
>>>
>>> * We have observed some regressions when trying to convert a table with
>>> TINYINT to INT, especially with views that are defined on the TINYINT table
>>> (see the Slack thread for details).
>>>
>>> * We want to converge the community to one form of transforming union
>>> types to structs. Currently, the Trino Hive connector uses TINYINT in the
>>> output schema, so we want to lock down a type there. Hopefully, that
>>> becomes a standard across engines, and IRs. We are doing the same
>>> conversion in Coral [3], but hoping to wrap our discussion here before
>>> proceeding.
>>>
>>> * The more systems not agreeing on this conversion (to TINYINT or INT),
>>> the more challenging it will be to fix this problem. Sometimes the schema
>>> leaks transitively across tables and it becomes more difficult to retract,
>>> so now is the best time to decide on this.
>>>
>>> That said, first, I am hoping to get an alignment on supporting TINYINT
>>> and SMALLINT in Iceberg since this is a more fundamental issue, and has its
>>> own justications. If we agree we can support TINYINT, the union type
>>> standardization discussion would be simpler since we will not expect Trino
>>> to change.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> [1] https://the-asf.slack.com/archives/CF01LKV9S/p1637642089008400
>>> [2] https://github.com/trinodb/trino/pull/3483
>>> [3] https://github.com/linkedin/coral/pull/192
>>>
>>

-- 
Ryan Blue
Tabular

Reply via email to