For TINYINT and SMALLINT, I don't think there is any advantage at the storage layer. Avro uses variable-length ints and the columnar formats, Parquet and ORC, will do efficient encodings for multiple values in a column. I don't see much value in these types, besides compatibility with existing SQL.
Thanks for working on this in the Trino community as well! On Fri, Dec 10, 2021 at 10:58 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote: > Just to update this thread, we have agreed internally to use INT in the > struct schema corresponding to union types. The reasons are two-fold: > (1) Uncertainty around whether TINYINT will make it to Iceberg while we > wanted to stick to the spec. > (2) Since Avro does not support TINYINT either, this issue can be faced > independently of whether Iceberg supports TINYINT or not. > > That said, we had preliminary discussions with Dain Sundstrom and David > Phillips on the Trino representation of the schema, and we agree it is > directionally okay to change TINYINT to INT in the Trino version of the > conversion [1]. > > I will start another thread to check people's opinion on converting union > types to structs of that spec, and if we agree, we can start another > discussion with the Spark community. > > Thanks, > Walaa. > > [1] https://github.com/trinodb/trino/pull/3483 > > On Fri, Dec 3, 2021 at 11:49 AM Walaa Eldin Moustafa < > wa.moust...@gmail.com> wrote: > >> Also, wanted to add another observation that is on the flip side of the >> initial argument. Some data formats like Avro do not support TINYINT >> either. So even if Trino uses TINYINT in the struct schema, when the table >> is written back to Avro, TINYINT will not be used. This supports the side >> of the argument that Trino should use INT in >> https://github.com/trinodb/trino/pull/3483 to minimize incompatibility >> issues. Thoughts? >> >> On Thu, Dec 2, 2021 at 5:28 PM Walaa Eldin Moustafa < >> wa.moust...@gmail.com> wrote: >> >>> Hi all, >>> >>> I wanted to bring up the topic of supporting TINYINT and SMALLINT data >>> types in Iceberg to the dev mailing list. We have started some discussion >>> on the ASF Iceberg channel [1], and agreed to bring more visibility to the >>> topic here. >>> >>> This came up while we were implementing the equivalent of Trino's union >>> to struct conversion [2] in our internal fork of Iceberg, but we are facing >>> the issue that the output struct schema contains a field that is a TINYINT. >>> I am starting this thread to get people's opinion on adding TINYINT and >>> SMALLINT types to Iceberg. The reasons for the change are: >>> >>> * TINYINT (and SMALLINT) are supported by the Hive table format, and >>> consequently by engines like Trino and Spark. So adding them to Iceberg can >>> help with Iceberg adoption, and makes it easy for people to convert Hive >>> tables to Iceberg tables smoothly. >>> >>> * We have observed some regressions when trying to convert a table with >>> TINYINT to INT, especially with views that are defined on the TINYINT table >>> (see the Slack thread for details). >>> >>> * We want to converge the community to one form of transforming union >>> types to structs. Currently, the Trino Hive connector uses TINYINT in the >>> output schema, so we want to lock down a type there. Hopefully, that >>> becomes a standard across engines, and IRs. We are doing the same >>> conversion in Coral [3], but hoping to wrap our discussion here before >>> proceeding. >>> >>> * The more systems not agreeing on this conversion (to TINYINT or INT), >>> the more challenging it will be to fix this problem. Sometimes the schema >>> leaks transitively across tables and it becomes more difficult to retract, >>> so now is the best time to decide on this. >>> >>> That said, first, I am hoping to get an alignment on supporting TINYINT >>> and SMALLINT in Iceberg since this is a more fundamental issue, and has its >>> own justications. If we agree we can support TINYINT, the union type >>> standardization discussion would be simpler since we will not expect Trino >>> to change. >>> >>> Thanks, >>> Walaa. >>> >>> [1] https://the-asf.slack.com/archives/CF01LKV9S/p1637642089008400 >>> [2] https://github.com/trinodb/trino/pull/3483 >>> [3] https://github.com/linkedin/coral/pull/192 >>> >> -- Ryan Blue Tabular