Hi Ryan and all,

That sounds like a good reason to leave IP address types out. In my experience, dedicated IP address types are mostly found in logging tools and other tools for sysadmins, DevOps, etc.
When querying data with IP addresses, I’ve seen it done quite a lot (e.g., for security reasons), but the addresses are usually stored as strings or manipulated in a UDF. They’re not commonly supported types. I would also draw the line at UUID types.

- Kyle Bendickson

> On Jul 30, 2021, at 3:15 PM, Ryan Blue <b...@tabular.io> wrote:
>
> Jacques, you make some good points here. I think my argument about usability
> leading to performance issues is a stronger argument for engines than for
> Iceberg. Still, there are inefficiencies in Iceberg if someone chooses to use
> a string in an engine that doesn't have a UUID type.
>
> Another thing to consider is cross-engine support. If Iceberg removes UUID,
> then Trino would probably translate to fixed[16]. That results in a table
> that's difficult to query in other engines, where people would probably
> choose to store the data as a string. On the other hand, if Iceberg keeps the
> UUID type then integrations would simply translate to the UUID string
> representation before passing data to the other engines. While the engines
> would be using 36-byte values in join keys, the user experience issue is
> fixed and the data is more compact on disk and in Iceberg's bounds metadata.
>
> While having a UUID type in Iceberg can't really help engines that don't
> support UUID take advantage of the type at runtime, it does seem slightly
> better to have the UUID type in general since at least one engine supports it
> and it provides the expected user experience with a compact representation.
>
> IPv4 addresses are a good thing to think about as well, since most of the
> same arguments apply. If we keep the UUID type, should we also add IPv4 or
> IPv6 types? I would probably draw the line at UUID because it helps in joins,
> which are an important operation. IPv4 representations aren't that big of an
> inconvenience unless you need to do IP manipulation, which is typically in a
> UDF and not the query engine.
> And you can always keep both representations in a table fairly
> inexpensively. Does this sound like a valid rationale for having UUID but
> not IP types?
>
> Ryan
>
>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <jacquesnad...@gmail.com>
>> wrote:
>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native type.
>> Which engines are you thinking of that have a native UUID type besides the
>> Presto derivatives and support Iceberg?
>>
>> I agree that Trino should expose a UUID type on top of Iceberg tables. All
>> the user experience things that you are describing as important (compact
>> storage, friendly display, DDL, clean literals) are possible without it
>> being a first-class type in Iceberg using a Trino-specific property.
>>
>> I don't really have a strong opinion about UUID. In general, type bloat is
>> probably just a part of this kind of project. Generally, CHAR(X) and
>> VARCHAR(X) feel like much bigger concerns given that they exist in all of
>> the engines but not Iceberg--especially when we start talking about views.
>>
>> Some of this argues for physical vs logical type abstraction. (Something
>> that was always challenging in Parquet but also helped to resolve how these
>> types are managed in engines that don't support them.)
>>
>> thanks,
>> Jacques
>>
>> PS: Funny aside, the bloat on an IP address is actually worse than a UUID,
>> right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat. UUID
>> 36/16 => 125% bloat.
>>
>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <b...@tabular.io> wrote:
>>> I don't think this is just a problem in Trino.
>>>
>>> If there is no UUID type, then a user must choose between a 36-byte string
>>> and a 16-byte binary. That's not a good choice to force people into. If
>>> someone chooses binary, then it's harder to work with rows and construct
>>> queries even though there is a standard representation for UUIDs.
>>> To avoid the user headache, people will probably choose to store values
>>> as strings.
>>> Using a string would mean that more than half the value is needlessly
>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>> entire value. And since engines don't know what's in the string, the full
>>> value must be used in comparison, which is extra work and extra space.
>>>
>>> Inflated values may not be a problem in some cases. IPv4 addresses are one
>>> case where you could argue that it doesn't matter very much that they are
>>> typically stored as strings. But I expect the use of UUIDs to be common for
>>> ID columns because you can generate them without coordination (unlike an
>>> incrementing ID) and that's a concern because the use as an ID makes them
>>> likely to be join keys.
>>>
>>> If we want the values to be stored as 16-byte fixed, then we need to make
>>> it easy to get the expected string representation in and out, just like we
>>> do with date/time types. I don't think that's specific to any engine.
>>>
>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <jacquesnad...@gmail.com>
>>>> wrote:
>>>> I think points 1&2 don't really apply since a fixed width binary already
>>>> covers those properties.
>>>>
>>>> It seems like this isn't really a concern of Iceberg but rather a cosmetic
>>>> layer that exists primarily (only?) in Trino. In that case I would be
>>>> inclined to say that Trino should just use custom metadata and a fixed
>>>> binary type. That way you still have the desired UX without exposing those
>>>> extra concepts to Iceberg. It actually feels like better
>>>> encapsulation imo.
>>>>
>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <pi...@starburstdata.com>
>>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> I agree with Ryan that it takes some precautions before one can assume
>>>>> uniqueness of UUID values, and that this shouldn't be anything special for
>>>>> UUIDs at all.
>>>>> After all, this is just a primitive type, which is commonly used for
>>>>> certain things, but "commonly" doesn't mean "always".
>>>>>
>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>> The compact representation in the file, and compact representation in
>>>>> memory in the query engine are the ones mentioned above.
>>>>>
>>>>> The third layer is the usability. Seeing a UUID column I know what values
>>>>> I can expect, so it's more descriptive than `id char(36)`.
>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), .... without
>>>>> need for casting to varchar.
>>>>> It also removes the temptation of casting uuid to varbinary to achieve
>>>>> compact representation.
>>>>>
>>>>> Thus I think it would be good to have them.
>>>>>
>>>>> Best
>>>>> PF
>>>>>
>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>> The original reason why I added UUID to the spec was that I thought
>>>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>>>> fields and how we might do something similar in Iceberg.
>>>>>>
>>>>>> The reason we have thought about removing UUID is that there aren't as
>>>>>> many opportunities to take advantage of UUIDs as I thought. My original
>>>>>> assumption was that we could do things like bucket on UUID fields or
>>>>>> assume that a UUID field has a high NDV. But that's not necessarily the
>>>>>> case when a UUID field is a foreign key, only when it is used as an
>>>>>> identifier or primary key. Before Jack added tracking for row identifier
>>>>>> fields, we couldn't know that a UUID was unique in a table. As a result,
>>>>>> we didn't invest in support for UUID.
>>>>>>
>>>>>> Quick aside: Now that row identifier fields are tracked, we can do some
>>>>>> of these things with the row identifier fields.
>>>>>> Engines can assume that the tuple of row identifier fields is unique in
>>>>>> a table for join estimation. And engines can use row identifier fields
>>>>>> in sort keys to ensure lots of partition split locations (this is
>>>>>> really important for Spark).
>>>>>>
>>>>>> Coming back to UUIDs, the second reason to have a UUID type is still
>>>>>> valid: it is better to represent UUIDs as fixed[16] than as 36-byte
>>>>>> UTF-8 strings that are more than twice as large, or even worse UTF-16
>>>>>> strings that are 4x as large. Since UUIDs are likely to be used in
>>>>>> joins, this could really help engines as long as they can keep the
>>>>>> values as fixed-width binary.
>>>>>>
>>>>>> I could go either way on this. I think it is valuable to have a compact
>>>>>> representation for UUIDs rather than using the string representation.
>>>>>> But that will require investing in the type and building support in
>>>>>> engines that won't take advantage of it. If Trino can use this, I think
>>>>>> it may be worth keeping and investing in.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>> Yes I agree with Jacques that fixed binary is what it is in the end. I
>>>>>>> think it is more about user experience, whether the conversion is done
>>>>>>> at the user side or Iceberg and engine side. Many people just store
>>>>>>> UUID as a 36-byte string instead of a 16-byte binary, so with an
>>>>>>> explicit UUID type, Iceberg can optimize this common use case
>>>>>>> internally for users. There might be some other benefits I overlooked,
>>>>>>> but maybe the complication introduced by this type does not really
>>>>>>> justify the slightly better user experience. I am also on the fence
>>>>>>> about it.
>>>>>>>
>>>>>>> -Jack Ye
>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau
>>>>>>>> <jacquesnad...@gmail.com> wrote:
>>>>>>>> What specific arguments are there for it being a first-class type
>>>>>>>> besides that it exists elsewhere? Is there some kind of optimization
>>>>>>>> Iceberg or an engine could do if it was typed versus just a bucket of
>>>>>>>> bits? Fixed width binary seems to cover the cases I see in terms of
>>>>>>>> actual functionality in the Iceberg libraries or engines…
>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yyany...@gmail.com> wrote:
>>>>>>>>> One conversation I came across regarding UUID deprecation was
>>>>>>>>> in https://github.com/apache/iceberg/pull/1611
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yan
>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>> Hi Joshua,
>>>>>>>>>>
>>>>>>>>>> I do not have a strong preference about the UUID type, but I would
>>>>>>>>>> like to highlight that the type is handled inconsistently in
>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>
>>>>>>>>>> If we keep the type, it would be good to standardize the handling in
>>>>>>>>>> every file format.
>>>>>>>>>>
>>>>>>>>>> Thanks, Peter
>>>>>>>>>>
>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <joshthow...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> Hi.
>>>>>>>>>>>
>>>>>>>>>>> UUID is a current data type according to the Iceberg spec
>>>>>>>>>>> (https://iceberg.apache.org/spec/#primitive-types), but there seems
>>>>>>>>>>> to have been some discussion about removing it? I could not find
>>>>>>>>>>> the original discussion, but a reference to the discussion can be
>>>>>>>>>>> found here (https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>
>>>>>>>>>>> I generally agree with the consensus in the Trino issue to keep
>>>>>>>>>>> UUID in Iceberg.
>>>>>>>>>>> To summarize…
>>>>>>>>>>>
>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>>>> supported
>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>
>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>
>
> --
> Ryan Blue
> Tabular
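
For reference, the size figures quoted in this thread (16-byte fixed vs. 36-byte string for UUIDs, 4 bytes vs. up to 15 bytes for IPv4) can be checked with a short Python sketch. It uses only the standard-library uuid and ipaddress modules; the sample values and the bloat_percent helper are made up for illustration and are not part of Iceberg or any engine.

```python
import ipaddress
import uuid


def bloat_percent(string_size: int, binary_size: int) -> float:
    """Extra space used by the string form, as a percentage of the binary form."""
    return (string_size - binary_size) / binary_size * 100


# Hypothetical sample values, chosen only for illustration.
sample_uuid = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
sample_ip = ipaddress.IPv4Address("255.255.255.255")  # longest dotted-quad form

uuid_binary = sample_uuid.bytes                 # 16 raw bytes, as a fixed[16] column would hold
uuid_string = str(sample_uuid).encode("utf-8")  # canonical hyphenated text form

ip_binary = sample_ip.packed                    # 4 raw bytes
ip_string = str(sample_ip).encode("utf-8")      # dotted-quad text form

print(len(uuid_binary), len(uuid_string), bloat_percent(len(uuid_string), len(uuid_binary)))
print(len(ip_binary), len(ip_string), bloat_percent(len(ip_string), len(ip_binary)))

# The binary form converts back to the same canonical string.
assert str(uuid.UUID(bytes=uuid_binary)) == str(sample_uuid)
```

The percentages agree with the 125% and 275% figures in Jacques's aside. The IPv4 figure is a worst case, since most dotted-quad strings are shorter than 15 characters.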