Re: [DISCUSS] UUID type

Kyle B Sun, 01 Aug 2021 21:22:50 -0700

Hi Ryan and all,

That seems like a reasonable rationale for leaving IP address types out. In my 
experience, dedicated IP address types are mostly found in logging tools and 
other things for sysadmins / DevOps etc.


I’ve seen queries over IP address data quite often (e.g. for security 
reasons), but the addresses are usually stored as strings or manipulated in a 
UDF. Dedicated IP address types are not commonly supported.
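
To make the "stored as string or manipulated in a UDF" point concrete, here is a minimal sketch using only Python's standard `ipaddress` module. The helper names `ipv4_sizes` and `in_subnet` are made up for illustration and are not part of Iceberg or any engine; `in_subnet` only stands in for the kind of logic that usually ends up in a UDF.

```python
import ipaddress


def ipv4_sizes(text):
    """Return (string_bytes, packed_bytes) for a dotted-quad IPv4 address."""
    addr = ipaddress.IPv4Address(text)
    # The textual form needs up to 15 bytes; the packed form is always 4.
    return len(text.encode("utf-8")), len(addr.packed)


def in_subnet(text, cidr):
    """Hypothetical UDF-style check: is the address inside the CIDR block?"""
    return ipaddress.ip_address(text) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    print(ipv4_sizes("255.255.255.255"))
    print(in_subnet("10.1.2.3", "10.0.0.0/8"))
```

Storing the string and doing the manipulation in a function like this works without any dedicated type, which is why the extra bytes rarely hurt in practice.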

I would also draw the line at UUID types.
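
For reference, the size trade-off debated in the quoted thread can be sketched with Python's standard `uuid` module. This is only an illustration under stated assumptions: the helper names are hypothetical, and the 16-character truncation length is an assumption chosen to match the "more than half the value is discarded" remark below, not something taken from the spec.

```python
import uuid


def uuid_sizes(value):
    """Return (string_bytes, binary_bytes) for one UUID."""
    return len(str(value).encode("utf-8")), len(value.bytes)


def overhead_percent(string_bytes, binary_bytes):
    """Extra space used by the string form, relative to the binary form."""
    return (string_bytes - binary_bytes) / binary_bytes * 100


def truncated_bound(text, length=16):
    """Hypothetical stand-in for truncating a string bound (length assumed)."""
    return text[:length]


if __name__ == "__main__":
    u = uuid.uuid4()
    s, b = uuid_sizes(u)
    print(s, b, overhead_percent(s, b))
    # A truncated string bound keeps fewer than half of the 36 characters.
    print(truncated_bound(str(u)))
    # The 16-byte form round-trips to the standard string representation.
    assert uuid.UUID(bytes=u.bytes) == u
```

The same `overhead_percent` helper applied to a 15-byte IPv4 string versus its 4-byte form reproduces the larger relative bloat mentioned in the PS further down the thread.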

- Kyle Bendickson

> On Jul 30, 2021, at 3:15 PM, Ryan Blue <[email protected]> wrote:
> 
> 
> Jacques, you make some good points here. I think my argument about usability 
> leading to performance issues is a stronger argument for engines than for 
> Iceberg. Still, there are inefficiencies in Iceberg if someone chooses to use 
> a string in an engine that doesn't have a UUID type.
> 
> Another thing to consider is cross-engine support. If Iceberg removes UUID, 
> then Trino would probably translate to fixed[16]. That results in a table 
> that's difficult to query in other engines, where people would probably 
> choose to store the data as a string. On the other hand, if Iceberg keeps the 
> UUID type then integrations would simply translate to the UUID string 
> representation before passing data to the other engines. While the engines 
> would be using 36-byte values in join keys, the user experience issue is 
> fixed and the data is more compact on disk and in Iceberg's bounds metadata.
> 
> While having a UUID type in Iceberg can't really help engines that don't 
> support UUID take advantage of the type at runtime, it does seem slightly 
> better to have the UUID type in general since at least one engine supports it 
> and it provides the expected user experience with a compact representation.
> 
> IPv4 addresses are a good thing to think about as well, since most of the 
> same arguments apply. If we keep the UUID type, should we also add IPv4 or 
> IPv6 types? I would probably draw the line at UUID because it helps in joins, 
> which are an important operation. IPv4 representations aren't that big of an 
> inconvenience unless you need to do IP manipulation, which is typically in a 
> UDF and not the query engine. And you can always keep both representations in 
> a table fairly inexpensively. Does this sound like a valid rationale for 
> having UUID but not IP types?
> 
> Ryan
> 
>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <[email protected]> 
>> wrote:
>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native type. 
>> Which engines are you thinking of that have a native UUID type besides the 
>> Presto derivatives and support Iceberg?
>> 
>> I agree that Trino should expose a UUID type on top of Iceberg tables. All 
>> the user experience things that you are describing as important (compact 
>> storage, friendly display, ddl, clean literals) are possible without it 
>> being a first class type in Iceberg using a trino specific property.
>> 
>> I don't really have a strong opinion about UUID. In general, type bloat is 
>> probably just a part of this kind of project. Generally, CHAR(X) and 
>> VARCHAR(X) feel like much bigger concerns given that they exist in all of 
>> the engines but not Iceberg--especially when we start talking about views.
>> 
>> Some of this argues for physical vs logical type abstraction. (Something 
>> that was always challenging in Parquet but also helped to resolve how these 
>> types are managed in engines that don't support them.)
>> 
>> thanks,
>> Jacques
>> 
>> PS: Funny aside, the bloat on an IP address is actually worse than a UUID, 
>> right? IPv4 = 4 bytes. IPv4 string = up to 15 bytes. 15/4 => 275% bloat. 
>> UUID 36/16 => 125% bloat.
>> 
>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <[email protected]> wrote:
>>> I don't think this is just a problem in Trino.
>>> 
>>> If there is no UUID type, then a user must choose between a 36-byte string 
>>> and a 16-byte binary. That's not a good choice to force people into. If 
>>> someone chooses binary, then it's harder to work with rows and construct 
>>> queries even though there is a standard representation for UUIDs. To avoid 
>>> the user headache, people will probably choose to store values as strings. 
>>> Using a string would mean that more than half the value is needlessly 
>>> discarded by default in Iceberg lower/upper bounds instead of keeping the 
>>> entire value. And since engines don't know what's in the string, the full 
>>> value must be used in comparison, which is extra work and extra space.
>>> 
>>> Inflated values may not be a problem in some cases. IPv4 addresses are one 
>>> case where you could argue that it doesn't matter very much that they are 
>>> typically stored as strings. But I expect the use of UUIDs to be common for 
>>> ID columns because you can generate them without coordination (unlike an 
>>> incrementing ID) and that's a concern because the use as an ID makes them 
>>> likely to be join keys.
>>> 
>>> If we want the values to be stored as 16-byte fixed, then we need to make 
>>> it easy to get the expected string representation in and out, just like we 
>>> do with date/time types. I don't think that's specific to any engine.
>>> 
>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <[email protected]> 
>>>> wrote:
>>>> I think points 1&2 don't really apply since a fixed width binary already 
>>>> covers those properties. 
>>>> 
>>>> It seems like this isn't really a concern of Iceberg but rather a cosmetic 
>>>> layer that exists primarily (only?) in Trino. In that case I would be 
>>>> inclined to say that Trino should just use custom metadata and a fixed 
>>>> binary type. That way you still have the desired UX without exposing those 
>>>> extra concepts to Iceberg. It actually feels like better 
>>>> encapsulation imo. 
>>>> 
>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <[email protected]> 
>>>>> wrote:
>>>>> Hi,
>>>>> 
>>>>> I agree with Ryan that some precautions are needed before one can assume 
>>>>> uniqueness of UUID values, and that UUIDs shouldn't be special in this 
>>>>> respect.
>>>>> After all, this is just a primitive type, which is commonly used for 
>>>>> certain things, but "commonly" doesn't mean "always".
>>>>> 
>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>> The compact representation in the file, and compact representation in 
>>>>> memory in the query engine are the ones mentioned above.
>>>>> 
>>>>> The third layer is usability. Seeing a UUID column, I know what values 
>>>>> I can expect, so it's more descriptive than `id char(36)`.
>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), .... without 
>>>>> needing to cast to varchar.
>>>>> It also removes the temptation to cast uuid to varbinary to achieve a 
>>>>> compact representation.
>>>>> 
>>>>> Thus I think it would be good to have them.
>>>>> 
>>>>> Best
>>>>> PF
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <[email protected]> wrote:
>>>>>> The original reason why I added UUID to the spec was that I thought 
>>>>>> there would be opportunities to take advantage of UUIDs as unique values 
>>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID 
>>>>>> fields and how we might do something similar in Iceberg.
>>>>>> 
>>>>>> The reason we have thought about removing UUID is that there aren't as 
>>>>>> many opportunities to take advantage of UUIDs as I thought. My original 
>>>>>> assumption was that we could do things like bucket on UUID fields or 
>>>>>> assume that a UUID field has a high NDV. But that's not necessarily the 
>>>>>> case when a UUID field is a foreign key, only when it is used as an 
>>>>>> identifier or primary key. Before Jack added tracking for row identifier 
>>>>>> fields, we couldn't know that a UUID was unique in a table. As a result, 
>>>>>> we didn't invest in support for UUID.
>>>>>> 
>>>>>> Quick aside: Now that row identifier fields are tracked, we can do some 
>>>>>> of these things with the row identifier fields. Engines can assume that 
>>>>>> the tuple of row identifier fields is unique in a table for join 
>>>>>> estimation. And engines can use row identifier fields in sort keys to 
>>>>>> ensure lots of partition split locations (this is really important for 
>>>>>> Spark).
>>>>>> 
>>>>>> Coming back to UUIDs, the second reason to have a UUID type is still 
>>>>>> valid: it is better to represent UUIDs as fixed[16] than as 36-byte 
>>>>>> UTF-8 strings that are more than twice as large, or even worse UTF-16 
>>>>>> strings that are 4x as large. Since UUIDs are likely to be used in 
>>>>>> joins, this could really help engines as long as they can keep the 
>>>>>> values as fixed-width binary.
>>>>>> 
>>>>>> I could go either way on this. I think it is valuable to have a compact 
>>>>>> representation for UUIDs rather than using the string representation. 
>>>>>> But that will require investing in the type and building support in 
>>>>>> engines that won't take advantage of it. If Trino can use this, I think 
>>>>>> it may be worth keeping and investing in.
>>>>>> 
>>>>>> Ryan
>>>>>> 
>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <[email protected]> wrote:
>>>>>>> Yes, I agree with Jacques that fixed binary is what it is in the end. I 
>>>>>>> think it is more about user experience, whether the conversion is done 
>>>>>>> at the user side or Iceberg and engine side. Many people just store 
>>>>>>> UUID as a 36-byte string instead of a 16-byte binary, so with an 
>>>>>>> explicit UUID type, Iceberg can optimize this common use case 
>>>>>>> internally for users. There might be some other benefits I overlooked, 
>>>>>>> but maybe the complication introduced by this type does not really 
>>>>>>> justify the slightly better user experience. I am also on the fence 
>>>>>>> about it.
>>>>>>> 
>>>>>>> -Jack Ye
>>>>>>> 
>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau 
>>>>>>>> <[email protected]> wrote:
>>>>>>>> What specific arguments are there for it being a first class type 
>>>>>>>> besides that it exists elsewhere? Is there some kind of optimization 
>>>>>>>> Iceberg or an engine could do if it was typed versus just a bucket of 
>>>>>>>> bits? Fixed 
>>>>>>>> width binary seems to cover the cases I see in terms of actual 
>>>>>>>> functionality in the iceberg libraries or engines…
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <[email protected]> wrote:
>>>>>>>>> One conversation I used to come across regarding UUID deprecation was 
>>>>>>>>> from https://github.com/apache/iceberg/pull/1611 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Yan
>>>>>>>>> 
>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary 
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> Hi Joshua, 
>>>>>>>>>> 
>>>>>>>>>> I do not have a strong preference about the UUID type, but I would 
>>>>>>>>>> like to highlight that the type is handled inconsistently in 
>>>>>>>>>> Iceberg with different file formats. (See: 
>>>>>>>>>> https://github.com/apache/iceberg/issues/1881) 
>>>>>>>>>> 
>>>>>>>>>> If we keep the type, it would be good to standardize the handling in 
>>>>>>>>>> every file format. 
>>>>>>>>>> 
>>>>>>>>>> Thanks, Peter 
>>>>>>>>>> 
>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <[email protected]> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> Hi. 
>>>>>>>>>>> 
>>>>>>>>>>> UUID is a current data type according to the Iceberg spec 
>>>>>>>>>>> (https://iceberg.apache.org/spec/#primitive-types), but there seems 
>>>>>>>>>>> to have been some discussion about removing it? I could not find 
>>>>>>>>>>> the original discussion, but a reference to the discussion can be 
>>>>>>>>>>> found here (https://github.com/trinodb/trino/issues/6663). 
>>>>>>>>>>> 
>>>>>>>>>>> I generally agree with the consensus in the Trino issue to keep 
>>>>>>>>>>> UUID in Iceberg. To summarize… 
>>>>>>>>>>> 
>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are 
>>>>>>>>>>> supported
>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>> 
>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Ryan Blue
>>>>>> Tabular
>>> 
>>> 
>>> -- 
>>> Ryan Blue
>>> Tabular
> 
> 
> -- 
> Ryan Blue
> Tabular
