UUID/GUID discussion leading to request for hexstring bytea? (was: Re: [HACKERS] TODO: GUID datatype)

mark Wed, 06 Sep 2006 14:36:48 -0700

On Wed, Sep 06, 2006 at 07:49:09PM +0200, Gevik Babakhani wrote:
> I would like to know if anyone is working on the GUID datatype.
> If not, I am going to work on it. Please let me know.


Assuming you mean GUID in the same sense as UUID, there are many
non-core developers who would like to see it, or who have worked
on one.

There are at least three floating around that I am aware of, one of
which I contributed to, that works well, and one that I imagine works
well (the OSSP UUID library includes a PostgreSQL binding) both of
which include support for a UUID type, including the desired index
support and plpgsql function calls for generating, and manipulating
the UUID values.

I'm a little stuck at the moment, as I have time sapped away from me
by things like real work, and I'm trying to prepare a submission that
would be acceptable for core. I've read the thoughts of some of the
core developers on this subject, and agree with many of them.

The UUID type itself has value, however, the value it provides is
limited. Generation of a UUID doesn't have to occur with the database.
The application inserting the row can generate the UUID. The UUID type
itself has limited value, in that the difference between a 36 bytes +
4 bytes = 40 bytes as a fully expanded char(40), or the packed value
using the hexstring encode/decode SQL functions of a bytea type at 16
+ 4 bytes = 20 bytes compared to a tightly packed UUID type of 16
bytes, are very close. The argument can easily be made that if space
(disk space, index size, I/O bandwidth) is your first priority, than
a UUID is the wrong type to use. A 64-bit integer may suffice.

I'm also having trouble with the idea that a UUID deserves special
treatment. I currently have a desire to store both UUID and MD5
checksum in my rows. They are both 128 bits = 16 bytes, and fit all
of the same requirements above.

I would really like to have an MD5 checksum type now for the same
reason. It has a reasonable use that few could deny. Perhaps an MD5
checksum type would be more frequently used than a UUID type? More
systems these days are using the MD5 checksum as a unique identifier
for content. It has a few clever advantages. Assuming it really is
well distributed, and extremely unlikely for overlap to occur within
a system, the MD5 checksum has the advantage of automatically dealing
with duplicate values. In my case, I have chosen to identify uploaded
jpeg images by their MD5 checksum.

This makes it seem as if a generic 128-bit data type would be
desirable. They both have a compatible representation of a hexadecimal
string. The extra '-' characters in the UUID can be easily added when
necessary by a HEXSTRING2UUID() sort of function in plpgsql or in
the claling application.

But - MD5 isn't the only checksum that is frequently used. Some argue
that the MD5 can be shown to be weak in some regards, and that perhaps
other checksum algorithms such as SHA-1 provide a better guarantee of
uniqueness. SHA-1 isn't 128-bits. It's 160-bits.

This is where I start to buy Tom Lane's argument that the 4-byte
prefix is no big deal. I find it more desirable to have a binary data
type with a hexadecimal string input and output function. The
flexibility of being able to use 128-bits or 160-bits is worth this
4-byte header to me. What I don't want to do is store double size
fields, stored as hexadecimal.

This leads to a few options:

    1) Create specific types as necessary, with associated functions.
       No overhead.
           - uuid, md5sum, sha1sum, ...

    2) Create semi-generic types with common bitlengths. Associated
       functions work on these semi-generic types. No overhead.
           - hexstring128, hexstring160, ...

    3) Create a new bytea type that has ascii input and output formats,
       probably based around hexstrings. Overhead of 4 bytes.

    4) Use varchar to store these types, and provide associated
       functions that return character strings in the right format.
       This follows the existing md5sum() PostgreSQL function. Overhead
       is double the size of the data.

    5) Use bytea to store these types, and the encode/decode functions
       are passed character strings in the right format. Possibly
       complicated for the application to deal with, as well as a user
       typing SQL commands. Overhead of 4 bytes.

As I said - I'm stuck. Not sure which way to go. I currently use a uuid
type that I and another person on this list wrote against the OSSP UUID
library. For the MD5 checksum, I use char(32).

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED]     
__________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

UUID/GUID discussion leading to request for hexstring bytea? (was: Re: [HACKERS] TODO: GUID datatype)

Reply via email to