alejandro-anadon commented on PR #1458:
URL: https://github.com/apache/phoenix/pull/1458#issuecomment-1181518834

   hi,
   
   if I may, I will start with the last sentence:"for a fixed length type 
having 0x00 or 0xFF as part of the rowkey should be fine for indexing".
   
   This is 100% correct. But the problem is when you want to make an index with 
a fixed length type that is not part of the rowkey.
   
   Since it is not part of the rowkey, it can be null.
   
   And there is the need to create a variable length type, PUUIDIndexable (and 
that can not have neither 0x00 or 0xFF) although when it comes to the truth, it 
will always have the same length; but it has to be declared as a variable 
length.
   
   this translate is performed in the class 
"org.apache.phoenix.util.org.apache.phoenix.util" in the method 
"getIndexColumnDataType":
   
   ...
       // Since we cannot have nullable fixed length in a row key
       // we need to translate to variable length. The verification that we 
have a valid index
       // row key was already done, so here we just need to convert from one 
built-in type to
       // another.
       public static PDataType getIndexColumnDataType(boolean isNullable, 
PDataType dataType) {
   ....
   
   For example, the INTEGER type is translated to DECIMAL when you want to make 
an index and it is not PK. look at the following example with the field
   "NOT_IN_PK". In the table it is of type INTEGER, but in the index it is 
translated to type DECIMAL:
        
   ..............
   0: jdbc:phoenix:> CREATE TABLE DUMMY (ID INTEGER NOT NULL, NOT_IN_PK 
INTEGER, CONSTRAINT NAME_PK PRIMARY KEY(ID));
   No rows affected (4.273 seconds)
   0: jdbc:phoenix:> !describe DUMMY
   
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
   | TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | DATA_TYPE | TYPE_NAME 
| COLUMN_SIZE | BUFFER_LENGTH | DECIMAL_DIGITS | NUM_PREC_RADIX | NULLABLE | R |
   
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
   |           |             | DUMMY      | ID          | 4         | INTEGER   
| null        | null          | null           | null           | 0        |   |
   |           |             | DUMMY      | NOT_IN_PK   | 4         | INTEGER   
| null        | null          | null           | null           | 1        |   |
   
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
   0: jdbc:phoenix:> CREATE INDEX dummy_idx ON DUMMY(NOT_IN_PK);
   No rows affected (7.513 seconds)
   0: jdbc:phoenix:> !describe dummy_idx
   
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
   | TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | DATA_TYPE | TYPE_NAME 
| COLUMN_SIZE | BUFFER_LENGTH | DECIMAL_DIGITS | NUM_PREC_RADIX | NULLABLE | R |
   
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
   |           |             | DUMMY_IDX  | 0:NOT_IN_PK | 3         | DECIMAL   
| null        | null          | null           | null           | 1        |   |
   |           |             | DUMMY_IDX  | :ID         | 4         | INTEGER   
| null        | null          | null           | null           | 0        |   |
   
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
   .............
   
   In the case of UUID, it would be exactly the same:
   
   - if it is in PK (NOT NULL) or it is a field outside the PK (NULLABLE), it 
would be of type UUID.
   - But if we want to make an index on the field that is not PK, the index is 
translated to UUID_INDEXABLE:
   
   .......
   0: jdbc:phoenix:> CREATE TABLE DUMMY_WITH_UUID (ID UUID NOT NULL, NOT_IN_PK 
UUID, CONSTRAINT NAME_PK PRIMARY KEY(ID));
   No rows affected (1.241 seconds)
   0: jdbc:phoenix:> !describe DUMMY_WITH_UUID
   
+-----------+-------------+-----------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+---------+
   | TABLE_CAT | TABLE_SCHEM |   TABLE_NAME    | COLUMN_NAME | DATA_TYPE | 
TYPE_NAME | COLUMN_SIZE | BUFFER_LENGTH | DECIMAL_DIGITS | NUM_PREC_RADIX | 
NULLABL |
   
+-----------+-------------+-----------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+---------+
   |           |             | DUMMY_WITH_UUID | ID          | 2100      | UUID 
     | null        | null          | null           | null           | 0       |
   |           |             | DUMMY_WITH_UUID | NOT_IN_PK   | 2100      | UUID 
     | null        | null          | null           | null           | 1       |
   
+-----------+-------------+-----------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+---------+
   0: jdbc:phoenix:>  CREATE INDEX dummy_with_uuid_idx ON 
DUMMY_WITH_UUID(NOT_IN_PK);
   No rows affected (7.341 seconds)
   0: jdbc:phoenix:> !describe dummy_with_uuid_idx
   
+-----------+-------------+---------------------+-------------+-----------+----------------+-------------+---------------+----------------+-----------------+
   | TABLE_CAT | TABLE_SCHEM |     TABLE_NAME      | COLUMN_NAME | DATA_TYPE |  
 TYPE_NAME    | COLUMN_SIZE | BUFFER_LENGTH | DECIMAL_DIGITS | NUM_PREC_RADIX  |
   
+-----------+-------------+---------------------+-------------+-----------+----------------+-------------+---------------+----------------+-----------------+
   |           |             | DUMMY_WITH_UUID_IDX | 0:NOT_IN_PK | 2101      | 
UUID_INDEXABLE | null        | null          | null           | null            
|
   |           |             | DUMMY_WITH_UUID_IDX | :ID         | 2100      | 
UUID           | null        | null          | null           | null            
|
   
+-----------+-------------+---------------------+-------------+-----------+----------------+-------------+---------------+----------------+-----------------+
   ..............
   
   This would be the only (but very important) case in which the UUID_INDEXABLE 
type would be used and it would be at internal level. What would have to be 
done is not to show (or not to recommend its use) to the public.
   
   
   Going back to the beginning of the comment, in which you say that when you 
have seen UUIDs used, what is done is to use "a uuid generator in a string or 
similar and attempt to push it".
   it is true that it can be done that way.
   
   But that can be done with all data types. You can always convert, for 
example, an integer to String with 'Integer.toString(i)' and do a push.
   And when retrieved do an 'Integer.parseInt(rs.getString(1))'.
   
   Why not do that with integers, doubles, etc.? because:
   1) firstly prevents errors of type 'ps.setString(1,Integer.parseInt("not a 
number"))' making it much safer and with fewer errors than with a 
'ps.setInt(1,i)'.
   2) it is much more convenient for the programmer not to have to deal with 
conversions from one type to another.
   
   
   The question is: Does UUID have enough entity to have its own type? Since 
this type is being more and more widely used and many databases are 
implementing it (for example, 
https://www.postgresql.org/docs/current/datatype-uuid.html), in my opinion it 
does. 
   
   
   Regarding the use case you are asking about, I have given as an example the 
use as PK. But there are more interesting use cases.
   
   Some examples are (text extracted from 
https://www.mysqltutorial.org/mysql-uuid/ ):
   
        "[...]Using UUID for a primary key  brings the following advantages:
   
       UUID values are unique across tables, databases, and even servers that 
allow you to merge rows from different databases 
                or distribute databases across servers.
       UUID values do not expose the information about your data so they are 
safer to use in a URL. For example, if a customer 
                with id 10 accesses his account via 
http://www.example.com/customers/10/ URL, it is easy to guess that there is a 
                customer 11, 12, etc., and this could be a target for an attack.
       UUID values can be generated anywhere that avoid a round trip to the 
database server. It also simplifies logic in the application. 
                For example, to insert data into a parent table and child 
tables, you have to insert into the parent table first, 
                get generated id and then insert data into the child tables. By 
using UUID, you can generate the primary key value
                of the parent table up front and insert rows into both parent 
and child tables at the same time within a transaction."
                
   Of course it has some cons, but I think the benefits outweigh the cons.
   
   
   The most difficult part to solve is the possibility of a UUID collision at 
the moment of its generation and that it coincides with one that is in the 
database.
   in Phoenix.
   
   It would be perfect if Phoenix would include something that guarantees that 
uniqueness when doing the upsert. As you say, to differentiate a "put" from an 
"update".
   But that is far from my knowledge. What I can say is that this dilemma 
occurs with any table that is created in phoenix:
   
   ...
    CREATE TABLE DUMMY_WITH_CHAR (ID CHAR(16) NOT NULL, NOT_IN_PK CHAR(16), 
CONSTRAINT NAME_PK PRIMARY KEY(ID));
   ...
   
   Is there any mechanism (except a select prior to the upsert) that when doing 
an upsert guarantees me that the id does not exist previously? to the best of 
my knowledge no.
   (I guess that's why there is "upsert" (insert or update) and not "insert" or 
"update").
   
   Returning to the subject of collisions, while they are mathematically 
possible, 
   as well as a possible SHA256 collision used by banks, the chances of 
collision in UUIDs are close to zero.
   
   Considering the scope of use of Phoenix, the consequences of a collision are 
assumable. I mean:
   if I had to create a military, medical or banking system, where the 
repercussions of a UUID collision would be severe, I would not use Phoenix.
   But if I use Phoenix for systems with huge masses of data (which is its 
ultimate goal), a remote UUID collision, I would consider it acceptable.
   And if in the end I decide to use it and a collision is critical, I would do 
a select prior to upsert.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to