alejandro-anadon commented on PR #1458:
URL: https://github.com/apache/phoenix/pull/1458#issuecomment-1181518834
hi,
if I may, I will start with the last sentence:"for a fixed length type
having 0x00 or 0xFF as part of the rowkey should be fine for indexing".
This is 100% correct. But the problem is when you want to make an index with
a fixed length type that is not part of the rowkey.
Since it is not part of the rowkey, it can be null.
And there is the need to create a variable length type, PUUIDIndexable (and
that can not have neither 0x00 or 0xFF) although when it comes to the truth, it
will always have the same length; but it has to be declared as a variable
length.
this translate is performed in the class
"org.apache.phoenix.util.org.apache.phoenix.util" in the method
"getIndexColumnDataType":
...
// Since we cannot have nullable fixed length in a row key
// we need to translate to variable length. The verification that we
have a valid index
// row key was already done, so here we just need to convert from one
built-in type to
// another.
public static PDataType getIndexColumnDataType(boolean isNullable,
PDataType dataType) {
....
For example, the INTEGER type is translated to DECIMAL when you want to make
an index and it is not PK. look at the following example with the field
"NOT_IN_PK". In the table it is of type INTEGER, but in the index it is
translated to type DECIMAL:
..............
0: jdbc:phoenix:> CREATE TABLE DUMMY (ID INTEGER NOT NULL, NOT_IN_PK
INTEGER, CONSTRAINT NAME_PK PRIMARY KEY(ID));
No rows affected (4.273 seconds)
0: jdbc:phoenix:> !describe DUMMY
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | DATA_TYPE | TYPE_NAME
| COLUMN_SIZE | BUFFER_LENGTH | DECIMAL_DIGITS | NUM_PREC_RADIX | NULLABLE | R |
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
| | | DUMMY | ID | 4 | INTEGER
| null | null | null | null | 0 | |
| | | DUMMY | NOT_IN_PK | 4 | INTEGER
| null | null | null | null | 1 | |
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
0: jdbc:phoenix:> CREATE INDEX dummy_idx ON DUMMY(NOT_IN_PK);
No rows affected (7.513 seconds)
0: jdbc:phoenix:> !describe dummy_idx
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | DATA_TYPE | TYPE_NAME
| COLUMN_SIZE | BUFFER_LENGTH | DECIMAL_DIGITS | NUM_PREC_RADIX | NULLABLE | R |
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
| | | DUMMY_IDX | 0:NOT_IN_PK | 3 | DECIMAL
| null | null | null | null | 1 | |
| | | DUMMY_IDX | :ID | 4 | INTEGER
| null | null | null | null | 0 | |
+-----------+-------------+------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---+
.............
In the case of UUID, it would be exactly the same:
- if it is in PK (NOT NULL) or it is a field outside the PK (NULLABLE), it
would be of type UUID.
- But if we want to make an index on the field that is not PK, the index is
translated to UUID_INDEXABLE:
.......
0: jdbc:phoenix:> CREATE TABLE DUMMY_WITH_UUID (ID UUID NOT NULL, NOT_IN_PK
UUID, CONSTRAINT NAME_PK PRIMARY KEY(ID));
No rows affected (1.241 seconds)
0: jdbc:phoenix:> !describe DUMMY_WITH_UUID
+-----------+-------------+-----------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+---------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | DATA_TYPE |
TYPE_NAME | COLUMN_SIZE | BUFFER_LENGTH | DECIMAL_DIGITS | NUM_PREC_RADIX |
NULLABL |
+-----------+-------------+-----------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+---------+
| | | DUMMY_WITH_UUID | ID | 2100 | UUID
| null | null | null | null | 0 |
| | | DUMMY_WITH_UUID | NOT_IN_PK | 2100 | UUID
| null | null | null | null | 1 |
+-----------+-------------+-----------------+-------------+-----------+-----------+-------------+---------------+----------------+----------------+---------+
0: jdbc:phoenix:> CREATE INDEX dummy_with_uuid_idx ON
DUMMY_WITH_UUID(NOT_IN_PK);
No rows affected (7.341 seconds)
0: jdbc:phoenix:> !describe dummy_with_uuid_idx
+-----------+-------------+---------------------+-------------+-----------+----------------+-------------+---------------+----------------+-----------------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | DATA_TYPE |
TYPE_NAME | COLUMN_SIZE | BUFFER_LENGTH | DECIMAL_DIGITS | NUM_PREC_RADIX |
+-----------+-------------+---------------------+-------------+-----------+----------------+-------------+---------------+----------------+-----------------+
| | | DUMMY_WITH_UUID_IDX | 0:NOT_IN_PK | 2101 |
UUID_INDEXABLE | null | null | null | null
|
| | | DUMMY_WITH_UUID_IDX | :ID | 2100 |
UUID | null | null | null | null
|
+-----------+-------------+---------------------+-------------+-----------+----------------+-------------+---------------+----------------+-----------------+
..............
This would be the only (but very important) case in which the UUID_INDEXABLE
type would be used and it would be at internal level. What would have to be
done is not to show (or not to recommend its use) to the public.
Going back to the beginning of the comment, in which you say that when you
have seen UUIDs used, what is done is to use "a uuid generator in a string or
similar and attempt to push it".
it is true that it can be done that way.
But that can be done with all data types. You can always convert, for
example, an integer to String with 'Integer.toString(i)' and do a push.
And when retrieved do an 'Integer.parseInt(rs.getString(1))'.
Why not do that with integers, doubles, etc.? because:
1) firstly prevents errors of type 'ps.setString(1,Integer.parseInt("not a
number"))' making it much safer and with fewer errors than with a
'ps.setInt(1,i)'.
2) it is much more convenient for the programmer not to have to deal with
conversions from one type to another.
The question is: Does UUID have enough entity to have its own type? Since
this type is being more and more widely used and many databases are
implementing it (for example,
https://www.postgresql.org/docs/current/datatype-uuid.html), in my opinion it
does.
Regarding the use case you are asking about, I have given as an example the
use as PK. But there are more interesting use cases.
Some examples are (text extracted from
https://www.mysqltutorial.org/mysql-uuid/ ):
"[...]Using UUID for a primary key brings the following advantages:
UUID values are unique across tables, databases, and even servers that
allow you to merge rows from different databases
or distribute databases across servers.
UUID values do not expose the information about your data so they are
safer to use in a URL. For example, if a customer
with id 10 accesses his account via
http://www.example.com/customers/10/ URL, it is easy to guess that there is a
customer 11, 12, etc., and this could be a target for an attack.
UUID values can be generated anywhere that avoid a round trip to the
database server. It also simplifies logic in the application.
For example, to insert data into a parent table and child
tables, you have to insert into the parent table first,
get generated id and then insert data into the child tables. By
using UUID, you can generate the primary key value
of the parent table up front and insert rows into both parent
and child tables at the same time within a transaction."
Of course it has some cons, but I think the benefits outweigh the cons.
The most difficult part to solve is the possibility of a UUID collision at
the moment of its generation and that it coincides with one that is in the
database.
in Phoenix.
It would be perfect if Phoenix would include something that guarantees that
uniqueness when doing the upsert. As you say, to differentiate a "put" from an
"update".
But that is far from my knowledge. What I can say is that this dilemma
occurs with any table that is created in phoenix:
...
CREATE TABLE DUMMY_WITH_CHAR (ID CHAR(16) NOT NULL, NOT_IN_PK CHAR(16),
CONSTRAINT NAME_PK PRIMARY KEY(ID));
...
Is there any mechanism (except a select prior to the upsert) that when doing
an upsert guarantees me that the id does not exist previously? to the best of
my knowledge no.
(I guess that's why there is "upsert" (insert or update) and not "insert" or
"update").
Returning to the subject of collisions, while they are mathematically
possible,
as well as a possible SHA256 collision used by banks, the chances of
collision in UUIDs are close to zero.
Considering the scope of use of Phoenix, the consequences of a collision are
assumable. I mean:
if I had to create a military, medical or banking system, where the
repercussions of a UUID collision would be severe, I would not use Phoenix.
But if I use Phoenix for systems with huge masses of data (which is its
ultimate goal), a remote UUID collision, I would consider it acceptable.
And if in the end I decide to use it and a collision is critical, I would do
a select prior to upsert.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]