[PR] PHOENIX-7357 New variable length binary data type: VARBINARY_ENCODED [phoenix]

via GitHub Thu, 11 Jul 2024 23:34:39 -0700


virajjasani opened a new pull request, #1937:
URL: https://github.com/apache/phoenix/pull/1937

Jira: PHOENIX-7357

Design Doc:
https://docs.google.com/document/d/1-eZ4cbLVEPC4z7Iy1U_VJgNaKC9odkRVzcVPV3ZqOcI/edit?usp=sharing

As of today, Phoenix provides several variable-length as well as
fixed-length data types. One of the variable-length data types is VARBINARY,
a variable-length binary blob. A table whose only primary key column is
VARBINARY behaves much like a table keyed directly on the HBase row key.

HBase provides a single row key. A client application that needs more than
one primary key column must handle this itself when using HBase directly, by
storing the column values together as a single binary row key. Phoenix
removes this burden by providing composite primary keys, which can contain
any number of primary key columns. Phoenix also provides the ability to add
new nullable primary key columns to an existing composite primary key.
Because Phoenix uses HBase as its backing store, it supports multiple primary
key columns by concatenating the binary encoded value of each primary key
column and using the concatenated value as the HBase row key. To concatenate
the values efficiently and still retrieve each one individually, Phoenix
uses two schemes:

- For fixed length columns: The length of the column is determined by its
  maximum length. On read, while iterating through the row key, that fixed
  number of bytes is retrieved. On write, if the encoded value has fewer
  bytes than the fixed length, null bytes (\x00) are appended as padding
  until the fixed length is filled. Hence, smaller values waste some space.
- For variable length columns: Since the length of a variable-length value
  cannot be known in advance, a separator (terminator) byte is used. Phoenix
  uses the null byte (\x00) as the separator. As of today, VARCHAR is the
  most commonly used variable-length data type, and since VARCHAR represents
  a String, the null byte is not a valid String character. Hence, it can
  reliably mark where a given VARCHAR value ends.
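
To make the two schemes concrete, here is a minimal sketch in Python. It is
not Phoenix code: the three-column key (a fixed-width `code` followed by two
VARCHAR-like columns), the function names, and the choice to omit the
separator after the last column are all assumptions made for illustration.

```python
SEPARATOR = b"\x00"  # null byte used as the variable-length terminator


def encode_key(code: str, name: str, city: str, code_width: int = 4) -> bytes:
    """Build a row key from one fixed-length and two variable-length columns."""
    code_bytes = code.encode("utf-8")
    if len(code_bytes) > code_width:
        raise ValueError("code exceeds the fixed column width")
    # Fixed-length column: pad with null bytes up to the declared width.
    fixed = code_bytes.ljust(code_width, b"\x00")
    # Variable-length columns: a separator marks where the first one ends.
    # (Assumption: the last column needs no trailing separator.)
    return fixed + name.encode("utf-8") + SEPARATOR + city.encode("utf-8")


def decode_key(row_key: bytes, code_width: int = 4):
    """Split a row key produced by encode_key back into its three columns."""
    # Fixed-length column: read exactly code_width bytes, drop the padding.
    code = row_key[:code_width].rstrip(b"\x00").decode("utf-8")
    # Variable-length column: read up to the first separator byte.
    name_bytes, _, city_bytes = row_key[code_width:].partition(SEPARATOR)
    return code, name_bytes.decode("utf-8"), city_bytes.decode("utf-8")


key = encode_key("US", "Ann", "Oslo")
# key == b"US\x00\x00Ann\x00Oslo": two padding bytes, then one separator.
print(decode_key(key))
```

The two padding bytes after `US` are the wasted space mentioned above, and
the decoder relies on `name` never containing a null byte.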

The null byte (\x00) works fine as a separator for VARCHAR. However, it
cannot be used as a separator for VARBINARY, because a VARBINARY value can
contain arbitrary bytes, including \x00. Due to this, Phoenix places two
restrictions on the VARBINARY type:

- It can only be used as the last part of the composite primary key.
- It cannot be used as a DESC order primary key column.
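
The ambiguity behind the first restriction can be shown with a short Python
sketch. The helper `naive_key` is hypothetical and simply applies the
VARCHAR-style null separator to two binary values.

```python
SEPARATOR = b"\x00"


def naive_key(first: bytes, second: bytes) -> bytes:
    """Join two binary columns with a null separator, as done for VARCHAR."""
    return first + SEPARATOR + second


# Two different pairs of column values ...
key_a = naive_key(b"a\x00", b"b")
key_b = naive_key(b"a", b"\x00b")

# ... produce the same row key, so the original values cannot be recovered.
print(key_a == key_b)  # True: both are b"a\x00\x00b"

# Splitting on the first separator also returns the wrong first column.
first, _, rest = key_a.partition(SEPARATOR)
print(first, rest)  # b'a' b'\x00b', not the original b'a\x00' and b'b'
```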

Using the VARBINARY data type as an earlier part of the composite primary
key is a valid use case, and so is using multiple VARBINARY primary key
columns. After all, Phoenix lets users define multiple primary key columns.

Besides, creating a secondary index on a data table means that the composite
primary key of the secondary index table consists of:

```
<secondary-index-col1> <secondary-index-col2> … <secondary-index-colN>
<primary-key-col1> <primary-key-col2> … <primary-key-colN>
```

Because the primary key columns are appended to the secondary index columns,
an indexed VARBINARY column would never be the last part of the index row
key. Hence, one cannot create a secondary index on any VARBINARY column.
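
As a rough illustration of this reasoning, the Python sketch below builds the
index row key column order and checks it against the "last part only" rule.
The table, the column names, and both helper functions are hypothetical and
do not correspond to Phoenix APIs.

```python
def index_row_key_columns(indexed, primary_key):
    """Index row key layout: indexed columns first, then the data table PK."""
    return list(indexed) + list(primary_key)


def violates_varbinary_rule(key_columns):
    """True if a VARBINARY column appears anywhere except the last position."""
    return any(col_type == "VARBINARY" for _, col_type in key_columns[:-1])


# Hypothetical data table with PRIMARY KEY (TENANT_ID, ID).
data_table_pk = [("TENANT_ID", "VARCHAR"), ("ID", "VARCHAR")]

# Indexing a VARBINARY column puts it ahead of the primary key columns.
index_key = index_row_key_columns([("PAYLOAD", "VARBINARY")], data_table_pk)
print([name for name, _ in index_key])  # ['PAYLOAD', 'TENANT_ID', 'ID']
print(violates_varbinary_rule(index_key))  # True
```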

The proposal of this Jira is to introduce a new data type, VARBINARY_ENCODED,
which has no such restrictions: it can be used as any part of a composite
primary key, including a prefix, and as a DESC ordered column.

This means we need a reliable way to determine where the variable-length
binary data terminates in the absence of fixed-length information.
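
One generic way to achieve this is byte escaping (byte stuffing): rewrite
every null byte inside the value as a two-byte escape and end the value with
a distinct two-byte terminator. The Python sketch below illustrates that
general technique only. The specific byte values (`\x00\xff` for an escaped
null, `\x00\x01` for the terminator) and the function names are assumptions
for illustration; this description does not state the exact encoding used,
so refer to the design doc for the actual scheme.

```python
ESCAPED_NULL = b"\x00\xff"  # assumed escape for a null byte inside the value
TERMINATOR = b"\x00\x01"    # assumed end-of-value marker


def encode_varbinary(value: bytes) -> bytes:
    """Escape embedded null bytes, then append the terminator."""
    return value.replace(b"\x00", ESCAPED_NULL) + TERMINATOR


def decode_varbinary(buf: bytes, offset: int = 0):
    """Decode one value starting at offset; return (value, next_offset)."""
    out = bytearray()
    i = offset
    while i < len(buf):
        if buf[i] != 0:
            out.append(buf[i])
            i += 1
            continue
        if i + 1 >= len(buf):
            raise ValueError("truncated escape sequence")
        marker = buf[i + 1]
        if marker == 0x01:      # terminator: the value ends here
            return bytes(out), i + 2
        if marker == 0xFF:      # escaped null byte
            out.append(0)
            i += 2
            continue
        raise ValueError("invalid escape sequence")
    raise ValueError("missing terminator")


# Two binary columns in one row key, the first containing a null byte.
row_key = encode_varbinary(b"a\x00") + encode_varbinary(b"b")
first, pos = decode_varbinary(row_key)
second, _ = decode_varbinary(row_key, pos)
print(first, second)  # b'a\x00' b'b'
```

With these assumed byte values, the terminator sorts below the escaped null,
so a value still sorts before any longer value it is a prefix of in
ascending order. DESC ordering would need separate handling and is not
covered by this sketch.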
