Viraj Jasani created PHOENIX-7357:
-------------------------------------

             Summary: New variable length binary data type: VARBINARY_ENCODED
                 Key: PHOENIX-7357
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7357
             Project: Phoenix
          Issue Type: New Feature
            Reporter: Viraj Jasani
            Assignee: Viraj Jasani
             Fix For: 5.3.0


As of today, Phoenix provides several variable length as well as fixed length 
data types. One of the variable length data types is VARBINARY, a variable 
length binary blob. Using VARBINARY as the only primary key column is 
equivalent to using the HBase row key directly.

HBase provides a single row key. A client application that needs more than one 
column in its primary key must itself handle storing all of the column values 
as a single binary row key. Phoenix removes this burden by providing composite 
primary keys. A composite primary key can contain any number of primary key 
columns. Phoenix also provides the ability to add new nullable primary key 
columns to an existing composite primary key. Phoenix uses HBase as its backing 
store. In order to let users define multiple primary key columns, Phoenix 
internally concatenates the binary encoded value of each primary key column 
and uses the concatenated binary value as the HBase row key. In order to 
efficiently concatenate as well as retrieve individual primary key values, 
Phoenix uses two approaches:
 # For fixed length columns: The length of the given column is determined by 
the maximum length of the column. As part of the read flow, while iterating 
through the row key, a fixed number of bytes is retrieved for the column. 
While writing, if the encoded value of the given column has fewer bytes than 
the fixed length, null bytes (\x00) are padded until the fixed length is 
filled up. Hence, for smaller values, we end up wasting some space.
 # For variable length columns: Since we cannot know the length of a value of 
a variable length data type in advance, a separator (terminator) byte is used. 
Phoenix uses the null byte (\x00) as the separator. As of today, VARCHAR is 
the most commonly used variable length data type, and since VARCHAR represents 
a String, the null byte is not a valid String character. Hence, it can be 
effectively used to determine where the given VARCHAR value terminates.
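
The two approaches above can be sketched as follows. This is a simplified 
illustration and not Phoenix's actual code: the class RowKeySketch, its method 
names, and the column layout (one fixed length column followed by two VARCHAR 
columns, with no separator after the last column) are assumptions made for the 
example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class RowKeySketch {
    static final byte SEPARATOR = 0x00;

    // Fixed length column: pad with \x00 up to the declared length.
    static byte[] padFixed(byte[] value, int fixedLength) {
        if (value.length > fixedLength) {
            throw new IllegalArgumentException("value longer than column length");
        }
        return Arrays.copyOf(value, fixedLength); // copyOf zero-fills the tail
    }

    // Row key layout assumed here: <fixed col> <varchar1> \x00 <varchar2>
    static byte[] encode(byte[] fixedCol, int fixedLength, String varchar1, String varchar2) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] padded = padFixed(fixedCol, fixedLength);
        out.write(padded, 0, padded.length);
        byte[] v1 = varchar1.getBytes(StandardCharsets.UTF_8);
        out.write(v1, 0, v1.length);
        out.write(SEPARATOR);
        byte[] v2 = varchar2.getBytes(StandardCharsets.UTF_8);
        out.write(v2, 0, v2.length);
        return out.toByteArray();
    }

    // Read flow: skip the fixed number of bytes, then scan for the separator.
    static String[] decodeVarchars(byte[] rowKey, int fixedLength) {
        int start = fixedLength;
        int sep = start;
        while (sep < rowKey.length && rowKey[sep] != SEPARATOR) {
            sep++;
        }
        String v1 = new String(rowKey, start, sep - start, StandardCharsets.UTF_8);
        String v2 = sep < rowKey.length
                ? new String(rowKey, sep + 1, rowKey.length - sep - 1, StandardCharsets.UTF_8)
                : "";
        return new String[] {v1, v2};
    }
}
```

With a fixed length of 4, a two byte value occupies four bytes of the row key, 
which is the wasted space mentioned above.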

 

The null byte (\x00) works fine as a separator for VARCHAR. However, it cannot 
be used as a separator byte for VARBINARY because a VARBINARY value can be any 
binary blob, including one that contains \x00. Due to this, Phoenix places two 
restrictions on the VARBINARY type: 

 
 # It can only be used as the last part of the composite primary key.
 # It cannot be used as a DESC order primary key column.
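
To see why a VARBINARY column cannot sit before another column, consider the 
sketch below. It is illustrative only: VarbinaryAmbiguity and join are 
hypothetical names, and join simply concatenates two binary values with a 
\x00 separator, as described for VARCHAR above.

```java
import java.util.Arrays;

// Hypothetical demo, for illustration only.
class VarbinaryAmbiguity {
    // Naive concatenation: <first> \x00 <second>
    static byte[] join(byte[] first, byte[] second) {
        byte[] out = new byte[first.length + 1 + second.length];
        System.arraycopy(first, 0, out, 0, first.length);
        // out[first.length] is already 0x00 and acts as the separator
        System.arraycopy(second, 0, out, first.length + 1, second.length);
        return out;
    }

    public static void main(String[] args) {
        // Two different pairs of binary values...
        byte[] keyA = join(new byte[] {0x01}, new byte[] {0x02, 0x00, 0x03});
        byte[] keyB = join(new byte[] {0x01, 0x00, 0x02}, new byte[] {0x03});
        // ...produce the same row key bytes: 01 00 02 00 03
        System.out.println(Arrays.equals(keyA, keyB));
    }
}
```

Since both pairs map to the same bytes, a reader of the row key cannot tell 
where the first value ends.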

 

Using the VARBINARY data type as an earlier part of the composite primary key 
is a valid use case, and so is using multiple VARBINARY primary key columns. 
After all, Phoenix lets users define multiple primary key columns.

Besides, creating a secondary index on a data table means that the composite 
primary key of the secondary index table consists of: 

<secondary-index-col1> <secondary-index-col2> … <secondary-index-colN> 
<primary-key-col1> <primary-key-col2> … <primary-key-colN>

 

As primary key columns are appended to the secondary index columns, an indexed 
VARBINARY column would not be the last part of the index row key. Hence, one 
cannot create a secondary index on any VARBINARY column.

The proposal of this Jira is to introduce a new data type, 
{*}VARBINARY_ENCODED{*}, which has no restriction on being used as a 
composite primary key prefix or as a DESC ordered column.

This means we need to effectively distinguish where the variable length binary 
data terminates in the absence of fixed length information.
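
One general technique for this problem is byte escaping. The sketch below is 
an assumption made for illustration and is not taken from this message: it 
does not claim to be the encoding that VARBINARY_ENCODED adopts. Every \x00 
inside a value is written as \x00\xFF, and each value is terminated by 
\x00\x01, so the terminator can never be confused with data. The class and 
method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical escape-based encoding, for illustration only.
class EscapedBinaryKey {
    static byte[] encode(List<byte[]> columns) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] col : columns) {
            for (byte b : col) {
                out.write(b);
                if (b == 0x00) {
                    out.write(0xFF); // escape a literal \x00 as \x00\xFF
                }
            }
            out.write(0x00); // terminator: \x00\x01
            out.write(0x01);
        }
        return out.toByteArray();
    }

    static List<byte[]> decode(byte[] rowKey) {
        List<byte[]> columns = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        int i = 0;
        while (i < rowKey.length) {
            byte b = rowKey[i];
            if (b != 0x00) {
                current.write(b);
                i++;
                continue;
            }
            if (i + 1 >= rowKey.length) {
                throw new IllegalArgumentException("dangling 0x00");
            }
            byte next = rowKey[i + 1];
            if (next == (byte) 0xFF) {
                current.write(0x00); // escaped literal \x00
            } else if (next == 0x01) {
                columns.add(current.toByteArray()); // end of this column
                current.reset();
            } else {
                throw new IllegalArgumentException("invalid escape sequence");
            }
            i += 2;
        }
        if (current.size() != 0) {
            throw new IllegalArgumentException("unterminated column");
        }
        return columns;
    }
}
```

With this scheme, the pairs (01, 02 00 03) and (01 00 02, 03) encode to 
different row keys and decode back to their original values. The cost is one 
extra byte per \x00 in the data plus two terminator bytes per column.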



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
