virajjasani opened a new pull request, #1937:
URL: https://github.com/apache/phoenix/pull/1937

   Jira: PHOENIX-7357
   
   Design Doc: 
https://docs.google.com/document/d/1-eZ4cbLVEPC4z7Iy1U_VJgNaKC9odkRVzcVPV3ZqOcI/edit?usp=sharing
   
   As of today, Phoenix provides several variable-length as well as 
fixed-length data types. One of the variable-length data types is VARBINARY, 
a variable-length binary blob. Using VARBINARY as the only primary key column 
is equivalent to using the HBase row key directly.
   
   HBase provides a single row key. A client application that needs more than 
one primary key column on plain HBase has to take care of packing those column 
values into a single binary row key itself. Phoenix removes this burden by 
providing composite primary keys. A composite primary key can contain any 
number of primary key columns. Phoenix also allows adding new nullable primary 
key columns to an existing composite primary key. Phoenix uses HBase as its 
backing store. To let users define multiple primary key columns, Phoenix 
internally concatenates the binary-encoded value of each primary key column 
and uses the concatenated binary value as the HBase row key. To concatenate 
and retrieve individual primary key values efficiently, Phoenix handles 
columns in two ways:
   
   For fixed length columns: The length of the column is determined by its 
maximum length. On the read path, while iterating through the row key, a fixed 
number of bytes is read for the column. On the write path, if the encoded 
value has fewer bytes than the fixed length, null bytes (\x00) are padded 
until the fixed length is filled. Hence, smaller values waste some space.
   For variable length columns: Since the length of a variable-length value 
cannot be known in advance, a separator (terminator) byte is used. Phoenix 
uses the null byte (\x00) as the separator. As of today, VARCHAR is the most 
commonly used variable-length data type, and since VARCHAR represents a 
String, the null byte is not among the valid String characters. Hence, it can 
be used to determine where a given VARCHAR value terminates.
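
The two strategies above can be sketched as follows. This is a minimal illustration, not Phoenix code: the class `RowKeySketch` and its methods are hypothetical names, and the sketch assumes that the separator is written only between columns (not after the last one) and that VARCHAR values are UTF-8 encoded.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names are Phoenix APIs.
class RowKeySketch {
    static final byte SEPARATOR = 0x00;

    // Fixed-length column: pad the encoded value with \x00 up to the declared width.
    static byte[] padFixed(byte[] value, int width) {
        if (value.length > width) {
            throw new IllegalArgumentException("value exceeds column width");
        }
        return Arrays.copyOf(value, width); // copyOf zero-fills the tail
    }

    // Row key layout: <fixed column, padded> <varchar1> \x00 <varchar2>
    static byte[] buildRowKey(byte[] fixed, int width, String v1, String v2) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] padded = padFixed(fixed, width);
        out.write(padded, 0, padded.length);
        byte[] b1 = v1.getBytes(StandardCharsets.UTF_8);
        out.write(b1, 0, b1.length);
        out.write(SEPARATOR);
        byte[] b2 = v2.getBytes(StandardCharsets.UTF_8);
        out.write(b2, 0, b2.length);
        return out.toByteArray();
    }

    // Variable-length read: scan from offset to the next separator or the end of the key.
    static String readVarchar(byte[] key, int offset) {
        int end = offset;
        while (end < key.length && key[end] != SEPARATOR) {
            end++;
        }
        return new String(key, offset, end - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] key = buildRowKey(new byte[] {0x41}, 4, "ab", "cd");
        // The fixed column occupies bytes 0..3, so the first VARCHAR starts at offset 4.
        System.out.println(readVarchar(key, 4));
        // Skip "ab" (2 bytes) and the separator (1 byte) to reach the second VARCHAR.
        System.out.println(readVarchar(key, 7));
    }
}
```

The reader needs no length prefix: the fixed-length column is skipped by width, and each VARCHAR ends at the next \x00.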
    
   
   The null byte (\x00) works fine as a separator for VARCHAR. However, it 
cannot be used as a separator byte for VARBINARY because a VARBINARY value can 
contain arbitrary bytes, including \x00 itself. Due to this, Phoenix places 
the following restrictions on the VARBINARY type: 
   
    
   
   - It can only be used as the last part of the composite primary key.
   - It cannot be used as a DESC order primary key column.
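
To see why a plain \x00 separator is ambiguous for binary values, consider the small sketch below. It is illustrative only: `AmbiguitySketch` and `naiveConcat` are hypothetical names, and the sketch simply joins two binary values with a single \x00 byte, as would happen if VARBINARY were treated like VARCHAR.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration only; not a Phoenix API.
class AmbiguitySketch {
    // Join two binary column values with a single \x00 separator.
    static byte[] naiveConcat(byte[] first, byte[] second) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(first, 0, first.length);
        out.write(0x00);
        out.write(second, 0, second.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Two different column pairs...
        byte[] keyA = naiveConcat(new byte[] {0x01}, new byte[] {0x00, 0x02});
        byte[] keyB = naiveConcat(new byte[] {0x01, 0x00}, new byte[] {0x02});
        // ...both produce the bytes 01 00 00 02, so a reader cannot tell them apart.
        System.out.println(Arrays.equals(keyA, keyB));
    }
}
```

Because both pairs collapse to the same row key, the boundary between the two values cannot be recovered once a VARBINARY value is followed by another column.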
    
   
   Using a VARBINARY column in an earlier position of the composite primary 
key is a valid use case, as is using multiple VARBINARY primary key columns. 
After all, Phoenix lets users define multiple primary key columns.
   
   Besides, creating a secondary index on a data table means that the 
composite primary key of the secondary index table consists of: 
   
   ```
   <secondary-index-col1> <secondary-index-col2> … <secondary-index-colN> 
<primary-key-col1> <primary-key-col2> … <primary-key-colN>
   ```
   
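   Because the data table's primary key columns are appended after the 
secondary index columns, an indexed VARBINARY column can never be the last 
part of the index row key. As a result, one cannot create a secondary index on 
any VARBINARY column.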
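
The consequence for secondary indexes can be sketched as a simple check. The names below (`IndexKeyLayoutSketch`, `indexKeyTypes`, `hasNonTrailingVarbinary`) are hypothetical and only model the column ordering described above; they are not how Phoenix validates index definitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not a Phoenix API.
class IndexKeyLayoutSketch {
    // Index row key layout: indexed columns first, then the data table's primary key columns.
    static List<String> indexKeyTypes(List<String> indexedTypes, List<String> primaryKeyTypes) {
        List<String> all = new ArrayList<>(indexedTypes);
        all.addAll(primaryKeyTypes);
        return all;
    }

    // True if a VARBINARY column appears anywhere except the last position of the key.
    static boolean hasNonTrailingVarbinary(List<String> keyTypes) {
        for (int i = 0; i < keyTypes.size() - 1; i++) {
            if ("VARBINARY".equals(keyTypes.get(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Indexing a VARBINARY column on a table whose primary key is a single VARCHAR.
        List<String> key = indexKeyTypes(Arrays.asList("VARBINARY"), Arrays.asList("VARCHAR"));
        System.out.println(key + " violates the restriction: " + hasNonTrailingVarbinary(key));
    }
}
```

Since the data table always contributes at least one primary key column after the indexed columns, an indexed VARBINARY column always ends up in a non-trailing position.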
   
   The proposal of this Jira is to introduce a new data type, 
VARBINARY_ENCODED, which has no restriction on being used as a composite 
primary key prefix or as a DESC ordered column.
   
   This means we need to reliably determine where the variable-length binary 
data terminates in the absence of fixed length information.
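
One well-known way to mark the end of arbitrary binary data is byte stuffing: escape every occurrence of the separator byte inside the value, and use a distinct two-byte sequence as the terminator. The sketch below is an assumption for illustration and is not taken from this description; the specific byte choices (\x00\xFF as the escaped null byte, \x00\x01 as the terminator) and the class name `EscapedBinarySketch` are hypothetical, and the design doc linked above remains the reference for the actual encoding.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of byte stuffing; byte choices are assumptions.
class EscapedBinarySketch {
    // Append one value: each 0x00 becomes 0x00 0xFF, then the terminator 0x00 0x01 follows.
    static void appendEncoded(ByteArrayOutputStream out, byte[] value) {
        for (byte b : value) {
            out.write(b);
            if (b == 0x00) {
                out.write(0xFF);
            }
        }
        out.write(0x00);
        out.write(0x01);
    }

    // Split a key made only of encoded values back into the original values.
    static List<byte[]> decodeAll(byte[] key) {
        List<byte[]> values = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        int i = 0;
        while (i < key.length) {
            byte b = key[i];
            if (b != 0x00) {
                current.write(b);
                i++;
                continue;
            }
            if (i + 1 >= key.length) {
                throw new IllegalArgumentException("truncated escape sequence");
            }
            byte next = key[i + 1];
            if (next == (byte) 0xFF) {
                current.write(0x00);              // escaped null byte inside the value
            } else if (next == 0x01) {
                values.add(current.toByteArray()); // terminator: value is complete
                current.reset();
            } else {
                throw new IllegalArgumentException("invalid escape sequence");
            }
            i += 2;
        }
        if (current.size() != 0) {
            throw new IllegalArgumentException("unterminated value");
        }
        return values;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        appendEncoded(out, new byte[] {0x01});
        appendEncoded(out, new byte[] {0x00, 0x02});
        System.out.println(decodeAll(out.toByteArray()).size() + " values recovered");
    }
}
```

With this kind of escaping, the pairs ({01}, {00 02}) and ({01 00}, {02}) no longer produce the same key, at the cost of one extra byte per null byte plus a two-byte terminator per value.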

