[
https://issues.apache.org/jira/browse/PHOENIX-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15800011#comment-15800011
]
Enis Soztutar commented on PHOENIX-2565:
----------------------------------------
Thanks, I was checking the code in the branch before 01ef5d.
Here:
https://github.com/apache/phoenix/blob/encodecolumns2/phoenix-core/src/main/java/org/apache/phoenix/schema/types/PArrayDataType.java#L1299
we are still serializing the nulls, no?
bq. For example, if column 1 is set and column 102 is set, we're storing
offsets for column2 through column 101. We could instead introduce a bit set
that tracks if a value is set
For nulls, Avro uses a union of the type with the null type, so every
nullable field is encoded as {{<is_null:1 byte><type_data:0 or more
bytes>}}. Avro therefore spends 1 byte per nullable field, regardless of
whether the field is set or not. PB has a different model: each encoded field
is prefixed with the id of the field, and a field that is absent is treated as
null. So the cost is 1 varint per non-null field (as opposed to per field in
the schema). Which one is optimal depends on how many null fields the data
has on average.
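To make the comparison concrete, here is a rough sketch that counts only the null-tracking overhead of the two models for one row. The class and method names are made up for illustration. The PB side assumes 1-based field numbers and wire type 2 (length-delimited) for every field, so it is an estimate rather than what either library would actually emit.

```java
import java.util.Arrays;

final class NullOverhead {
    /** Bytes needed to varint-encode a non-negative int (7 payload bits per byte). */
    static int varintSize(int value) {
        int size = 1;
        while ((value >>>= 7) != 0) {
            size++;
        }
        return size;
    }

    /** Avro-style: one marker byte per declared nullable field, set or not. */
    static int avroStyle(boolean[] isNull) {
        return isNull.length;
    }

    /** PB-style: one tag varint per non-null field; absent fields cost nothing. */
    static int pbStyle(boolean[] isNull) {
        int bytes = 0;
        for (int i = 0; i < isNull.length; i++) {
            if (!isNull[i]) {
                // Assumption: field number is i + 1 and wire type is 2.
                int tag = ((i + 1) << 3) | 2;
                bytes += varintSize(tag);
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        // The case from the quote: 102 columns, only column 1 and column 102 set.
        boolean[] sparse = new boolean[102];
        Arrays.fill(sparse, true);
        sparse[0] = false;
        sparse[101] = false;
        System.out.println("avro-style overhead: " + avroStyle(sparse));
        System.out.println("pb-style overhead:   " + pbStyle(sparse));
    }
}
```

For the sparse row the PB-style overhead is much smaller. For a row with no nulls the two are about the same, and PB gets slightly worse once field numbers pass 15, because the tag then needs a second byte.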
The cost of a bitset for nullability would be 1 byte per 8 "declared"
fields (regardless of whether any of them is null). Each null field saves 2
or 4 bytes (for the offset). With 2-byte offsets, the bitset for 16 columns
costs 2 bytes, which a single null pays back. So if on average we expect the
data to have at least 1 null per 16 columns or so, it looks like a good idea
to implement this.
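The break-even arithmetic can be checked with a few lines. This is only a sketch with hypothetical names. It assumes the offset of a null column is dropped entirely once the bitset marks it as unset, and that the bitset is rounded up to whole bytes.

```java
final class BitsetBreakEven {
    /** Bitset size: 1 bit per declared column, rounded up to whole bytes. */
    static int bitsetBytes(int declaredColumns) {
        return (declaredColumns + 7) / 8;
    }

    /**
     * Net bytes saved per row: offsets no longer stored for null columns,
     * minus the fixed cost of the bitset. Negative means the bitset costs more.
     */
    static int netSavings(int declaredColumns, int nullColumns, int offsetBytes) {
        return nullColumns * offsetBytes - bitsetBytes(declaredColumns);
    }

    public static void main(String[] args) {
        // 16 columns, 1 null: break-even with 2-byte offsets, a win with 4-byte ones.
        System.out.println(netSavings(16, 1, 2));
        System.out.println(netSavings(16, 1, 4));
        // The quoted case: 102 columns with only 2 set, i.e. 100 nulls.
        System.out.println(netSavings(102, 100, 2));
        // No nulls at all: the bitset is pure overhead.
        System.out.println(netSavings(16, 0, 2));
    }
}
```

With 4-byte offsets the break-even point moves to roughly 1 null per 32 columns, so the wider the offsets, the easier the bitset is to justify.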
> Store data for immutable tables in single KeyValue
> --------------------------------------------------
>
> Key: PHOENIX-2565
> URL: https://issues.apache.org/jira/browse/PHOENIX-2565
> Project: Phoenix
> Issue Type: Improvement
> Reporter: James Taylor
> Assignee: Thomas D'Silva
> Attachments: PHOENIX-2565-v2.patch, PHOENIX-2565-wip.patch,
> PHOENIX-2565.patch
>
>
> Since an immutable table (i.e. declared with IMMUTABLE_ROWS=true) will never
> update a column value, it'd be more efficient to store all column values for
> a row in a single KeyValue. We could use the existing format we have for
> variable length arrays.
> For backward compatibility, we'd need to support the current mechanism. Also,
> you'd no longer be allowed to transition an existing table to/from being
> immutable. I think the best approach would be to introduce a new IMMUTABLE
> keyword and use it like this:
> {code}
> CREATE IMMUTABLE TABLE ...
> {code}