[
https://issues.apache.org/jira/browse/PHOENIX-3559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796772#comment-15796772
]
James Taylor commented on PHOENIX-3559:
---------------------------------------
The encoding scheme isn't optimized for sparse storage. The idea would be to
use it when your storage is dense. You could potentially use the column
encoding scheme but still use multiple key values, which would be a good
choice for sparse data. You'd also want to use realistic column names for a
test like this (instead of c1, c2, c3), as that's where you'd get some space
savings. It'd be good to determine where the break-even point is in terms of
sparseness.

We could potentially improve our new storage format for sparse storage, but I'm
not sure we'll find one format that's optimal for both dense and sparse
storage. Enabling new storage formats to be defined will be valuable for this
reason.
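
A minimal sketch of that hybrid option, assuming the COLUMN_ENCODED_BYTES and
IMMUTABLE_STORAGE_SCHEME table properties from this encoding work (the table
and column names here are illustrative):
{noformat}
-- Keep encoded (numeric) column qualifiers, but retain one KeyValue per
-- column rather than packing all columns into a single cell, so sparse rows
-- only pay for the columns they actually populate.
CREATE TABLE METRICS (
    K1 INTEGER NOT NULL,
    K2 INTEGER NOT NULL,
    HOST_NAME VARCHAR,
    RESPONSE_TIME VARCHAR
    CONSTRAINT PK PRIMARY KEY (K1, K2))
    IMMUTABLE_ROWS = true,
    COLUMN_ENCODED_BYTES = 2,
    IMMUTABLE_STORAGE_SCHEME = ONE_CELL_PER_COLUMN
{noformat}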
> More disk space used with encoded column scheme with data in sparse columns
> ---------------------------------------------------------------------------
>
> Key: PHOENIX-3559
> URL: https://issues.apache.org/jira/browse/PHOENIX-3559
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: Mujtaba Chohan
> Assignee: Samarth Jain
> Fix For: 4.10.0
>
>
> Schema with 5K columns
> {noformat}
> create table t (k1 integer not null, k2 integer not null, c1 varchar ... c5000 varchar
> CONSTRAINT PK PRIMARY KEY (K1, K2))
> VERSIONS=1, MULTI_TENANT=true, IMMUTABLE_ROWS=true
> {noformat}
> In this schema, only 100 randomly chosen columns per row are filled with
> random 15-character values; the rest are null.
> Data size is *6X* larger with the encoded column scheme compared to
> non-encoded: 12GB/1M rows encoded vs. ~2GB/1M rows non-encoded.
> When compressed with GZ, the size with the encoded column scheme is still
> 35% higher.