[
https://issues.apache.org/jira/browse/PHOENIX-3559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796772#comment-15796772
]
James Taylor commented on PHOENIX-3559:
---------------------------------------
The encoding scheme isn't optimized for sparse storage. The idea would be to
use it when your storage is dense. You could potentially use the column
encoding scheme but still use multiple key values, which would be a good
choice for sparse data. You'd also want to use realistic column names for a
test like this (instead of c1, c2, c3), as that's where you'd get some space
savings. It'd be good to determine where the break-even point is in terms of
sparseness.

We could potentially improve our new storage format for sparse storage, but I'm
not sure we'll find one format that's optimal for both dense and sparse
storage. Enabling new storage formats to be defined will be valuable for this
reason.
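
A minimal sketch of that hybrid option, assuming the COLUMN_ENCODED_BYTES and
IMMUTABLE_STORAGE_SCHEME table properties from this encoding work (the table
and column names here are illustrative):
{noformat}
-- Keep encoded (numeric) column qualifiers, but retain one KeyValue per
-- column rather than packing all columns into a single cell, so sparse rows
-- only pay for the columns they actually populate.
CREATE TABLE METRICS (
    K1 INTEGER NOT NULL,
    K2 INTEGER NOT NULL,
    HOST_NAME VARCHAR,
    RESPONSE_TIME VARCHAR
    CONSTRAINT PK PRIMARY KEY (K1, K2))
    IMMUTABLE_ROWS = true,
    COLUMN_ENCODED_BYTES = 2,
    IMMUTABLE_STORAGE_SCHEME = ONE_CELL_PER_COLUMN
{noformat}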
> More disk space used with encoded column scheme with data in sparse columns
> ---------------------------------------------------------------------------
>
> Key: PHOENIX-3559
> URL: https://issues.apache.org/jira/browse/PHOENIX-3559
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: Mujtaba Chohan
> Assignee: Samarth Jain
> Fix For: 4.10.0
>
>
> Schema with 5K columns
> {noformat}
> create table t (k1 integer not null, k2 integer not null, c1 varchar ... c5000 varchar
> CONSTRAINT PK PRIMARY KEY (K1, K2))
> VERSIONS=1, MULTI_TENANT=true, IMMUTABLE_ROWS=true
> {noformat}
> In this schema, only 100 randomly chosen columns per row are filled with
> random 15-character values; the rest are null.
> Data size is *6X* larger with the encoded column scheme compared to
> non-encoded: 12GB/1M rows encoded vs. ~2GB/1M rows non-encoded.
> When compressed with GZ, the size with the encoded column scheme is still
> 35% higher.