[ 
https://issues.apache.org/jira/browse/PHOENIX-3559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802372#comment-15802372
 ] 

Mujtaba Chohan commented on PHOENIX-3559:
-----------------------------------------

Sure [~jamestaylor], agreed. I see this is not optimized for sparse columns, but 
for one of our internal use cases, which is based on a schema driven by customers, 
encoded columns could potentially be used this way, so at least it's good to 
know the limits and the break-even point.
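
For anyone wanting to reproduce the comparison on 4.10, here is a minimal sketch 
of the two schemas side by side (table names are just for illustration; if I read 
the column mapping work correctly, COLUMN_ENCODED_BYTES=0 disables encoding, and 
the encoded table needs at least 2 bytes since 1-byte mapping tops out at 255 
columns; I also left MULTI_TENANT off here since the tenant column has to be 
VARCHAR/CHAR):

{noformat}
-- encoded column scheme (2-byte column mapping to cover 5000 columns)
create table sparse_encoded (k1 integer not null, k2 integer not null, 
c1 varchar ... c5000 varchar 
CONSTRAINT PK PRIMARY KEY (K1, K2)) 
COLUMN_ENCODED_BYTES=2, VERSIONS=1, IMMUTABLE_ROWS=true

-- non-encoded baseline
create table sparse_plain (k1 integer not null, k2 integer not null, 
c1 varchar ... c5000 varchar 
CONSTRAINT PK PRIMARY KEY (K1, K2)) 
COLUMN_ENCODED_BYTES=0, VERSIONS=1, IMMUTABLE_ROWS=true
{noformat}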

I also tested with slightly longer column names, column_1 ... column_5000, and 
the comparative data sizes were the same, which might be due to the FAST_DIFF 
encoding that we have on by default.
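
The encoding in effect is easy to confirm from the HBase shell, and flipping it 
off would isolate how much of the size parity comes from FAST_DIFF (a sketch; 
'0' is Phoenix's default column family and the table name is the hypothetical 
one from above):

{noformat}
hbase> describe 'SPARSE_ENCODED'
  {NAME => '0', DATA_BLOCK_ENCODING => 'FAST_DIFF', ...}
hbase> alter 'SPARSE_ENCODED', {NAME => '0', DATA_BLOCK_ENCODING => 'NONE'}
hbase> major_compact 'SPARSE_ENCODED'
{noformat}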

Thanks [[email protected]] for those data points.
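
For anyone reproducing the numbers below, one way to compare on-disk sizes is to 
flush and major-compact first so the HFiles are settled, then check HDFS usage 
(a sketch; paths assume the default HBase namespace and the hypothetical table 
names above):

{noformat}
echo "flush 'SPARSE_ENCODED'" | hbase shell
echo "major_compact 'SPARSE_ENCODED'" | hbase shell
hdfs dfs -du -h /hbase/data/default/SPARSE_ENCODED
hdfs dfs -du -h /hbase/data/default/SPARSE_PLAIN
{noformat}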

> More disk space used with encoded column scheme with data in sparse columns
> ---------------------------------------------------------------------------
>
>                 Key: PHOENIX-3559
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3559
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Mujtaba Chohan
>            Assignee: Samarth Jain
>             Fix For: 4.10.0
>
>
> Schema with 5K columns
> {noformat}
> create table t (k1 integer not null, k2 integer not null, c1 varchar ... c5000 varchar 
> CONSTRAINT PK PRIMARY KEY (K1, K2)) 
> VERSIONS=1, MULTI_TENANT=true, IMMUTABLE_ROWS=true
> {noformat}
> In this schema, only 100 random columns are populated with random 15-character 
> values. The rest are null.
> Data size is *6X* larger with the encoded column scheme compared to the 
> non-encoded scheme: 12GB/1M rows encoded vs. ~2GB/1M rows non-encoded.
> When compressed with GZ, the size with the encoded column scheme is still 35% higher.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
