[ https://issues.apache.org/jira/browse/KYLIN-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shaofeng SHI updated KYLIN-2243:
--------------------------------
    Description: 
TopNCounterSerializer.maxLength() and TopNCounterSerializer.getStorageBytesEstimate() might be inaccurate, especially when a TopN measure has multiple "group by" columns and some of them use a long-byte encoding such as "fixed_length:16".

The inaccurate estimation may cause memory issues when using in-mem cubing, and it also makes the estimation of the final cube size inaccurate.

The root cause is that a data type like "top(100)" carries no information about how long a key can be. So far a default value of 4 is used, which is too small when the encoding is something like "fixed_length:16". The solution is to extend the data type expression to "top(100, 16)", indicating that one key can be up to 16 bytes long. If the "scale" is absent, use 4 bytes as the default.

  was:
TopNCounterSerializer.maxLength() and TopNCounterSerializer.getStorageBytesEstimate() might be inaccurate, especially when a TopN measure has multiple "group by" columns and some of them use a long-byte encoding such as "fixed_length:16".

The inaccurate estimation may cause memory issues when using in-mem cubing, and it also makes the estimation of the final cube size inaccurate.

The root cause is that a data type like "top(100)" carries no information about how long a key can be. So far a default value of 4 is used, which is too small when the encoding is something like "fixed_length:16". The solution is to extend the data type expression to "top(100, 16)", indicating that one key can be up to 16 bytes long. If the "scale" is absent, use 6 bytes as the default.
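To illustrate the proposed fix, here is a minimal, hypothetical sketch (not Kylin's actual TopNCounterSerializer code) of how a "top(100, 16)" type string could be parsed so that storage estimates use the declared key length instead of a hard-coded 4 bytes; the class name, field names, and the per-counter overhead of 8 bytes for the count are assumptions for illustration only:

```java
// Hypothetical sketch, not Kylin's real implementation.
// Parses "top(N)" or "top(N, keyLength)" and derives a rough
// per-measure storage estimate from the declared key length.
public class TopNType {
    final int topN;       // number of entries kept, e.g. 100
    final int keyLength;  // max key bytes; defaults to 4 when absent

    TopNType(String spec) {
        // Extract the argument list between the parentheses,
        // e.g. "100, 16" from "top(100, 16)".
        String args = spec.substring(spec.indexOf('(') + 1, spec.lastIndexOf(')'));
        String[] parts = args.split(",");
        this.topN = Integer.parseInt(parts[0].trim());
        // The optional second argument ("scale") is the key length;
        // fall back to the 4-byte default when it is omitted.
        this.keyLength = parts.length > 1 ? Integer.parseInt(parts[1].trim()) : 4;
    }

    // Rough estimate: each entry stores its key bytes plus an
    // assumed 8-byte double counter.
    int storageBytesEstimate() {
        return topN * (keyLength + 8);
    }

    public static void main(String[] args) {
        TopNType plain = new TopNType("top(100)");
        TopNType wide = new TopNType("top(100, 16)");
        System.out.println(plain.keyLength);             // 4 (default)
        System.out.println(wide.keyLength);              // 16
        System.out.println(wide.storageBytesEstimate()); // 2400
    }
}
```

With a "fixed_length:16" encoding, the default of 4 would undersize the estimate by a factor of roughly the key-length ratio, which is why the type expression needs to carry the length explicitly.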
> TopN memory estimation is inaccurate in some cases
> --------------------------------------------------
>
>                 Key: KYLIN-2243
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2243
>             Project: Kylin
>          Issue Type: Bug
>            Reporter: Shaofeng SHI
>            Assignee: Shaofeng SHI
>             Fix For: v2.0.0
>

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)