Hi Victoria, Thanks for the case sharing! Below are just some my cents:
Solution A is okay in this case, as the cardinality of "ab..." for each "H"/"M"/"L" is acceptable for a dictionary; But as we know dictionary isn't suitable for ultra high cardinality (default cap is 5million); this solution may have bottleneck when data increases; Solution B looks better in my opinion, it worth a try. You can even combine more dimensions like 4 or 5 into 1; Solution C is similar as B; as bitwise operation isn't supported, you need convert that to a value set and then use IN clause in app level, then it will be almost the same as B, because Kylin will convert "LIKE" to "IN" with the values in dictionary; 2016-06-20 19:45 GMT+08:00 Victoria Tskhay <[email protected]> : > Hello, > > Thank you for the quick reply! Good to know. > > As for the suggestions: > > 1) use derived columns -> unfortunately the logic of derived dimensions > doesn't match our usecase > 2) use multiple cubes -> as well, unfortunately we need to have all > dimensions interconnected in one cube > > We are currently evaluating the following approaches to deal with our > situation: > > *A.* Use *LIKE* in a naive way: > > E.g. we have 60 columns named 'a', 'b', 'c', etc. Each one can have one of > these three values: 'L', 'M', 'H' (+ NULL). So what we did is "rotate" the > whole thing in such a way that the values become column names and the > column names turn into values, all concatenated into one string(and sorted > before concat). Thus, the column names are 'L', 'M' and 'H' and values > would be like: 'acde', 'a', 'h', 'ace', 'dg', etc. > > Now, when we want to get a count of *WHERE a = 'M' OR b = 'L'**OR c = > 'L'*, we do *WHERE M LIKE '%a%' OR L LIKE '%b%'**OR L LIKE '%c%'*. > > It looks very expensive but surprisingly for our amount of data (2 billion > rows 6 million out of which have any values in those 60 columns) currently > it works not so bad (response time p99 < 1s). > > *B.* *Group several *dimensions (e.g. 3)*into one* and use something like > a *bitmap*. > > Again, e.g. we have 60 columns named 'a', 'b', 'c', etc. Each one can have > one of these three values: 'L', 'M', 'H' (+ NULL). So we wanna group them > into, say, columns named 'abc', 'def', 'ghi', etc. Then the values would be > like 'L__', 'LM_', '_HL', 'HLM', etc. (The underscore means NULL.) Thus, we > can imitate *WHERE a = 'M' OR b = 'L'**OR c = 'L'* with *WHERE abc LIKE > 'M__' OR abc LIKE '_L_'**OR c LIKE '__L'*. Presumably such LIKE expressions > with single-character wildcards will be easier for Kylin to handle than > those with '%'. We haven't tried that yet though. > > *C.* Similar to the previous one but *replacing values with integers* and > applying *bitwise operations with bitmasks*. E.g. L would be "encoded" with > 01, M - 10, H - 11. Then a string 'HL_' would look like 110100 -> 52. Then > when we want to get an 'L' in the middle ('_L_'), we could do something > like *WHERE abc & 001100 = 01*. BUT it looks like bitwise operations are > not supported neither by ANSI SQL nor by Kylin. And if they were, we are > not sure it would be faster than a dictionary. > > Would you please provide us with some feedback on what you think about the > approaches described above and which one you think would actually work/be > the best? > > Thank you! > > > Best regards > Victoria > > > On 18.06.2016 04:07, ShaoFeng Shi wrote: > >> Almost true; You can think Kylin is 64 bit, in theory it supports up to 63 >> dimension in one cube; >> >> There is no plan to extend to 128 or more in near term I believe; Since in >> most of the cases the dimension number wouldn't exceed 20, 64 is already >> "redundant" and causing extra space; >> >> With so many dimensions, there must be room for optimization; You can try >> some ways like: >> 1) extract some columns to lookup tables, and create them as "derived" >> dimension in the cube; >> 2) or create multiple cubes, each serving a part of these columns; >> >> If you have other way, please also share with the community; Thanks; >> >> >> 2016-06-18 0:01 GMT+08:00 Victoria Tskhay < >> [email protected]>: >> >> Hello, >>> >>> It looks like the max number of dimensions in one cube is 62, is that >>> correct? >>> >>> We would like to add more than that. That may sound crazy, I know, but we >>> have a special case where all the dimensions have low cardinality (3) and >>> the data is very sparse. We already tried with 62 dimensions and it works >>> great. >>> >>> Is there any way to work around that limit? What would you suggest? Thank >>> you! >>> >>> >>> >>> Best regards >>> -- >>> Victoria Tskhay >>> >>> *Java Backend Developer*I glispa GmbH >>> >>> Sonnenburger Straße 73, 10437 Berlin, Germany >>> E [email protected] <// >>> e.mail.ru/compose/?mailto=mailto%[email protected]> >>> Skype: vikatskhay I www.glispa.com <http://www.glispa.com> >>> >>> Sitz Berlin, AG Charlottenburg HRB 114678B >>> >>> >> >> > -- > Victoria Tskhay > > *Java Backend Developer*I glispa GmbH > > Sonnenburger Straße 73, 10437 Berlin, Germany > E [email protected] <// > e.mail.ru/compose/?mailto=mailto%[email protected]> > Skype: vikatskhay I www.glispa.com <http://www.glispa.com> > > Sitz Berlin, AG Charlottenburg HRB 114678B > -- Best regards, Shaofeng Shi
