If you can figure out a good mapping from date/string to int/long, then the bitmap is a good solution. For example, a date maps to an integer very well.
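
A minimal sketch of that mapping idea, in Python rather than Kylin's Java and not taken from Kylin's code. The names `date_to_int`, `StringDict` and `exact_distinct_count` are hypothetical, and a plain Python integer used as a bit set stands in for a real bitmap such as RoaringBitmap. It assumes dates on or after 1970-01-01 and ids that fit in a signed 32-bit int.

```python
from datetime import date

EPOCH = date(1970, 1, 1)
INT_LIMIT = 2 ** 31  # assumption: ids must fit in a signed 32-bit int


def date_to_int(d: date) -> int:
    """Map a date to days since 1970-01-01; distinct dates give distinct ints."""
    return (d - EPOCH).days


class StringDict:
    """Hypothetical dictionary encoder: each new string gets the next unused id.

    Unlike a hash, two different strings can never share an id.
    """

    def __init__(self):
        self._ids = {}

    def encode(self, s: str) -> int:
        # Return the existing id, or assign the next one (the current dict size).
        return self._ids.setdefault(s, len(self._ids))


def exact_distinct_count(ids) -> int:
    """Count distinct ids by setting one bit per id in a bit set."""
    bits = 0
    for i in ids:
        if not 0 <= i < INT_LIMIT:
            raise ValueError(f"id {i} does not fit the int-only bitmap")
        bits |= 1 << i  # setting the same bit twice changes nothing
    return bin(bits).count("1")


# Usage: distinct activity dates, then distinct users identified by a string.
activity_dates = [date(2016, 1, 28), date(2016, 1, 29), date(2016, 1, 28)]
distinct_dates = exact_distinct_count(date_to_int(d) for d in activity_dates)

users = StringDict()
distinct_users = exact_distinct_count(
    users.encode(u) for u in ["alice", "bob", "alice"]
)
```

The date mapping needs no extra state, while the string dictionary has to be built and kept around, which is where the cost grows with cardinality.
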
Expect the community will have more contributions in this area.

On Friday, January 29, 2016, Yerui Sun <[email protected]> wrote:

> Thanks Shaofeng and Hongbin for your explanations.
>
> Abhilash, I'm the reporter and contributor of KYLIN-1186, and here is some
> thinking about the design:
>
> We indeed support only the Int type for now, and casting Long to Int may
> cause precision loss (Shaofeng removed the casting and I agreed with that).
> The main reason is that Int is enough for most cases.
>
> I thought about supporting all types, including String and Date, and the
> conclusion is that it is difficult. One solution is to store all the
> values, which appears too costly. Another solution is to find a *precise*
> projection from string to int, for example a dict (not a hash, because the
> projection may have conflicts).
> However, generating the dict is still difficult, especially when the
> cardinality is very high. I think KYLIN-1122 faces the same problem, so
> let's see what the solution is in KYLIN-1122; maybe we could borrow
> something.
>
> The reason for casting Long to Int is that the bitmap is based on
> RoaringBitmap, which is maintained by lemire ([email protected]) and supports
> only Integer. Expanding it to Long is kind of complicated, so I skipped
> that for now.
>
> Overall, this feature just fits the common use case and absolutely has
> room for improvement. Please let me know if you have any ideas; any
> comment is welcome.
>
> > On January 28, 2016, at 22:33, ShaoFeng Shi <[email protected]> wrote:
> >
> > I removed the code for the long type in BitmapCounter, as the casting
> > will get things wrong (but the target is to provide an accurate value);
> > @Yerui, for your awareness; once we find a solution for long, we can
> > add it back.
> >
> > 2016-01-28 22:13 GMT+08:00 ShaoFeng Shi <[email protected]>:
> >
> >> What's the cardinality of the dimension whose distinct values you want
> >> to count? Integer's range is enough for most cases; if your case is
> >> within this scope, you can try the bitmap with integer, but you need to
> >> map the value to a unique id and use that within the bitmap. For
> >> example, if you want to count distinct users, use the numeric user_id
> >> instead of the email address. As for supporting other data types, as
> >> Hongbin mentioned, the storage cost is very high, so we don't have that
> >> plan.
> >>
> >> 2016-01-28 20:54 GMT+08:00 hongbin ma <[email protected]>:
> >>
> >>> KYLIN-1186 <https://issues.apache.org/jira/browse/KYLIN-1186> is not a
> >>> mature feature yet and it only supports integer.
> >>> We don't yet have plans to support any other forms of precise distinct
> >>> count, as it is too expensive to pre-calculate.
> >>>
> >>> On Thu, Jan 28, 2016 at 6:56 PM, Abhilash L L <[email protected]>
> >>> wrote:
> >>>
> >>>> Thanks ShaoFeng Shi,
> >>>>
> >>>> We might need this for other data types as well:
> >>>>
> >>>> date & string
> >>>>
> >>>> (e.g., distinct count of dates of a certain activity)
> >>>>
> >>>> So in the REST call, instead of hllc, the return type should be bitmap
> >>>> for int, tinyint, etc.?
> >>>>
> >>>> And we still send it as hllc for other data types?
> >>>>
> >>>> Also, one of the comments said we cast long to int. Won't we be
> >>>> losing data due to truncation?
> >>>>
> >>>> Regards,
> >>>> Abhilash
> >>>>
> >>>> On Thu, Jan 28, 2016 at 3:43 PM, ShaoFeng Shi <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Does this match your case?
> >>>>> https://issues.apache.org/jira/browse/KYLIN-1186
> >>>>>
> >>>>> 2016-01-28 17:42 GMT+08:00 Abhilash L L <[email protected]>:
> >>>>>
> >>>>>> +user ml
> >>>>>>
> >>>>>> Regards,
> >>>>>> Abhilash
> >>>>>>
> >>>>>> On Thu, Jan 28, 2016 at 11:32 AM, Abhilash L L <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> Is there a way to ask Kylin to get an exact distinct count? From
> >>>>>>> what we understand, we can choose between hllc(10) to hllc(16).
> >>>>>>>
> >>>>>>> I understand that for every cuboid, you will need to go through
> >>>>>>> the whole data set again, but with the new cubing algo (2.x
> >>>>>>> branch) it should be simpler to add?
> >>>>>>>
> >>>>>>> If it is currently not present, are there any plans to introduce
> >>>>>>> this?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Abhilash
> >>>>>
> >>>>> --
> >>>>> Best regards,
> >>>>>
> >>>>> Shaofeng Shi
> >>>
> >>> --
> >>> Regards,
> >>>
> >>> *Bin Mahone | 马洪宾*
> >>> Apache Kylin: http://kylin.io
> >>> Github: https://github.com/binmahone
> >>
> >> --
> >> Best regards,
> >>
> >> Shaofeng Shi
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi
