If you can figure out a good mapping from date/string to int/long, then
the bitmap is a good solution. For example, a date maps to an integer very well.
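
A minimal sketch of that idea, for illustration only. It assumes dates are
mapped to their day ordinal (an exact, reversible integer) and strings get
ids from an in-memory dictionary. A plain Python set stands in for an integer
bitmap, and the helper names are hypothetical; none of this is Kylin's actual
implementation.

```python
from datetime import date


def date_to_int(d: date) -> int:
    """Map a date to an exact, reversible integer (its day ordinal)."""
    return d.toordinal()


class StringDictionary:
    """Hypothetical dictionary encoder: each new string gets the next free id."""

    def __init__(self):
        self._ids = {}

    def encode(self, value: str) -> int:
        # Unlike a hash, two different strings can never share an id.
        return self._ids.setdefault(value, len(self._ids))


def exact_distinct_count(int_ids) -> int:
    # A set of ints stands in for an integer bitmap here (assumption).
    seen = set()
    for i in int_ids:
        seen.add(i)
    return len(seen)


activity_dates = [date(2016, 1, 28), date(2016, 1, 29), date(2016, 1, 28)]
distinct_dates = exact_distinct_count(date_to_int(d) for d in activity_dates)

encoder = StringDictionary()
users = ["alice", "bob", "alice"]
distinct_users = exact_distinct_count(encoder.encode(u) for u in users)
```

The date mapping needs no extra state, which is why it fits so well. The
string dictionary has to be kept around and grows with the cardinality of the
column.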

I expect the community will contribute more in this area.


On Friday, January 29, 2016, Yerui Sun <[email protected]> wrote:

> Thanks Shaofeng and Hongbin for your explanations.
>
> Abhilash, I’m the reporter and contributor of KYLIN-1186, and here are some
> thoughts on the design:
>
> We indeed support only the Int type for now, and casting Long to Int may
> lose precision (Shaofeng removed the casting and I agree with that). The
> main reason is that Int is enough for most cases.
>
> I thought about supporting all types, including String and Date, and
> concluded that it is difficult. One solution is to store all the values,
> which appears too costly. Another is to find a *precise* mapping from
> string to int, for example a dictionary (not a hash, because hashed
> values may collide).
> However, generating the dictionary is still difficult, especially when the
> cardinality is very high. I think KYLIN-1122 faces the same problem, so
> let’s see what solution KYLIN-1122 arrives at; maybe we can borrow
> something.
>
> The reason for casting Long to Int is that the bitmap is based on
> RoaringBitmap, which is maintained by lemire ([email protected]) and
> supports only Integer. Extending it to Long is fairly complicated, so I
> skipped that for now.
>
> Overall, this feature fits only the common use case and definitely has
> room for improvement. Please let me know if you have any ideas; any
> comments are welcome.
>
>
>
> > On January 28, 2016, at 22:33, ShaoFeng Shi <[email protected]> wrote:
> >
> > I removed the code for the long type in BitmapCounter, as the casting
> > would give wrong results (while the goal is to provide an accurate value).
> > @Yerui, for your awareness: once we find a solution for long, we can add
> > it back.
> >
> > 2016-01-28 22:13 GMT+08:00 ShaoFeng Shi <[email protected]>:
> >
> >> What's the cardinality of the dimension whose distinct values you want
> >> to count? Integer's range is enough for most cases; if your case falls
> >> within that range, you can try the bitmap with integer, but you need to
> >> map the value to a unique id and use that within the bitmap. For example,
> >> if you want to count distinct users, use the numeric user_id instead of
> >> the email address. As for supporting other data types, as Hongbin
> >> mentioned, the storage cost is very high, so we don't have that plan.
> >>
> >>
> >>
> >>
> >>
> >> 2016-01-28 20:54 GMT+08:00 hongbin ma <[email protected]>:
> >>
> >>> KYLIN-1186 <https://issues.apache.org/jira/browse/KYLIN-1186> is not a
> >>> mature feature yet, and it only supports integer.
> >>> We don't yet have plans to support any other forms of precise distinct
> >>> count, as it is too expensive to pre-calculate.
> >>>
> >>> On Thu, Jan 28, 2016 at 6:56 PM, Abhilash L L <[email protected]>
> >>> wrote:
> >>>
> >>>> Thanks ShaoFeng Shi,
> >>>>
> >>>> We might need this for other data types as well:
> >>>>
> >>>> date & string
> >>>>
> >>>> (e.g., distinct count of dates of a certain activity)
> >>>>
> >>>> So in the REST call, instead of hllc, the return type should be bitmap
> >>>> for int, tinyint, etc.?
> >>>>
> >>>> And we still send it as hllc for other data types?
> >>>>
> >>>>
> >>>> Also, one of the comments said we cast long to int. Won't we be
> >>>> losing data due to truncation?
> >>>>
> >>>>
> >>>> Regards,
> >>>> Abhilash
> >>>>
> >>>> On Thu, Jan 28, 2016 at 3:43 PM, ShaoFeng Shi <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Does this match your case?
> >>>>> https://issues.apache.org/jira/browse/KYLIN-1186
> >>>>>
> >>>>> 2016-01-28 17:42 GMT+08:00 Abhilash L L <[email protected]>:
> >>>>>
> >>>>>> +user ml
> >>>>>>
> >>>>>> Regards,
> >>>>>> Abhilash
> >>>>>>
> >>>>>> On Thu, Jan 28, 2016 at 11:32 AM, Abhilash L L <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>>   Is there a way to ask Kylin for an exact distinct count? From what
> >>>>>>> we understand, we can choose between hllc(10) and hllc(16).
> >>>>>>>
> >>>>>>>   I understand that for every cuboid you would need to go through the
> >>>>>>> whole data set again, but with the new cubing algorithm (2.x branch),
> >>>>>>> shouldn't it be simpler to add?
> >>>>>>>
> >>>>>>>   If it is not currently present, are there any plans to introduce
> >>>>>>> this?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Abhilash
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Best regards,
> >>>>>
> >>>>> Shaofeng Shi
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Regards,
> >>>
> >>> *Bin Mahone | 马洪宾*
> >>> Apache Kylin: http://kylin.io
> >>> Github: https://github.com/binmahone
> >>>
> >>
> >>
> >>
> >> --
> >> Best regards,
> >>
> >> Shaofeng Shi
> >>
> >>
> >
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi
>
>
