That’s great. I’ll work on the bitmap counter implementation and cube building this week, and merge your changes to the query logic next week.
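
To make the idea concrete, here is a rough sketch of the approach discussed in the thread below: one abstract counter type shared by all count distinct implementations, and a factory that looks up the implementation by funcName plus returnType. This is not the actual Kylin code. CountDistinctCounter and BitmapCounter are the names proposed in the thread; CounterFactory, HashSetCounter, the return-type strings "bitmap" and "set", and the use of java.util.BitSet in place of a real compressed bitmap are all assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

class CountDistinctSketch {

    /** Common parent type, so one aggregation function can accept any counter. */
    abstract static class CountDistinctCounter {
        abstract void add(int value);
        abstract void merge(CountDistinctCounter other);
        abstract long getCount();
    }

    /** Precise counter; java.util.BitSet stands in for a real compressed bitmap. */
    static class BitmapCounter extends CountDistinctCounter {
        private final BitSet bits = new BitSet();

        @Override
        void add(int value) {
            bits.set(value); // BitSet only accepts non-negative indexes
        }

        @Override
        void merge(CountDistinctCounter other) {
            if (!(other instanceof BitmapCounter)) {
                throw new IllegalArgumentException("cannot merge different counter types");
            }
            bits.or(((BitmapCounter) other).bits);
        }

        @Override
        long getCount() {
            return bits.cardinality();
        }
    }

    /** Hypothetical second implementation, only here to show dispatch by return type. */
    static class HashSetCounter extends CountDistinctCounter {
        private final Set<Integer> values = new HashSet<>();

        @Override
        void add(int value) {
            values.add(value);
        }

        @Override
        void merge(CountDistinctCounter other) {
            if (!(other instanceof HashSetCounter)) {
                throw new IllegalArgumentException("cannot merge different counter types");
            }
            values.addAll(((HashSetCounter) other).values);
        }

        @Override
        long getCount() {
            return values.size();
        }
    }

    /** Hypothetical registry keyed by funcName AND returnType, not funcName alone. */
    static class CounterFactory {
        private final Map<String, Supplier<CountDistinctCounter>> registry = new HashMap<>();

        void register(String funcName, String returnType, Supplier<CountDistinctCounter> supplier) {
            registry.put(funcName + "|" + returnType, supplier);
        }

        CountDistinctCounter create(String funcName, String returnType) {
            Supplier<CountDistinctCounter> supplier = registry.get(funcName + "|" + returnType);
            if (supplier == null) {
                throw new IllegalArgumentException("no counter for " + funcName + "/" + returnType);
            }
            return supplier.get();
        }
    }

    /** Builds two partial counters, merges them, and returns the distinct count. */
    static long demo() {
        CounterFactory factory = new CounterFactory();
        // Same funcName, two return types: the case a funcName-only factory cannot express.
        factory.register("COUNT_DISTINCT", "bitmap", BitmapCounter::new);
        factory.register("COUNT_DISTINCT", "set", HashSetCounter::new);

        CountDistinctCounter left = factory.create("COUNT_DISTINCT", "bitmap");
        CountDistinctCounter right = factory.create("COUNT_DISTINCT", "bitmap");
        left.add(1);
        left.add(2);
        left.add(2); // duplicate, must not be counted twice
        right.add(2);
        right.add(3);
        left.merge(right);
        return left.getCount();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The aggregation function itself only sees CountDistinctCounter, so the query side can keep saying "count(distinct seller_id)" while the cube's measure definition decides which counter is created. Merging counters of different types is rejected here; whether that should be an error or a conversion is an open design question.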
> On Dec 17, 2015, at 17:48, Li Yang <liy...@apache.org> wrote:
>
> @Sun Yerui, I looked at the query parsing part again. It's possible to delay
> the aggregation mapping until after cube selection, and then the type info
> on the cube can supplement the mapping. It requires some refactoring
> effort, but won't affect the MeasureType interface. You can proceed with the
> implementation on your side while I work on this change.
>
> On Fri, Dec 11, 2015 at 11:36 AM, Li Yang <liy...@apache.org> wrote:
>
>> I can see the need from the user's perspective. Let me look again at the query
>> parsing logic and see if any tweak is possible.
>>
>> On Fri, Dec 11, 2015 at 7:59 AM, Luke Han <luke...@gmail.com> wrote:
>>
>>> It should be transparent to users; they should always use "count(distinct
>>> seller_id)".
>>>
>>> How about one setting value when the user picks "DistinctCount"? We already
>>> have the error range, so it should be easy to add one more option, say "Precise"
>>> (but yes, we would also have to display a warning message about its
>>> disadvantages). Then at the code level, it could be easy to handle as Yerui
>>> mentioned.
>>>
>>> Thanks.
>>>
>>> Best Regards!
>>> ---------------------
>>>
>>> Luke Han
>>>
>>> On Thu, Dec 10, 2015 at 7:33 PM, Yerui Sun <sunye...@gmail.com> wrote:
>>>
>>>> You’re right, I overlooked that the return type can’t be obtained from the
>>>> query context.
>>>>
>>>> I’m not familiar with Calcite UDFs. Do you mean a new SQL syntax like
>>>> “count (distinct_precise seller_id)”? That’s not transparent to the user, so
>>>> it doesn't seem the best way.
>>>>
>>>> Another way is to still map count distinct queries to one aggr func, and
>>>> make the func able to handle a variety of ValueTypes. For example, abstract
>>>> a count distinct measure type called ‘CountDistinctMeasureType’ as the parent
>>>> of HLLCMeasureType and BitmapMeasureType, and map all count distinct
>>>> queries to ‘CountDistinctAggFunc’, with the abstract class ‘CountDistinctCounter’
>>>> as the add() and merge() parameter type.
>>>> When this aggr func is called, the
>>>> processing depends on the value type, such as HLLCounter or BitmapCounter.
>>>>
>>>> I’m not sure whether I’ve described it clearly. Actually, I have implemented
>>>> bitmap count distinct in 1.x-staging this way, keeping hll count
>>>> distinct working. Maybe I could implement it in 2.x-staging with your
>>>> refactoring, and we could review the code later?
>>>>
>>>>> On Dec 10, 2015, at 18:23, Li Yang <liy...@apache.org> wrote:
>>>>>
>>>>> I've considered exactly the same point. It does not work when mapping a
>>>>> query to the aggregation functions. A query will simply say "count
>>>>> (distinct seller_id)", and won't mention any return type.
>>>>>
>>>>> The way out is adding a new aggregation for your count distinct using
>>>>> Calcite UDF, then it can be correctly mapped. I don't have an example yet,
>>>>> so we need to do some exploration here. Actually I hope to use your case as
>>>>> an example. :-)
>>>>>
>>>>> On Thu, Dec 10, 2015 at 4:25 PM, Yerui Sun <sunye...@gmail.com> wrote:
>>>>>
>>>>>> It’s really a great job, Yang!
>>>>>>
>>>>>> I have a question about the MeasureTypeFactory. In the current 2.x-staging
>>>>>> code, two built-in measure types (hll and topn) are registered, and the
>>>>>> factory creates the corresponding MeasureType only by funcName
>>>>>> (‘COUNT_DISTINCT’ for hll and ‘TOP_N’ for topn).
>>>>>> However, if I want to create a new measure type with the same funcName,
>>>>>> that’s impossible. For example, I want to create a bitmap measure with
>>>>>> funcName ‘COUNT_DISTINCT’, the same as the hll measure's funcName.
>>>>>>
>>>>>> One possible way is for the factory to create measure types based not only
>>>>>> on funcName but also on returnType, making it possible to map one funcName
>>>>>> to multiple measures.
>>>>>> In other words, we could define the measure type in the factory using
>>>>>> funcName and returnType, instead of only funcName as it is now.
>>>>>>
>>>>>> Do you think this makes sense? Looking forward to your comments.
>>>>>>
>>>>>>> On Dec 10, 2015, at 14:57, Li Yang <liy...@apache.org> wrote:
>>>>>>>
>>>>>>>> Would it be possible to create a How to guide on ability to add custom
>>>>>>>> aggregates into Kylin
>>>>>>>
>>>>>>> Definitely! I should spend some time on documentation in the following
>>>>>>> days. Many features have been added to 2.x. Aiming to release a 2.0 beta
>>>>>>> soon, it's time to work on the documentation. :-)
>>>>>>>
>>>>>>>> Where are the custom aggregates computed on the Kylin Service or on Hbase
>>>>>>>> CoProcessors?
>>>>>>>
>>>>>>> The aggregation takes place in MR during cube build, then in CoProcessor
>>>>>>> and query service during query. Originally I hoped users could add a new
>>>>>>> aggregation by just dropping in a jar file and some configuration. However,
>>>>>>> it turns out to be more than that due to CoProcessor... Anyway, it's a lot
>>>>>>> more friendly to developers now.
>>>>>>>
>>>>>>> On Thu, Dec 10, 2015 at 2:14 PM, hongbin ma <mahong...@apache.org> wrote:
>>>>>>>
>>>>>>>> hi seshu
>>>>>>>>
>>>>>>>> yang's work is more of a framework. it reduces a developer's effort if
>>>>>>>> he/she wants to add a new custom aggregation. Since some of the
>>>>>>>> aggregations happen in coprocessors, we cannot completely get rid of
>>>>>>>> re-compiling & re-deploying. If someone from the community is interested in
>>>>>>>> crafting a new aggregation, he/she can take a look at how HLL/TOPN
>>>>>>>> aggregation is implemented.
>>>>>>>>
>>>>>>>> On Wed, Dec 9, 2015 at 9:43 PM, Adunuthula, Seshu <sadunuth...@ebay.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yang,
>>>>>>>>>
>>>>>>>>> Would it be possible to create a How to guide on ability to add custom
>>>>>>>>> aggregates into Kylin. Javadocs are good, but to encourage community
>>>>>>>>> participation we should make it more easily consumable.
>>>>>>>>>
>>>>>>>>> Where are the custom aggregates computed on the Kylin Service or on Hbase
>>>>>>>>> CoProcessors?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Seshu Adunuthula.
>>>>>>>>>
>>>>>>>>> On 12/8/15, 6:18 AM, "Adunuthula, Seshu" <sadunuth...@ebay.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is awesome!
>>>>>>>>>>
>>>>>>>>>> On 12/8/15, 6:05 AM, "Shi, Shaofeng" <shao...@ebay.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is another important refactor since making the build/query engines
>>>>>>>>>>> pluggable. Thanks Yang!
>>>>>>>>>>>
>>>>>>>>>>> On 12/8/15, 5:47 PM, "Li Yang" <liy...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This is a bump of KYLIN-976 in case you are not yet aware...
>>>>>>>>>>>>
>>>>>>>>>>>> KYLIN-976 is a refactoring of how Kylin works with aggregation and aims
>>>>>>>>>>>> to allow adding custom aggregation types easily.
>>>>>>>>>>>>
>>>>>>>>>>>> Kylin started with basic support of SUM, COUNT, MAX, MIN, AVG (from sum
>>>>>>>>>>>> and count), and COUNT_DISTINCT (based on hyperloglog). Later TopN was
>>>>>>>>>>>> added in the 2.x branch. And the list is growing for sure. Xiaoyu is
>>>>>>>>>>>> working on storing raw records as a special type of measure (KYLIN-1122),
>>>>>>>>>>>> and Yerui is working on precise count distinct using bitmap (KYLIN-1186).
>>>>>>>>>>>>
>>>>>>>>>>>> The possibilities are unlimited. Implementing a domain-specific
>>>>>>>>>>>> aggregation is now quite easy. E.g. aggregate user events to detect time
>>>>>>>>>>>> series or access patterns. Or draw a sketch of certain user groups. Or
>>>>>>>>>>>> pre-calculate clusters of data points. Or histogram... Use your
>>>>>>>>>>>> imagination.
>>>>>>>>>>>>
>>>>>>>>>>>> Whoever is interested can peek at MeasureTypeFactory and MeasureType on
>>>>>>>>>>>> the 2.x branch.
>>>>>>>>>>>> The API may still change, but at the same time is stable enough
>>>>>>>>>>>> for pilots. The javadoc should get you started. HLLCMeasureType and
>>>>>>>>>>>> TopNMeasureType are two good examples.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Yang
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> *Bin Mahone | 马洪宾*
>>>>>>>> Apache Kylin: http://kylin.io
>>>>>>>> Github: https://github.com/binmahone