Thank you. That sounds very good. Best regards Jan
-----Original Message----- From: [email protected] [mailto:[email protected]] On Behalf Of K. John Wu Sent: Monday, October 12, 2009 1:00 AM To: FastBit Users Subject: Re: [FastBit-users] Fastbit aggregate functions? Hi, Jan, Thanks the updated patch. We are in the process of implementing a feature slated for the coming release. This patch shall be included. It might take a week or two for us to get all the necessary pieces together. If you have any question about your patch, we will get back to you. John On 10/10/2009 12:32 PM, Jan Steemann wrote: > Hi John, > > please find attached a patch that implements additional aggregate functions > for Fastbit. > The patch adds the following functions: > - stdpop: standard population deviation > - stdsamp: standard sample deviation > - varpop: standard population variance > - varsamp: standard sample variance > > These 4 functions share the same code as they are very similar: the variances > are just square roots of the deviations, and population and sample variance > only differ in terms of the denominator used. > > I have also added two other aggregate functions: > - median: calculate the center value of the group > - distinct: count the number of distinct elements in a group > > Both median and distinct are done by sorting the group values first and then > picking out the value/values in the middle of the sorted set (median) or by > iterating over them and incrementing a counter if the current value is > different to the last value found (distinct). > > So the distinct version present is actually like a count(distinct(column > name)) but I did not want to change the syntax too much and opted for a > simple approach without nested functions. > > None of the above functions is used for anything else than calculating > aggregates. They should have no effect on any other part of the query > processing. > > > The diff I attached affect the lexer and parser for the greatest part. I have > attached the yy and ll source files plus the files generated by Bison and > Flex. > As I could not think of any proper function names for the above functions > with just 3 characters length, I also removed the 3 character restriction for > aggregate function names in Fastbit. > > The logic for all of the functions above is contained in src/colValues.cpp. > For each column type, I have extended the switch statement in the reduce() > function. > > The median and distinct functions currently have extra memory overhead. This > is because they need to sort the group values first in order to work. As the > groups are passed into the function as consts, I have kind of worked around > by putting all values of a group into a vector, then sorting this vector, and > finally working on the sorted data in the vector. > This is probably ugly and definitely not as efficient as it could be. Please > note that I also relied on std::sort() for sorting and am not using any of > the already existing sort implementations such as ibis::colDouble::sort(). > The reason for this is that I was not sure what they could be used for > exactly. > That means there is still room for efficiency improvements for at least > median and distinct. > > > I have checked the functions results with ibis for columns of mostly type > integer. However, if I remember correctly then always > ibis::colDoubles::reduce() was called, and not ibis::colInts::reduce() for > example. This made sense to me because returning the standard deviation as an > integer value would lead to loss of precision. > Anyway, I haven't seen to other reduce() implementations being called so I am > not 100 % sure whether there are typecasting/rounding issues with the results. > > > Please feel free to do whatever you prefer with the patch. > Of course I would like to see the functionality of the patch be moved into > the official Fastbit, but I am not aware if it actually fits in your > development roadmap and goals. > Furthermore, I am not sure if there are any issues with the code that I am > not aware of or if I took a totally wrong approach with using the vectors. > > If you should have any questions/comments on the patch please let me know. > > Thank you and best regards > Jan > > > -----Original Message----- > From: [email protected] > [mailto:[email protected]] On Behalf Of Jan Steemann > Sent: Monday, October 05, 2009 5:40 PM > To: FastBit Users > Subject: Re: [FastBit-users] Fastbit aggregate functions? > > Hi John, > > yes, I will try to clean it up as much as I can and then resend it. > It will probably take a week or so. > > Best regards > Jan > > -----Original Message----- > From: [email protected] > [mailto:[email protected]] On Behalf Of K. John Wu > Sent: Monday, October 05, 2009 5:29 PM > To: FastBit Users > Subject: Re: [FastBit-users] Fastbit aggregate functions? > > Hi, Jan, > > Thanks for the offering to help. If you are willing to put in more > effort, we would really appreciate it. > > John > > > > On 10/5/2009 8:23 AM, Jan Steemann wrote: >> Hi John, >> >> thanks. >> I think my code still has some issues. I suggest I clean it up first and >> then resend the patch to save you unnecessary work. >> Is that ok? >> >> Best regards >> Jan >> >> >> -----Original Message----- >> From: [email protected] >> [mailto:[email protected]] On Behalf Of K. John Wu >> Sent: Monday, October 05, 2009 5:14 PM >> To: FastBit Users >> Subject: Re: [FastBit-users] Fastbit aggregate functions? >> >> Hi, Jan, >> >> Thanks for the patch, we will look into integrate it into out code >> shortly and will let you know as soon as we are done with that. >> >> John >> >> >> On 10/3/2009 5:53 PM, Jan Steemann wrote: >>> Hi John, >>> >>> I have played a little bit with the Fastbit 1.1.3 source code and have >>> added basic support for a variance aggregate function. The function is >>> named VAR() when invoked from queries. >>> >>> I have attached a diff with my changes to this email. >>> >>> Most of the changes have been done in src/colValues.cpp, where are the >>> actual values are calculated. >>> I also had to change the parser & lexer so they became aware of the new >>> function. I have also touched a few other files and added the new function >>> there. >>> >>> My built may have changed a few config or auto-generated build files >>> unintentionally, maybe there are also configuration issues because I >>> regenerated files with bison & flex. Please excuse any issues. >>> >>> Furthermore, my implementation for VAR() is definitely not optimal in terms >>> of code & performance. It should be considered alpha quality only. I have >>> neither tested it with edge cases, other column types, nor measured any >>> performance impacts it may have for any existing queries. >>> It was more or less intended as a try of how easy it would be to add >>> functionality to the SQL interface. >>> >>> I think it must be cleaned up & tested before it should actually be added >>> to the official source. However, I'd like to share the changeset if >>> somebody else on the list is interested and wants to take it further. >>> >>> Please feel free to get back to me in case you should have any further >>> questions. >>> >>> Best regards >>> Jan >>> >>> >>> -----Original Message----- >>> From: [email protected] >>> [mailto:[email protected]] On Behalf Of Jan Steemann >>> Sent: Saturday, October 03, 2009 11:12 PM >>> To: FastBit Users >>> Subject: Re: [FastBit-users] Fastbit aggregate functions? >>> >>> Hi John, >>> >>> thanks for getting back to me and for your suggestions. >>> >>> A few comments back: >>> >>> 1. DISTINCT: I found the COUNT(*) results at the end of the results rows, >>> however, I was particularly interested in distinct values. >>> The application area is actually not that scientific. It's about analyzing >>> web log files and there finding out which and how many actions individuals >>> did. More precisely, it's about how many individuals did start specific >>> events. Some example data follows: >>> >>> individual_id,event_id,timestamp >>> 1,100,... /* ind 1 started evnt 100 */ >>> 1,101,... /* ind 1 started evnt 101 */ >>> 2,100,... /* ind 2 started evnt 100 */ >>> 2,101,... /* ind 2 started evnt 101 */ >>> 2,100,... /* ind 2 started evnt 100, repeated event */ >>> 1,102,... /* ind 1 started evnt 102 */ >>> 1,101,... /* ind 1 started evnt 101, repeated event */ >>> >>> Using the above data, it's easy to find out how many event starts there >>> were per individual (COUNT(*) GROUP BY individual_id). >>> With SQL, I can also do a COUNT(DISTINCT(event_id) GROUP BY individual_id) >>> along in the same query to find out how many unique events have been >>> started per individual. This is not possible with ibis. >>> The workaround for now is to group not only by individual_id but also by >>> event_id. I can then check whether COUNT(*) is bigger than 1 or not. >>> However, grouping not only by individual_id but also by event_id would >>> increase the result set size by a factor of 100 to 1000 in my case. >>> >>> >>> 2. You are right, it's a nice-to-have feature from my point of view as >>> well. STD() and VAR() can be replaced by calculating the values using a few >>> separate queries and putting the values together afterwards. Though I'd >>> think builtin support for these functions would outperform any workaround >>> solutions a great deal. >>> >>> >>> 3. I agree, and again, I can get to the same end result by issuing separate >>> queries and putting the results together afterwards. >>> >>> >>> 4. thanks for the suggestion. So far I only used the ibis command line and >>> did not my write my own front-end. I will try this in the next few days. >>> >>> >>> Other suggestions: >>> - I think the online documentation for IBIS doesn't mention there's >>> something like a LIMIT clause. At first, I didn't know why the result set >>> were always truncated to the first 10 rows only. I then looked into the >>> source and there I found that the LIMIT keyword is supported. This is nice >>> but as far as I can tell, it's nowhere mentioned in the online docs and it >>> might save other people time if it was put in there. >>> >>> - other nice aggregate functions on the same convenience level as STD() and >>> VAR() would be: SKEWNESS(), CURTOSIS() as they can be used as indicators >>> for the data distribution. >>> Like with STD() and VAR(), the results would be easy enough to query >>> without a specialized aggregate function, however, natively supporting >>> these functions might still produce results a lot faster than issuing 2 >>> separate queries and evaluating a longer query string for each tuple. >>> >>> - A MEDIAN() aggregate function would be absolutely great, however, I think >>> will be much harder to implement than the previous two. >>> >>> - A PERCENTILE() aggregate function would probably be more generic than >>> MEDIAN() and would be absolutely great for data distribution analysis. >>> >>> Best regards >>> Jan >>> >>> -----Original Message----- >>> From: [email protected] >>> [mailto:[email protected]] On Behalf Of K. John Wu >>> Sent: Friday, October 02, 2009 8:05 PM >>> To: FastBit Users >>> Subject: Re: [FastBit-users] Fastbit aggregate functions? >>> >>> Dear Jan, >>> >>> Thanks for your interested in our work. We appreciate your >>> suggestions and will put them on our list of things to do. >>> Unfortunately, some of the items that takes a lot of programming >>> effort might take a long time to come about. >>> >>> Regards, >>> >>> John >>> >>> >>> On 10/1/2009 11:45 PM, Jan Steemann wrote: >>>> Hi, >>>> >>>> I have been looking into Fastbit's support of aggregate functions and I >>>> have a few questions on that (or probably they turn out to be all feature >>>> requests): >>>> >>>> 1. is there currently any way to count the number of distinct values in a >>>> group? >>>> In some SQL products, I'd issue something like >>>> SELECT event_id, /* id of the event */ >>>> COUNT(*) /* how many times did the event occur */, >>>> COUNT(DISTINCT(item_id)) /* how many distinct items where affected >>>> by the event */ >>>> FROM table >>>> GROUP BY event_id; >>>> >>>> I have peeked at the ibis source code and did not find any equivalent for >>>> a COUNT(DISTINCT()) or a DISTINCT. >>> Currently, there are some support for count function, however, there >>> is some inconsistency in the display of the information. If you do >>> need to know the how many entries there are in a group, the easiest >>> thing to do is to look at the last column generated by the function >>> ibis::table::select, which is currently always COUNT(*). >>> >>> We don't current support the keyword DISTINCT. Because our target >>> applications do not currently use it, it might take a while for this >>> to be moved close to the top of our to-do list. With that said, we >>> are always looking for compelling applications. If you have a good >>> use-case, we might be able to make a case for moving it higher on our >>> to-do list. >>> >>>> 2. Native Fastbit support for more aggregate functions like STD(), VAR() >>>> would be great. >>> Yes, it would be nice to have these functions natively supported. >>> This is mostly a convenience issue. >>> >>>> 3. Conditional logic for aggregate functions would be absolutely great to >>>> create cross-breaks. >>>> In some SQL products, this would look like >>>> SELECT event_id, SUM(IF(event_type = 1, 1, 0)), SUM(IF(event_type = 2, 1, >>>> 0)) >>>> FROM table >>>> GROUP BY event_id >>> This can potentially be broken into a number of queries as follows >>> >>> select event_id, count(*) from table where event_type = 1; >>> select event_id, count(*) from table where event_type = 2; >>> select event_id, count(*) from table where event_type = 3; >>> ... >>> >>> >>>> 4. Finally, HAVING clauses operating on the results of grouped values >>>> would be a nice-to-have extension to Fastbit though not absolutely >>>> necessary (can be implemented outside of Fastbit). >>> The HAVING clause is generally a shorthand for nested queries, most of >>> which can be implemented as nested queries as follows in FastBit. >>> >>> select store_name, sum(sales) from store_table group by store_name >>> having sum(sales) >1500; >>> >>> ibis::table* result1 = store_table.select("store_name, sum(sales) as >>> total", "sales > 0"); // need a dummy where clause here >>> ibis:;table* result2 = result1->select("store_name, total", "total > >>> 1500"); >>> >>> If you do have a chance to try this, please let us know if you >>> encounters any problems. >>> _______________________________________________ >>> FastBit-users mailing list >>> [email protected] >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>> _______________________________________________ >>> FastBit-users mailing list >>> [email protected] >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> FastBit-users mailing list >>> [email protected] >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >> _______________________________________________ >> FastBit-users mailing list >> [email protected] >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >> _______________________________________________ >> FastBit-users mailing list >> [email protected] >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > > > ------------------------------------------------------------------------ > > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
