Hi, Jan, Thanks for the patch, we will look into integrate it into out code shortly and will let you know as soon as we are done with that.
John On 10/3/2009 5:53 PM, Jan Steemann wrote: > Hi John, > > I have played a little bit with the Fastbit 1.1.3 source code and have added > basic support for a variance aggregate function. The function is named VAR() > when invoked from queries. > > I have attached a diff with my changes to this email. > > Most of the changes have been done in src/colValues.cpp, where are the actual > values are calculated. > I also had to change the parser & lexer so they became aware of the new > function. I have also touched a few other files and added the new function > there. > > My built may have changed a few config or auto-generated build files > unintentionally, maybe there are also configuration issues because I > regenerated files with bison & flex. Please excuse any issues. > > Furthermore, my implementation for VAR() is definitely not optimal in terms > of code & performance. It should be considered alpha quality only. I have > neither tested it with edge cases, other column types, nor measured any > performance impacts it may have for any existing queries. > It was more or less intended as a try of how easy it would be to add > functionality to the SQL interface. > > I think it must be cleaned up & tested before it should actually be added to > the official source. However, I'd like to share the changeset if somebody > else on the list is interested and wants to take it further. > > Please feel free to get back to me in case you should have any further > questions. > > Best regards > Jan > > > -----Original Message----- > From: [email protected] > [mailto:[email protected]] On Behalf Of Jan Steemann > Sent: Saturday, October 03, 2009 11:12 PM > To: FastBit Users > Subject: Re: [FastBit-users] Fastbit aggregate functions? > > Hi John, > > thanks for getting back to me and for your suggestions. > > A few comments back: > > 1. DISTINCT: I found the COUNT(*) results at the end of the results rows, > however, I was particularly interested in distinct values. > The application area is actually not that scientific. It's about analyzing > web log files and there finding out which and how many actions individuals > did. More precisely, it's about how many individuals did start specific > events. Some example data follows: > > individual_id,event_id,timestamp > 1,100,... /* ind 1 started evnt 100 */ > 1,101,... /* ind 1 started evnt 101 */ > 2,100,... /* ind 2 started evnt 100 */ > 2,101,... /* ind 2 started evnt 101 */ > 2,100,... /* ind 2 started evnt 100, repeated event */ > 1,102,... /* ind 1 started evnt 102 */ > 1,101,... /* ind 1 started evnt 101, repeated event */ > > Using the above data, it's easy to find out how many event starts there were > per individual (COUNT(*) GROUP BY individual_id). > With SQL, I can also do a COUNT(DISTINCT(event_id) GROUP BY individual_id) > along in the same query to find out how many unique events have been started > per individual. This is not possible with ibis. > The workaround for now is to group not only by individual_id but also by > event_id. I can then check whether COUNT(*) is bigger than 1 or not. However, > grouping not only by individual_id but also by event_id would increase the > result set size by a factor of 100 to 1000 in my case. > > > 2. You are right, it's a nice-to-have feature from my point of view as well. > STD() and VAR() can be replaced by calculating the values using a few > separate queries and putting the values together afterwards. Though I'd think > builtin support for these functions would outperform any workaround solutions > a great deal. > > > 3. I agree, and again, I can get to the same end result by issuing separate > queries and putting the results together afterwards. > > > 4. thanks for the suggestion. So far I only used the ibis command line and > did not my write my own front-end. I will try this in the next few days. > > > Other suggestions: > - I think the online documentation for IBIS doesn't mention there's something > like a LIMIT clause. At first, I didn't know why the result set were always > truncated to the first 10 rows only. I then looked into the source and there > I found that the LIMIT keyword is supported. This is nice but as far as I can > tell, it's nowhere mentioned in the online docs and it might save other > people time if it was put in there. > > - other nice aggregate functions on the same convenience level as STD() and > VAR() would be: SKEWNESS(), CURTOSIS() as they can be used as indicators for > the data distribution. > Like with STD() and VAR(), the results would be easy enough to query without > a specialized aggregate function, however, natively supporting these > functions might still produce results a lot faster than issuing 2 separate > queries and evaluating a longer query string for each tuple. > > - A MEDIAN() aggregate function would be absolutely great, however, I think > will be much harder to implement than the previous two. > > - A PERCENTILE() aggregate function would probably be more generic than > MEDIAN() and would be absolutely great for data distribution analysis. > > Best regards > Jan > > -----Original Message----- > From: [email protected] > [mailto:[email protected]] On Behalf Of K. John Wu > Sent: Friday, October 02, 2009 8:05 PM > To: FastBit Users > Subject: Re: [FastBit-users] Fastbit aggregate functions? > > Dear Jan, > > Thanks for your interested in our work. We appreciate your > suggestions and will put them on our list of things to do. > Unfortunately, some of the items that takes a lot of programming > effort might take a long time to come about. > > Regards, > > John > > > On 10/1/2009 11:45 PM, Jan Steemann wrote: >> Hi, >> >> I have been looking into Fastbit's support of aggregate functions and I have >> a few questions on that (or probably they turn out to be all feature >> requests): >> >> 1. is there currently any way to count the number of distinct values in a >> group? >> In some SQL products, I'd issue something like >> SELECT event_id, /* id of the event */ >> COUNT(*) /* how many times did the event occur */, >> COUNT(DISTINCT(item_id)) /* how many distinct items where affected by >> the event */ >> FROM table >> GROUP BY event_id; >> >> I have peeked at the ibis source code and did not find any equivalent for a >> COUNT(DISTINCT()) or a DISTINCT. > > Currently, there are some support for count function, however, there > is some inconsistency in the display of the information. If you do > need to know the how many entries there are in a group, the easiest > thing to do is to look at the last column generated by the function > ibis::table::select, which is currently always COUNT(*). > > We don't current support the keyword DISTINCT. Because our target > applications do not currently use it, it might take a while for this > to be moved close to the top of our to-do list. With that said, we > are always looking for compelling applications. If you have a good > use-case, we might be able to make a case for moving it higher on our > to-do list. > >> >> 2. Native Fastbit support for more aggregate functions like STD(), VAR() >> would be great. > > Yes, it would be nice to have these functions natively supported. > This is mostly a convenience issue. > >> >> 3. Conditional logic for aggregate functions would be absolutely great to >> create cross-breaks. >> In some SQL products, this would look like >> SELECT event_id, SUM(IF(event_type = 1, 1, 0)), SUM(IF(event_type = 2, 1, >> 0)) >> FROM table >> GROUP BY event_id > > This can potentially be broken into a number of queries as follows > > select event_id, count(*) from table where event_type = 1; > select event_id, count(*) from table where event_type = 2; > select event_id, count(*) from table where event_type = 3; > ... > > >> >> 4. Finally, HAVING clauses operating on the results of grouped values would >> be a nice-to-have extension to Fastbit though not absolutely necessary (can >> be implemented outside of Fastbit). > > The HAVING clause is generally a shorthand for nested queries, most of > which can be implemented as nested queries as follows in FastBit. > > select store_name, sum(sales) from store_table group by store_name > having sum(sales) >1500; > > ibis::table* result1 = store_table.select("store_name, sum(sales) as > total", "sales > 0"); // need a dummy where clause here > ibis:;table* result2 = result1->select("store_name, total", "total > > 1500"); > > If you do have a chance to try this, please let us know if you > encounters any problems. > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > > > ------------------------------------------------------------------------ > > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
