Re: [FastBit-users] Fastbit aggregate functions?

K. John Wu Mon, 05 Oct 2009 08:14:34 -0700

Hi, Jan,

Thanks for the patch, we will look into integrate it into out code 
shortly and will let you know as soon as we are done with that.


John


On 10/3/2009 5:53 PM, Jan Steemann wrote:
> Hi John,
> 
> I have played a little bit with the Fastbit 1.1.3 source code and have added 
> basic support for a variance aggregate function. The function is named VAR() 
> when invoked from queries.
>  
> I have attached a diff with my changes to this email.
> 
> Most of the changes have been done in src/colValues.cpp, where are the actual 
> values are calculated.
> I also had to change the parser & lexer so they became aware of the new 
> function. I have also touched a few other files and added the new function 
> there.
> 
> My built may have changed a few config or auto-generated build files 
> unintentionally, maybe there are also configuration issues because I 
> regenerated files with bison & flex. Please excuse any issues.
> 
> Furthermore, my implementation for VAR() is definitely not optimal in terms 
> of code & performance. It should be considered alpha quality only. I have 
> neither tested it with edge cases, other column types, nor measured any 
> performance impacts it may have for any existing queries.
> It was more or less intended as a try of how easy it would be to add 
> functionality to the SQL interface.
> 
> I think it must be cleaned up & tested before it should actually be added to 
> the official source. However, I'd like to share the changeset if somebody 
> else on the list is interested and wants to take it further.
> 
> Please feel free to get back to me in case you should have any further 
> questions.
> 
> Best regards
> Jan
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Jan Steemann
> Sent: Saturday, October 03, 2009 11:12 PM
> To: FastBit Users
> Subject: Re: [FastBit-users] Fastbit aggregate functions?
> 
> Hi John,
> 
> thanks for getting back to me and for your suggestions.
> 
> A few comments back:
> 
> 1. DISTINCT: I found the COUNT(*) results at the end of the results rows, 
> however, I was particularly interested in distinct values. 
> The application area is actually not that scientific. It's about analyzing 
> web log files and there finding out which and how many actions individuals 
> did. More precisely, it's about how many individuals did start specific 
> events. Some example data follows:
> 
> individual_id,event_id,timestamp
> 1,100,... /* ind 1 started evnt 100 */
> 1,101,... /* ind 1 started evnt 101 */
> 2,100,... /* ind 2 started evnt 100 */
> 2,101,... /* ind 2 started evnt 101 */
> 2,100,... /* ind 2 started evnt 100, repeated event */
> 1,102,... /* ind 1 started evnt 102 */
> 1,101,... /* ind 1 started evnt 101, repeated event */
> 
> Using the above data, it's easy to find out how many event starts there were 
> per individual (COUNT(*) GROUP BY individual_id). 
> With SQL, I can also do a COUNT(DISTINCT(event_id) GROUP BY individual_id) 
> along in the same query to find out how many unique events have been started 
> per individual. This is not possible with ibis.
> The workaround for now is to group not only by individual_id but also by 
> event_id. I can then check whether COUNT(*) is bigger than 1 or not. However, 
> grouping not only by individual_id but also by event_id would increase the 
> result set size by a factor of 100 to 1000 in my case.
> 
> 
> 2. You are right, it's a nice-to-have feature from my point of view as well. 
> STD() and VAR() can be replaced by calculating the values using a few 
> separate queries and putting the values together afterwards. Though I'd think 
> builtin support for these functions would outperform any workaround solutions 
> a great deal. 
> 
> 
> 3. I agree, and again, I can get to the same end result by issuing separate 
> queries and putting the results together afterwards.
> 
> 
> 4. thanks for the suggestion. So far I only used the ibis command line and 
> did not my write my own front-end. I will try this in the next few days. 
> 
> 
> Other suggestions:
> - I think the online documentation for IBIS doesn't mention there's something 
> like a LIMIT clause. At first, I didn't know why the result set were always 
> truncated to the first 10 rows only. I then looked into the source and there 
> I found that the LIMIT keyword is supported. This is nice but as far as I can 
> tell, it's nowhere mentioned in the online docs and it might save other 
> people time if it was put in there.
> 
> - other nice aggregate functions on the same convenience level as STD() and 
> VAR() would be: SKEWNESS(), CURTOSIS() as they can be used as indicators for 
> the data distribution. 
> Like with STD() and VAR(), the results would be easy enough to query without 
> a specialized aggregate function, however, natively supporting these 
> functions might still produce results a lot faster than issuing 2 separate 
> queries and evaluating a longer query string for each tuple.
> 
> - A MEDIAN() aggregate function would be absolutely great, however, I think 
> will be much harder to implement than the previous two.
> 
> - A PERCENTILE() aggregate function would probably be more generic than 
> MEDIAN() and would be absolutely great for data distribution analysis.
> 
> Best regards
> Jan
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of K. John Wu
> Sent: Friday, October 02, 2009 8:05 PM
> To: FastBit Users
> Subject: Re: [FastBit-users] Fastbit aggregate functions?
> 
> Dear Jan,
> 
> Thanks for your interested in our work.  We appreciate your 
> suggestions and will put them on our list of things to do. 
> Unfortunately, some of the items that takes a lot of programming 
> effort might take a long time to come about.
> 
> Regards,
> 
> John
> 
> 
> On 10/1/2009 11:45 PM, Jan Steemann wrote:
>> Hi,
>>
>> I have been looking into Fastbit's support of aggregate functions and I have 
>> a few questions on that (or probably they turn out to be all feature 
>> requests):
>>
>> 1. is there currently any way to count the number of distinct values in a 
>> group?
>> In some SQL products, I'd issue something like
>> SELECT event_id, /* id of the event */ 
>>        COUNT(*) /* how many times did the event occur */,
>>        COUNT(DISTINCT(item_id)) /* how many distinct items where affected by 
>> the event */
>> FROM table
>> GROUP BY event_id;
>>
>> I have peeked at the ibis source code and did not find any equivalent for a 
>> COUNT(DISTINCT()) or a DISTINCT.
> 
> Currently, there are some support for count function, however, there 
> is some inconsistency in the display of the information.  If you do 
> need to know the how many entries there are in a group, the easiest 
> thing to do is to look at the last column generated by the function 
> ibis::table::select, which is currently always COUNT(*).
> 
> We don't current support the keyword DISTINCT.  Because our target 
> applications do not currently use it, it might take a while for this 
> to be moved close to the top of our to-do list.  With that said, we 
> are always looking for compelling applications.  If you have a good 
> use-case, we might be able to make a case for moving it higher on our 
> to-do list.
> 
>>
>> 2. Native Fastbit support for more aggregate functions like STD(), VAR() 
>> would be great.
> 
> Yes, it would be nice to have these functions natively supported. 
> This is mostly a convenience issue.
> 
>>
>> 3. Conditional logic for aggregate functions would be absolutely great to 
>> create cross-breaks.
>> In some SQL products, this would look like
>> SELECT event_id, SUM(IF(event_type = 1, 1, 0)), SUM(IF(event_type = 2, 1, 
>> 0)) 
>> FROM table
>> GROUP BY event_id
> 
> This can potentially be broken into a number of queries as follows
> 
> select event_id, count(*) from table where event_type = 1;
> select event_id, count(*) from table where event_type = 2;
> select event_id, count(*) from table where event_type = 3;
> ...
> 
> 
>>
>> 4. Finally, HAVING clauses operating on the results of grouped values would 
>> be a nice-to-have extension to Fastbit though not absolutely necessary (can 
>> be implemented outside of Fastbit).
> 
> The HAVING clause is generally a shorthand for nested queries, most of 
> which can be implemented as nested queries as follows in FastBit.
> 
> select store_name, sum(sales) from store_table group by store_name 
> having sum(sales) >1500;
> 
> ibis::table* result1 = store_table.select("store_name, sum(sales) as 
> total", "sales > 0"); // need a dummy where clause here
> ibis:;table* result2 = result1->select("store_name, total", "total > 
> 1500");
> 
> If you do have a chance to try this, please let us know if you 
> encounters any problems.
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] Fastbit aggregate functions?

Reply via email to