Re: [FastBit-users] Fastbit aggregate functions?

Jan Steemann Mon, 05 Oct 2009 08:40:02 -0700

Hi John,

yes, I will try to clean it up as much as I can and then resend it. 
It will probably take a week or so.


Best regards
Jan

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of K. John Wu
Sent: Monday, October 05, 2009 5:29 PM
To: FastBit Users
Subject: Re: [FastBit-users] Fastbit aggregate functions?

Hi, Jan,

Thanks for the offering to help.  If you are willing to put in more 
effort, we would really appreciate it.

John



On 10/5/2009 8:23 AM, Jan Steemann wrote:
> Hi John,
> 
> thanks. 
> I think my code still has some issues. I suggest I clean it up first and then 
> resend the patch to save you unnecessary work.
> Is that ok?
> 
> Best regards
> Jan
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of K. John Wu
> Sent: Monday, October 05, 2009 5:14 PM
> To: FastBit Users
> Subject: Re: [FastBit-users] Fastbit aggregate functions?
> 
> Hi, Jan,
> 
> Thanks for the patch, we will look into integrate it into out code 
> shortly and will let you know as soon as we are done with that.
> 
> John
> 
> 
> On 10/3/2009 5:53 PM, Jan Steemann wrote:
>> Hi John,
>>
>> I have played a little bit with the Fastbit 1.1.3 source code and have added 
>> basic support for a variance aggregate function. The function is named VAR() 
>> when invoked from queries.
>>  
>> I have attached a diff with my changes to this email.
>>
>> Most of the changes have been done in src/colValues.cpp, where are the 
>> actual values are calculated.
>> I also had to change the parser & lexer so they became aware of the new 
>> function. I have also touched a few other files and added the new function 
>> there.
>>
>> My built may have changed a few config or auto-generated build files 
>> unintentionally, maybe there are also configuration issues because I 
>> regenerated files with bison & flex. Please excuse any issues.
>>
>> Furthermore, my implementation for VAR() is definitely not optimal in terms 
>> of code & performance. It should be considered alpha quality only. I have 
>> neither tested it with edge cases, other column types, nor measured any 
>> performance impacts it may have for any existing queries.
>> It was more or less intended as a try of how easy it would be to add 
>> functionality to the SQL interface.
>>
>> I think it must be cleaned up & tested before it should actually be added to 
>> the official source. However, I'd like to share the changeset if somebody 
>> else on the list is interested and wants to take it further.
>>
>> Please feel free to get back to me in case you should have any further 
>> questions.
>>
>> Best regards
>> Jan
>>
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Jan Steemann
>> Sent: Saturday, October 03, 2009 11:12 PM
>> To: FastBit Users
>> Subject: Re: [FastBit-users] Fastbit aggregate functions?
>>
>> Hi John,
>>
>> thanks for getting back to me and for your suggestions.
>>
>> A few comments back:
>>
>> 1. DISTINCT: I found the COUNT(*) results at the end of the results rows, 
>> however, I was particularly interested in distinct values. 
>> The application area is actually not that scientific. It's about analyzing 
>> web log files and there finding out which and how many actions individuals 
>> did. More precisely, it's about how many individuals did start specific 
>> events. Some example data follows:
>>
>> individual_id,event_id,timestamp
>> 1,100,... /* ind 1 started evnt 100 */
>> 1,101,... /* ind 1 started evnt 101 */
>> 2,100,... /* ind 2 started evnt 100 */
>> 2,101,... /* ind 2 started evnt 101 */
>> 2,100,... /* ind 2 started evnt 100, repeated event */
>> 1,102,... /* ind 1 started evnt 102 */
>> 1,101,... /* ind 1 started evnt 101, repeated event */
>>
>> Using the above data, it's easy to find out how many event starts there were 
>> per individual (COUNT(*) GROUP BY individual_id). 
>> With SQL, I can also do a COUNT(DISTINCT(event_id) GROUP BY individual_id) 
>> along in the same query to find out how many unique events have been started 
>> per individual. This is not possible with ibis.
>> The workaround for now is to group not only by individual_id but also by 
>> event_id. I can then check whether COUNT(*) is bigger than 1 or not. 
>> However, grouping not only by individual_id but also by event_id would 
>> increase the result set size by a factor of 100 to 1000 in my case.
>>
>>
>> 2. You are right, it's a nice-to-have feature from my point of view as well. 
>> STD() and VAR() can be replaced by calculating the values using a few 
>> separate queries and putting the values together afterwards. Though I'd 
>> think builtin support for these functions would outperform any workaround 
>> solutions a great deal. 
>>
>>
>> 3. I agree, and again, I can get to the same end result by issuing separate 
>> queries and putting the results together afterwards.
>>
>>
>> 4. thanks for the suggestion. So far I only used the ibis command line and 
>> did not my write my own front-end. I will try this in the next few days. 
>>
>>
>> Other suggestions:
>> - I think the online documentation for IBIS doesn't mention there's 
>> something like a LIMIT clause. At first, I didn't know why the result set 
>> were always truncated to the first 10 rows only. I then looked into the 
>> source and there I found that the LIMIT keyword is supported. This is nice 
>> but as far as I can tell, it's nowhere mentioned in the online docs and it 
>> might save other people time if it was put in there.
>>
>> - other nice aggregate functions on the same convenience level as STD() and 
>> VAR() would be: SKEWNESS(), CURTOSIS() as they can be used as indicators for 
>> the data distribution. 
>> Like with STD() and VAR(), the results would be easy enough to query without 
>> a specialized aggregate function, however, natively supporting these 
>> functions might still produce results a lot faster than issuing 2 separate 
>> queries and evaluating a longer query string for each tuple.
>>
>> - A MEDIAN() aggregate function would be absolutely great, however, I think 
>> will be much harder to implement than the previous two.
>>
>> - A PERCENTILE() aggregate function would probably be more generic than 
>> MEDIAN() and would be absolutely great for data distribution analysis.
>>
>> Best regards
>> Jan
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of K. John Wu
>> Sent: Friday, October 02, 2009 8:05 PM
>> To: FastBit Users
>> Subject: Re: [FastBit-users] Fastbit aggregate functions?
>>
>> Dear Jan,
>>
>> Thanks for your interested in our work.  We appreciate your 
>> suggestions and will put them on our list of things to do. 
>> Unfortunately, some of the items that takes a lot of programming 
>> effort might take a long time to come about.
>>
>> Regards,
>>
>> John
>>
>>
>> On 10/1/2009 11:45 PM, Jan Steemann wrote:
>>> Hi,
>>>
>>> I have been looking into Fastbit's support of aggregate functions and I 
>>> have a few questions on that (or probably they turn out to be all feature 
>>> requests):
>>>
>>> 1. is there currently any way to count the number of distinct values in a 
>>> group?
>>> In some SQL products, I'd issue something like
>>> SELECT event_id, /* id of the event */ 
>>>        COUNT(*) /* how many times did the event occur */,
>>>        COUNT(DISTINCT(item_id)) /* how many distinct items where affected 
>>> by the event */
>>> FROM table
>>> GROUP BY event_id;
>>>
>>> I have peeked at the ibis source code and did not find any equivalent for a 
>>> COUNT(DISTINCT()) or a DISTINCT.
>> Currently, there are some support for count function, however, there 
>> is some inconsistency in the display of the information.  If you do 
>> need to know the how many entries there are in a group, the easiest 
>> thing to do is to look at the last column generated by the function 
>> ibis::table::select, which is currently always COUNT(*).
>>
>> We don't current support the keyword DISTINCT.  Because our target 
>> applications do not currently use it, it might take a while for this 
>> to be moved close to the top of our to-do list.  With that said, we 
>> are always looking for compelling applications.  If you have a good 
>> use-case, we might be able to make a case for moving it higher on our 
>> to-do list.
>>
>>> 2. Native Fastbit support for more aggregate functions like STD(), VAR() 
>>> would be great.
>> Yes, it would be nice to have these functions natively supported. 
>> This is mostly a convenience issue.
>>
>>> 3. Conditional logic for aggregate functions would be absolutely great to 
>>> create cross-breaks.
>>> In some SQL products, this would look like
>>> SELECT event_id, SUM(IF(event_type = 1, 1, 0)), SUM(IF(event_type = 2, 1, 
>>> 0)) 
>>> FROM table
>>> GROUP BY event_id
>> This can potentially be broken into a number of queries as follows
>>
>> select event_id, count(*) from table where event_type = 1;
>> select event_id, count(*) from table where event_type = 2;
>> select event_id, count(*) from table where event_type = 3;
>> ...
>>
>>
>>> 4. Finally, HAVING clauses operating on the results of grouped values would 
>>> be a nice-to-have extension to Fastbit though not absolutely necessary (can 
>>> be implemented outside of Fastbit).
>> The HAVING clause is generally a shorthand for nested queries, most of 
>> which can be implemented as nested queries as follows in FastBit.
>>
>> select store_name, sum(sales) from store_table group by store_name 
>> having sum(sales) >1500;
>>
>> ibis::table* result1 = store_table.select("store_name, sum(sales) as 
>> total", "sales > 0"); // need a dummy where clause here
>> ibis:;table* result2 = result1->select("store_name, total", "total > 
>> 1500");
>>
>> If you do have a chance to try this, please let us know if you 
>> encounters any problems.
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] Fastbit aggregate functions?

Reply via email to