Hi John, yes, I will try to clean it up as much as I can and then resend it. It will probably take a week or so.
Best regards Jan -----Original Message----- From: [email protected] [mailto:[email protected]] On Behalf Of K. John Wu Sent: Monday, October 05, 2009 5:29 PM To: FastBit Users Subject: Re: [FastBit-users] Fastbit aggregate functions? Hi, Jan, Thanks for the offering to help. If you are willing to put in more effort, we would really appreciate it. John On 10/5/2009 8:23 AM, Jan Steemann wrote: > Hi John, > > thanks. > I think my code still has some issues. I suggest I clean it up first and then > resend the patch to save you unnecessary work. > Is that ok? > > Best regards > Jan > > > -----Original Message----- > From: [email protected] > [mailto:[email protected]] On Behalf Of K. John Wu > Sent: Monday, October 05, 2009 5:14 PM > To: FastBit Users > Subject: Re: [FastBit-users] Fastbit aggregate functions? > > Hi, Jan, > > Thanks for the patch, we will look into integrate it into out code > shortly and will let you know as soon as we are done with that. > > John > > > On 10/3/2009 5:53 PM, Jan Steemann wrote: >> Hi John, >> >> I have played a little bit with the Fastbit 1.1.3 source code and have added >> basic support for a variance aggregate function. The function is named VAR() >> when invoked from queries. >> >> I have attached a diff with my changes to this email. >> >> Most of the changes have been done in src/colValues.cpp, where are the >> actual values are calculated. >> I also had to change the parser & lexer so they became aware of the new >> function. I have also touched a few other files and added the new function >> there. >> >> My built may have changed a few config or auto-generated build files >> unintentionally, maybe there are also configuration issues because I >> regenerated files with bison & flex. Please excuse any issues. >> >> Furthermore, my implementation for VAR() is definitely not optimal in terms >> of code & performance. It should be considered alpha quality only. I have >> neither tested it with edge cases, other column types, nor measured any >> performance impacts it may have for any existing queries. >> It was more or less intended as a try of how easy it would be to add >> functionality to the SQL interface. >> >> I think it must be cleaned up & tested before it should actually be added to >> the official source. However, I'd like to share the changeset if somebody >> else on the list is interested and wants to take it further. >> >> Please feel free to get back to me in case you should have any further >> questions. >> >> Best regards >> Jan >> >> >> -----Original Message----- >> From: [email protected] >> [mailto:[email protected]] On Behalf Of Jan Steemann >> Sent: Saturday, October 03, 2009 11:12 PM >> To: FastBit Users >> Subject: Re: [FastBit-users] Fastbit aggregate functions? >> >> Hi John, >> >> thanks for getting back to me and for your suggestions. >> >> A few comments back: >> >> 1. DISTINCT: I found the COUNT(*) results at the end of the results rows, >> however, I was particularly interested in distinct values. >> The application area is actually not that scientific. It's about analyzing >> web log files and there finding out which and how many actions individuals >> did. More precisely, it's about how many individuals did start specific >> events. Some example data follows: >> >> individual_id,event_id,timestamp >> 1,100,... /* ind 1 started evnt 100 */ >> 1,101,... /* ind 1 started evnt 101 */ >> 2,100,... /* ind 2 started evnt 100 */ >> 2,101,... /* ind 2 started evnt 101 */ >> 2,100,... /* ind 2 started evnt 100, repeated event */ >> 1,102,... /* ind 1 started evnt 102 */ >> 1,101,... /* ind 1 started evnt 101, repeated event */ >> >> Using the above data, it's easy to find out how many event starts there were >> per individual (COUNT(*) GROUP BY individual_id). >> With SQL, I can also do a COUNT(DISTINCT(event_id) GROUP BY individual_id) >> along in the same query to find out how many unique events have been started >> per individual. This is not possible with ibis. >> The workaround for now is to group not only by individual_id but also by >> event_id. I can then check whether COUNT(*) is bigger than 1 or not. >> However, grouping not only by individual_id but also by event_id would >> increase the result set size by a factor of 100 to 1000 in my case. >> >> >> 2. You are right, it's a nice-to-have feature from my point of view as well. >> STD() and VAR() can be replaced by calculating the values using a few >> separate queries and putting the values together afterwards. Though I'd >> think builtin support for these functions would outperform any workaround >> solutions a great deal. >> >> >> 3. I agree, and again, I can get to the same end result by issuing separate >> queries and putting the results together afterwards. >> >> >> 4. thanks for the suggestion. So far I only used the ibis command line and >> did not my write my own front-end. I will try this in the next few days. >> >> >> Other suggestions: >> - I think the online documentation for IBIS doesn't mention there's >> something like a LIMIT clause. At first, I didn't know why the result set >> were always truncated to the first 10 rows only. I then looked into the >> source and there I found that the LIMIT keyword is supported. This is nice >> but as far as I can tell, it's nowhere mentioned in the online docs and it >> might save other people time if it was put in there. >> >> - other nice aggregate functions on the same convenience level as STD() and >> VAR() would be: SKEWNESS(), CURTOSIS() as they can be used as indicators for >> the data distribution. >> Like with STD() and VAR(), the results would be easy enough to query without >> a specialized aggregate function, however, natively supporting these >> functions might still produce results a lot faster than issuing 2 separate >> queries and evaluating a longer query string for each tuple. >> >> - A MEDIAN() aggregate function would be absolutely great, however, I think >> will be much harder to implement than the previous two. >> >> - A PERCENTILE() aggregate function would probably be more generic than >> MEDIAN() and would be absolutely great for data distribution analysis. >> >> Best regards >> Jan >> >> -----Original Message----- >> From: [email protected] >> [mailto:[email protected]] On Behalf Of K. John Wu >> Sent: Friday, October 02, 2009 8:05 PM >> To: FastBit Users >> Subject: Re: [FastBit-users] Fastbit aggregate functions? >> >> Dear Jan, >> >> Thanks for your interested in our work. We appreciate your >> suggestions and will put them on our list of things to do. >> Unfortunately, some of the items that takes a lot of programming >> effort might take a long time to come about. >> >> Regards, >> >> John >> >> >> On 10/1/2009 11:45 PM, Jan Steemann wrote: >>> Hi, >>> >>> I have been looking into Fastbit's support of aggregate functions and I >>> have a few questions on that (or probably they turn out to be all feature >>> requests): >>> >>> 1. is there currently any way to count the number of distinct values in a >>> group? >>> In some SQL products, I'd issue something like >>> SELECT event_id, /* id of the event */ >>> COUNT(*) /* how many times did the event occur */, >>> COUNT(DISTINCT(item_id)) /* how many distinct items where affected >>> by the event */ >>> FROM table >>> GROUP BY event_id; >>> >>> I have peeked at the ibis source code and did not find any equivalent for a >>> COUNT(DISTINCT()) or a DISTINCT. >> Currently, there are some support for count function, however, there >> is some inconsistency in the display of the information. If you do >> need to know the how many entries there are in a group, the easiest >> thing to do is to look at the last column generated by the function >> ibis::table::select, which is currently always COUNT(*). >> >> We don't current support the keyword DISTINCT. Because our target >> applications do not currently use it, it might take a while for this >> to be moved close to the top of our to-do list. With that said, we >> are always looking for compelling applications. If you have a good >> use-case, we might be able to make a case for moving it higher on our >> to-do list. >> >>> 2. Native Fastbit support for more aggregate functions like STD(), VAR() >>> would be great. >> Yes, it would be nice to have these functions natively supported. >> This is mostly a convenience issue. >> >>> 3. Conditional logic for aggregate functions would be absolutely great to >>> create cross-breaks. >>> In some SQL products, this would look like >>> SELECT event_id, SUM(IF(event_type = 1, 1, 0)), SUM(IF(event_type = 2, 1, >>> 0)) >>> FROM table >>> GROUP BY event_id >> This can potentially be broken into a number of queries as follows >> >> select event_id, count(*) from table where event_type = 1; >> select event_id, count(*) from table where event_type = 2; >> select event_id, count(*) from table where event_type = 3; >> ... >> >> >>> 4. Finally, HAVING clauses operating on the results of grouped values would >>> be a nice-to-have extension to Fastbit though not absolutely necessary (can >>> be implemented outside of Fastbit). >> The HAVING clause is generally a shorthand for nested queries, most of >> which can be implemented as nested queries as follows in FastBit. >> >> select store_name, sum(sales) from store_table group by store_name >> having sum(sales) >1500; >> >> ibis::table* result1 = store_table.select("store_name, sum(sales) as >> total", "sales > 0"); // need a dummy where clause here >> ibis:;table* result2 = result1->select("store_name, total", "total > >> 1500"); >> >> If you do have a chance to try this, please let us know if you >> encounters any problems. >> _______________________________________________ >> FastBit-users mailing list >> [email protected] >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >> _______________________________________________ >> FastBit-users mailing list >> [email protected] >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> FastBit-users mailing list >> [email protected] >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
