Re: [FastBit-users] Fastbit aggregate functions?

Jan Steemann Sat, 10 Oct 2009 12:47:06 -0700

Hi John,

please find below two observations I made when playing around with Fastbit that 
I'd like to share:



1.) Column order of count(*) is relevant

When selecting count(*) together with other columns, then the order of the 
count(*) in the columns list really makes a difference. If count(*) is first in 
the select list, all other columns seem to be skipped from the result set.

Example:
ibis -d test -n -q "select type,count(*) where 1=1 limit 10000"
tableSelect -- select(type,count(*), 1=1 ) on table T-test produced a table 
with 11 rows and 2 columns
tableSelect -- the result table (11 x 2) for "SELECT type,count(*) FROM T-test 
WHERE 1=1  LIMIT 10000"
1, 9089780
...

This works as expected (result set contains 2 columns).


Now, move count(*) to the front of the select list:

ibis -d test -n -q "select count(*),type where 1=1 limit 10000"
tableSelect -- select(count(*),type, 1=1 ) on table T-test produced a table 
with 1 row and 1 column
tableSelect -- the result table (1 x 1) for "SELECT count(*),type FROM T-test 
WHERE 1=1  LIMIT 10000"
100000000

This is unexpected as the result set now only has one column but I specified 2 
(just changed their order).



The same also happens when additional columns are put into the select list:

ibis -d test -n -q "select count(*),type,max(pid) where 1=1 limit 10000"
tableSelect -- select(count(*),type,max(pid), 1=1 ) on table T-test produced a 
table with 1 row and 1 column
tableSelect -- the result table (1 x 1) for "SELECT count(*),type,max(pid) FROM 
T-test WHERE 1=1  LIMIT 10000"
100000000

The result set again has only 1 instead of 3 columns.

I think everything after count(*) in the select list is silently discarded.
Maybe this is a feature, but I would consider it to be a rather unusal one.




2.) Specified limit clause is ignored until output is created, creating 
unnecessary overhead

When running a query that produces 46M groups from a table with 100M records, 
the limit clause is not taken into account until the final results are 
returned. I think this leads to unnecessary overhead because the groups are 
still sorted, even if they are not used later in the final results.

Example invocation:
ibis -v=9 -d test -n -q "select pid,uid,count(*) where 1=1 limit 1"

There is an explicit limit (1) specified (it's also used when returning the 
results) and there is no explicit order clause.
However, the 46M result values are still reordered before the 1 row is returned:

... (from the debug output) ...
ibis::bord::part constructed in-memory data partition TTmJJu -- GROUP BY pid, 
uid on table TNQr6G (SELECT pid,uid,count(*) FROM test WHERE 1=1 )
with 46627293 rows and 3 columns
clearing data partition TNQr6G
clearing column TNQr6G.pid
clearing column TNQr6G.uid
  table::select("pid,uid,count(*)", 1=1 ) -- duration: 70.0524 sec(CPU), 
71.3946 sec(elapsed)
tableSelect -- select(pid,uid,count(*), 1=1 ) on table T-test produced a table 
with 46627293 rows and 3 columns

...

part[TTmJJu]::reorder -- user did not specify ordering keys, will attempt to 
use all integer columns as ordering keys
column[TTmJJu.COUNT(*)](UINT)::computeMinMax -- nelm = 46627293, min = 1, max = 
12
column[TTmJJu.pid](INT)::computeMinMax -- nelm = 46627293, min = 105, max = 1100
column[TTmJJu.uid](INT)::computeMinMax -- nelm = 46627293, min = 1, max = 500000
  part[TTmJJu]::reorder(COUNT(*), pid, uid) -- start timer ...
part[TTmJJu]::reorderValues -- (re)set array starts to contain [0, 46627293]

...

part[TTmJJu]::reorderValues -- reordered 46627293 values (into 1611 segments) 
in 4.16826 sec(CPU), 4.16807 sec(elapsed)
part[TTmJJu]::reorderValues -- reordered 46627293 values (into 46627293 
segments) in 11.0567 sec(CPU), 11.0564 sec(elapsed)
  part[TTmJJu]::reorder(COUNT(*), pid, uid) -- duration: 16.8811 sec(CPU), 
16.8814 sec(elapsed)


If am not totally wrong, this reordervalues operation works on the already 
reduced data. If that's true I think it can safely be optimized away in a 
situation where there is no explicit order clause specified. Again, maybe I hit 
a feature that there should be always be a guaranteed implicit sort order even 
if no explicit order is specified. But if not, optimizing away the extra sort 
could increase the performance of queries with many groups plus no order and a 
limit.


Finally, I ran into problems when loading data from a 4 column, 100M rows CSV 
file using ardea. It seemed to always hit some memory limits and aborted.
Splliting the 100M rows file into 2 files with 50M rows each and loading them 
seperately worked fine.
However, is there a way to either cap the memory usage for data loading so the 
size of the input file does not matter (with the tradeoff of the import taking 
longer) or to predict the maximum size an import file can have so it can be 
loaded with a given amount of memory?

Thanks for your answer and regards
Jan


-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of K. John Wu
Sent: Monday, October 05, 2009 5:29 PM
To: FastBit Users
Subject: Re: [FastBit-users] Fastbit aggregate functions?

Hi, Jan,

Thanks for the offering to help.  If you are willing to put in more
effort, we would really appreciate it.

John



On 10/5/2009 8:23 AM, Jan Steemann wrote:
> Hi John,
>
> thanks.
> I think my code still has some issues. I suggest I clean it up first and then 
> resend the patch to save you unnecessary work.
> Is that ok?
>
> Best regards
> Jan
>
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of K. John Wu
> Sent: Monday, October 05, 2009 5:14 PM
> To: FastBit Users
> Subject: Re: [FastBit-users] Fastbit aggregate functions?
>
> Hi, Jan,
>
> Thanks for the patch, we will look into integrate it into out code
> shortly and will let you know as soon as we are done with that.
>
> John
>
>
> On 10/3/2009 5:53 PM, Jan Steemann wrote:
>> Hi John,
>>
>> I have played a little bit with the Fastbit 1.1.3 source code and have added 
>> basic support for a variance aggregate function. The function is named VAR() 
>> when invoked from queries.
>>
>> I have attached a diff with my changes to this email.
>>
>> Most of the changes have been done in src/colValues.cpp, where are the 
>> actual values are calculated.
>> I also had to change the parser & lexer so they became aware of the new 
>> function. I have also touched a few other files and added the new function 
>> there.
>>
>> My built may have changed a few config or auto-generated build files 
>> unintentionally, maybe there are also configuration issues because I 
>> regenerated files with bison & flex. Please excuse any issues.
>>
>> Furthermore, my implementation for VAR() is definitely not optimal in terms 
>> of code & performance. It should be considered alpha quality only. I have 
>> neither tested it with edge cases, other column types, nor measured any 
>> performance impacts it may have for any existing queries.
>> It was more or less intended as a try of how easy it would be to add 
>> functionality to the SQL interface.
>>
>> I think it must be cleaned up & tested before it should actually be added to 
>> the official source. However, I'd like to share the changeset if somebody 
>> else on the list is interested and wants to take it further.
>>
>> Please feel free to get back to me in case you should have any further 
>> questions.
>>
>> Best regards
>> Jan
>>
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Jan Steemann
>> Sent: Saturday, October 03, 2009 11:12 PM
>> To: FastBit Users
>> Subject: Re: [FastBit-users] Fastbit aggregate functions?
>>
>> Hi John,
>>
>> thanks for getting back to me and for your suggestions.
>>
>> A few comments back:
>>
>> 1. DISTINCT: I found the COUNT(*) results at the end of the results rows, 
>> however, I was particularly interested in distinct values.
>> The application area is actually not that scientific. It's about analyzing 
>> web log files and there finding out which and how many actions individuals 
>> did. More precisely, it's about how many individuals did start specific 
>> events. Some example data follows:
>>
>> individual_id,event_id,timestamp
>> 1,100,... /* ind 1 started evnt 100 */
>> 1,101,... /* ind 1 started evnt 101 */
>> 2,100,... /* ind 2 started evnt 100 */
>> 2,101,... /* ind 2 started evnt 101 */
>> 2,100,... /* ind 2 started evnt 100, repeated event */
>> 1,102,... /* ind 1 started evnt 102 */
>> 1,101,... /* ind 1 started evnt 101, repeated event */
>>
>> Using the above data, it's easy to find out how many event starts there were 
>> per individual (COUNT(*) GROUP BY individual_id).
>> With SQL, I can also do a COUNT(DISTINCT(event_id) GROUP BY individual_id) 
>> along in the same query to find out how many unique events have been started 
>> per individual. This is not possible with ibis.
>> The workaround for now is to group not only by individual_id but also by 
>> event_id. I can then check whether COUNT(*) is bigger than 1 or not. 
>> However, grouping not only by individual_id but also by event_id would 
>> increase the result set size by a factor of 100 to 1000 in my case.
>>
>>
>> 2. You are right, it's a nice-to-have feature from my point of view as well. 
>> STD() and VAR() can be replaced by calculating the values using a few 
>> separate queries and putting the values together afterwards. Though I'd 
>> think builtin support for these functions would outperform any workaround 
>> solutions a great deal.
>>
>>
>> 3. I agree, and again, I can get to the same end result by issuing separate 
>> queries and putting the results together afterwards.
>>
>>
>> 4. thanks for the suggestion. So far I only used the ibis command line and 
>> did not my write my own front-end. I will try this in the next few days.
>>
>>
>> Other suggestions:
>> - I think the online documentation for IBIS doesn't mention there's 
>> something like a LIMIT clause. At first, I didn't know why the result set 
>> were always truncated to the first 10 rows only. I then looked into the 
>> source and there I found that the LIMIT keyword is supported. This is nice 
>> but as far as I can tell, it's nowhere mentioned in the online docs and it 
>> might save other people time if it was put in there.
>>
>> - other nice aggregate functions on the same convenience level as STD() and 
>> VAR() would be: SKEWNESS(), CURTOSIS() as they can be used as indicators for 
>> the data distribution.
>> Like with STD() and VAR(), the results would be easy enough to query without 
>> a specialized aggregate function, however, natively supporting these 
>> functions might still produce results a lot faster than issuing 2 separate 
>> queries and evaluating a longer query string for each tuple.
>>
>> - A MEDIAN() aggregate function would be absolutely great, however, I think 
>> will be much harder to implement than the previous two.
>>
>> - A PERCENTILE() aggregate function would probably be more generic than 
>> MEDIAN() and would be absolutely great for data distribution analysis.
>>
>> Best regards
>> Jan
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of K. John Wu
>> Sent: Friday, October 02, 2009 8:05 PM
>> To: FastBit Users
>> Subject: Re: [FastBit-users] Fastbit aggregate functions?
>>
>> Dear Jan,
>>
>> Thanks for your interested in our work.  We appreciate your
>> suggestions and will put them on our list of things to do.
>> Unfortunately, some of the items that takes a lot of programming
>> effort might take a long time to come about.
>>
>> Regards,
>>
>> John
>>
>>
>> On 10/1/2009 11:45 PM, Jan Steemann wrote:
>>> Hi,
>>>
>>> I have been looking into Fastbit's support of aggregate functions and I 
>>> have a few questions on that (or probably they turn out to be all feature 
>>> requests):
>>>
>>> 1. is there currently any way to count the number of distinct values in a 
>>> group?
>>> In some SQL products, I'd issue something like
>>> SELECT event_id, /* id of the event */
>>>        COUNT(*) /* how many times did the event occur */,
>>>        COUNT(DISTINCT(item_id)) /* how many distinct items where affected 
>>> by the event */
>>> FROM table
>>> GROUP BY event_id;
>>>
>>> I have peeked at the ibis source code and did not find any equivalent for a 
>>> COUNT(DISTINCT()) or a DISTINCT.
>> Currently, there are some support for count function, however, there
>> is some inconsistency in the display of the information.  If you do
>> need to know the how many entries there are in a group, the easiest
>> thing to do is to look at the last column generated by the function
>> ibis::table::select, which is currently always COUNT(*).
>>
>> We don't current support the keyword DISTINCT.  Because our target
>> applications do not currently use it, it might take a while for this
>> to be moved close to the top of our to-do list.  With that said, we
>> are always looking for compelling applications.  If you have a good
>> use-case, we might be able to make a case for moving it higher on our
>> to-do list.
>>
>>> 2. Native Fastbit support for more aggregate functions like STD(), VAR() 
>>> would be great.
>> Yes, it would be nice to have these functions natively supported.
>> This is mostly a convenience issue.
>>
>>> 3. Conditional logic for aggregate functions would be absolutely great to 
>>> create cross-breaks.
>>> In some SQL products, this would look like
>>> SELECT event_id, SUM(IF(event_type = 1, 1, 0)), SUM(IF(event_type = 2, 1, 
>>> 0))
>>> FROM table
>>> GROUP BY event_id
>> This can potentially be broken into a number of queries as follows
>>
>> select event_id, count(*) from table where event_type = 1;
>> select event_id, count(*) from table where event_type = 2;
>> select event_id, count(*) from table where event_type = 3;
>> ...
>>
>>
>>> 4. Finally, HAVING clauses operating on the results of grouped values would 
>>> be a nice-to-have extension to Fastbit though not absolutely necessary (can 
>>> be implemented outside of Fastbit).
>> The HAVING clause is generally a shorthand for nested queries, most of
>> which can be implemented as nested queries as follows in FastBit.
>>
>> select store_name, sum(sales) from store_table group by store_name
>> having sum(sales) >1500;
>>
>> ibis::table* result1 = store_table.select("store_name, sum(sales) as
>> total", "sales > 0"); // need a dummy where clause here
>> ibis:;table* result2 = result1->select("store_name, total", "total >
>> 1500");
>>
>> If you do have a chance to try this, please let us know if you
>> encounters any problems.
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] Fastbit aggregate functions?

Reply via email to