Re: [FastBit-users] Fastbit aggregate functions?

K. John Wu Sun, 11 Oct 2009 15:56:09 -0700

Hi, Jan,

Thanks for the observations.  We will look into them shortly and will 
get back to you when we start working on the issues.


John


On 10/10/2009 12:46 PM, Jan Steemann wrote:
> Hi John,
> 
> please find below two observations I made when playing around with Fastbit 
> that I'd like to share:
> 
> 
> 1.) Column order of count(*) is relevant
> 
> When selecting count(*) together with other columns, then the order of the 
> count(*) in the columns list really makes a difference. If count(*) is first 
> in the select list, all other columns seem to be skipped from the result set.
> 
> Example:
> ibis -d test -n -q "select type,count(*) where 1=1 limit 10000"
> tableSelect -- select(type,count(*), 1=1 ) on table T-test produced a table 
> with 11 rows and 2 columns
> tableSelect -- the result table (11 x 2) for "SELECT type,count(*) FROM 
> T-test WHERE 1=1  LIMIT 10000"
> 1, 9089780
> ...
> 
> This works as expected (result set contains 2 columns).
> 
> 
> Now, move count(*) to the front of the select list:
> 
> ibis -d test -n -q "select count(*),type where 1=1 limit 10000"
> tableSelect -- select(count(*),type, 1=1 ) on table T-test produced a table 
> with 1 row and 1 column
> tableSelect -- the result table (1 x 1) for "SELECT count(*),type FROM T-test 
> WHERE 1=1  LIMIT 10000"
> 100000000
> 
> This is unexpected as the result set now only has one column but I specified 
> 2 (just changed their order).
> 
> 
> 
> The same also happens when additional columns are put into the select list:
> 
> ibis -d test -n -q "select count(*),type,max(pid) where 1=1 limit 10000"
> tableSelect -- select(count(*),type,max(pid), 1=1 ) on table T-test produced 
> a table with 1 row and 1 column
> tableSelect -- the result table (1 x 1) for "SELECT count(*),type,max(pid) 
> FROM T-test WHERE 1=1  LIMIT 10000"
> 100000000
> 
> The result set again has only 1 instead of 3 columns.
> 
> I think everything after count(*) in the select list is silently discarded.
> Maybe this is a feature, but I would consider it to be a rather unusal one.
> 
> 
> 
> 
> 2.) Specified limit clause is ignored until output is created, creating 
> unnecessary overhead
> 
> When running a query that produces 46M groups from a table with 100M records, 
> the limit clause is not taken into account until the final results are 
> returned. I think this leads to unnecessary overhead because the groups are 
> still sorted, even if they are not used later in the final results.
> 
> Example invocation:
> ibis -v=9 -d test -n -q "select pid,uid,count(*) where 1=1 limit 1"
> 
> There is an explicit limit (1) specified (it's also used when returning the 
> results) and there is no explicit order clause.
> However, the 46M result values are still reordered before the 1 row is 
> returned:
> 
> ... (from the debug output) ...
> ibis::bord::part constructed in-memory data partition TTmJJu -- GROUP BY pid, 
> uid on table TNQr6G (SELECT pid,uid,count(*) FROM test WHERE 1=1 )
> with 46627293 rows and 3 columns
> clearing data partition TNQr6G
> clearing column TNQr6G.pid
> clearing column TNQr6G.uid
>   table::select("pid,uid,count(*)", 1=1 ) -- duration: 70.0524 sec(CPU), 
> 71.3946 sec(elapsed)
> tableSelect -- select(pid,uid,count(*), 1=1 ) on table T-test produced a 
> table with 46627293 rows and 3 columns
> 
> ...
> 
> part[TTmJJu]::reorder -- user did not specify ordering keys, will attempt to 
> use all integer columns as ordering keys
> column[TTmJJu.COUNT(*)](UINT)::computeMinMax -- nelm = 46627293, min = 1, max 
> = 12
> column[TTmJJu.pid](INT)::computeMinMax -- nelm = 46627293, min = 105, max = 
> 1100
> column[TTmJJu.uid](INT)::computeMinMax -- nelm = 46627293, min = 1, max = 
> 500000
>   part[TTmJJu]::reorder(COUNT(*), pid, uid) -- start timer ...
> part[TTmJJu]::reorderValues -- (re)set array starts to contain [0, 46627293]
> 
> ...
> 
> part[TTmJJu]::reorderValues -- reordered 46627293 values (into 1611 segments) 
> in 4.16826 sec(CPU), 4.16807 sec(elapsed)
> part[TTmJJu]::reorderValues -- reordered 46627293 values (into 46627293 
> segments) in 11.0567 sec(CPU), 11.0564 sec(elapsed)
>   part[TTmJJu]::reorder(COUNT(*), pid, uid) -- duration: 16.8811 sec(CPU), 
> 16.8814 sec(elapsed)
> 
> 
> If am not totally wrong, this reordervalues operation works on the already 
> reduced data. If that's true I think it can safely be optimized away in a 
> situation where there is no explicit order clause specified. Again, maybe I 
> hit a feature that there should be always be a guaranteed implicit sort order 
> even if no explicit order is specified. But if not, optimizing away the extra 
> sort could increase the performance of queries with many groups plus no order 
> and a limit.
> 
> 
> Finally, I ran into problems when loading data from a 4 column, 100M rows CSV 
> file using ardea. It seemed to always hit some memory limits and aborted.
> Splliting the 100M rows file into 2 files with 50M rows each and loading them 
> seperately worked fine.
> However, is there a way to either cap the memory usage for data loading so 
> the size of the input file does not matter (with the tradeoff of the import 
> taking longer) or to predict the maximum size an import file can have so it 
> can be loaded with a given amount of memory?
> 
> Thanks for your answer and regards
> Jan
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of K. John Wu
> Sent: Monday, October 05, 2009 5:29 PM
> To: FastBit Users
> Subject: Re: [FastBit-users] Fastbit aggregate functions?
> 
> Hi, Jan,
> 
> Thanks for the offering to help.  If you are willing to put in more
> effort, we would really appreciate it.
> 
> John
> 
> 
> 
> On 10/5/2009 8:23 AM, Jan Steemann wrote:
>> Hi John,
>>
>> thanks.
>> I think my code still has some issues. I suggest I clean it up first and 
>> then resend the patch to save you unnecessary work.
>> Is that ok?
>>
>> Best regards
>> Jan
>>
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of K. John Wu
>> Sent: Monday, October 05, 2009 5:14 PM
>> To: FastBit Users
>> Subject: Re: [FastBit-users] Fastbit aggregate functions?
>>
>> Hi, Jan,
>>
>> Thanks for the patch, we will look into integrate it into out code
>> shortly and will let you know as soon as we are done with that.
>>
>> John
>>
>>
>> On 10/3/2009 5:53 PM, Jan Steemann wrote:
>>> Hi John,
>>>
>>> I have played a little bit with the Fastbit 1.1.3 source code and have 
>>> added basic support for a variance aggregate function. The function is 
>>> named VAR() when invoked from queries.
>>>
>>> I have attached a diff with my changes to this email.
>>>
>>> Most of the changes have been done in src/colValues.cpp, where are the 
>>> actual values are calculated.
>>> I also had to change the parser & lexer so they became aware of the new 
>>> function. I have also touched a few other files and added the new function 
>>> there.
>>>
>>> My built may have changed a few config or auto-generated build files 
>>> unintentionally, maybe there are also configuration issues because I 
>>> regenerated files with bison & flex. Please excuse any issues.
>>>
>>> Furthermore, my implementation for VAR() is definitely not optimal in terms 
>>> of code & performance. It should be considered alpha quality only. I have 
>>> neither tested it with edge cases, other column types, nor measured any 
>>> performance impacts it may have for any existing queries.
>>> It was more or less intended as a try of how easy it would be to add 
>>> functionality to the SQL interface.
>>>
>>> I think it must be cleaned up & tested before it should actually be added 
>>> to the official source. However, I'd like to share the changeset if 
>>> somebody else on the list is interested and wants to take it further.
>>>
>>> Please feel free to get back to me in case you should have any further 
>>> questions.
>>>
>>> Best regards
>>> Jan
>>>
>>>
>>> -----Original Message-----
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf Of Jan Steemann
>>> Sent: Saturday, October 03, 2009 11:12 PM
>>> To: FastBit Users
>>> Subject: Re: [FastBit-users] Fastbit aggregate functions?
>>>
>>> Hi John,
>>>
>>> thanks for getting back to me and for your suggestions.
>>>
>>> A few comments back:
>>>
>>> 1. DISTINCT: I found the COUNT(*) results at the end of the results rows, 
>>> however, I was particularly interested in distinct values.
>>> The application area is actually not that scientific. It's about analyzing 
>>> web log files and there finding out which and how many actions individuals 
>>> did. More precisely, it's about how many individuals did start specific 
>>> events. Some example data follows:
>>>
>>> individual_id,event_id,timestamp
>>> 1,100,... /* ind 1 started evnt 100 */
>>> 1,101,... /* ind 1 started evnt 101 */
>>> 2,100,... /* ind 2 started evnt 100 */
>>> 2,101,... /* ind 2 started evnt 101 */
>>> 2,100,... /* ind 2 started evnt 100, repeated event */
>>> 1,102,... /* ind 1 started evnt 102 */
>>> 1,101,... /* ind 1 started evnt 101, repeated event */
>>>
>>> Using the above data, it's easy to find out how many event starts there 
>>> were per individual (COUNT(*) GROUP BY individual_id).
>>> With SQL, I can also do a COUNT(DISTINCT(event_id) GROUP BY individual_id) 
>>> along in the same query to find out how many unique events have been 
>>> started per individual. This is not possible with ibis.
>>> The workaround for now is to group not only by individual_id but also by 
>>> event_id. I can then check whether COUNT(*) is bigger than 1 or not. 
>>> However, grouping not only by individual_id but also by event_id would 
>>> increase the result set size by a factor of 100 to 1000 in my case.
>>>
>>>
>>> 2. You are right, it's a nice-to-have feature from my point of view as 
>>> well. STD() and VAR() can be replaced by calculating the values using a few 
>>> separate queries and putting the values together afterwards. Though I'd 
>>> think builtin support for these functions would outperform any workaround 
>>> solutions a great deal.
>>>
>>>
>>> 3. I agree, and again, I can get to the same end result by issuing separate 
>>> queries and putting the results together afterwards.
>>>
>>>
>>> 4. thanks for the suggestion. So far I only used the ibis command line and 
>>> did not my write my own front-end. I will try this in the next few days.
>>>
>>>
>>> Other suggestions:
>>> - I think the online documentation for IBIS doesn't mention there's 
>>> something like a LIMIT clause. At first, I didn't know why the result set 
>>> were always truncated to the first 10 rows only. I then looked into the 
>>> source and there I found that the LIMIT keyword is supported. This is nice 
>>> but as far as I can tell, it's nowhere mentioned in the online docs and it 
>>> might save other people time if it was put in there.
>>>
>>> - other nice aggregate functions on the same convenience level as STD() and 
>>> VAR() would be: SKEWNESS(), CURTOSIS() as they can be used as indicators 
>>> for the data distribution.
>>> Like with STD() and VAR(), the results would be easy enough to query 
>>> without a specialized aggregate function, however, natively supporting 
>>> these functions might still produce results a lot faster than issuing 2 
>>> separate queries and evaluating a longer query string for each tuple.
>>>
>>> - A MEDIAN() aggregate function would be absolutely great, however, I think 
>>> will be much harder to implement than the previous two.
>>>
>>> - A PERCENTILE() aggregate function would probably be more generic than 
>>> MEDIAN() and would be absolutely great for data distribution analysis.
>>>
>>> Best regards
>>> Jan
>>>
>>> -----Original Message-----
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf Of K. John Wu
>>> Sent: Friday, October 02, 2009 8:05 PM
>>> To: FastBit Users
>>> Subject: Re: [FastBit-users] Fastbit aggregate functions?
>>>
>>> Dear Jan,
>>>
>>> Thanks for your interested in our work.  We appreciate your
>>> suggestions and will put them on our list of things to do.
>>> Unfortunately, some of the items that takes a lot of programming
>>> effort might take a long time to come about.
>>>
>>> Regards,
>>>
>>> John
>>>
>>>
>>> On 10/1/2009 11:45 PM, Jan Steemann wrote:
>>>> Hi,
>>>>
>>>> I have been looking into Fastbit's support of aggregate functions and I 
>>>> have a few questions on that (or probably they turn out to be all feature 
>>>> requests):
>>>>
>>>> 1. is there currently any way to count the number of distinct values in a 
>>>> group?
>>>> In some SQL products, I'd issue something like
>>>> SELECT event_id, /* id of the event */
>>>>        COUNT(*) /* how many times did the event occur */,
>>>>        COUNT(DISTINCT(item_id)) /* how many distinct items where affected 
>>>> by the event */
>>>> FROM table
>>>> GROUP BY event_id;
>>>>
>>>> I have peeked at the ibis source code and did not find any equivalent for 
>>>> a COUNT(DISTINCT()) or a DISTINCT.
>>> Currently, there are some support for count function, however, there
>>> is some inconsistency in the display of the information.  If you do
>>> need to know the how many entries there are in a group, the easiest
>>> thing to do is to look at the last column generated by the function
>>> ibis::table::select, which is currently always COUNT(*).
>>>
>>> We don't current support the keyword DISTINCT.  Because our target
>>> applications do not currently use it, it might take a while for this
>>> to be moved close to the top of our to-do list.  With that said, we
>>> are always looking for compelling applications.  If you have a good
>>> use-case, we might be able to make a case for moving it higher on our
>>> to-do list.
>>>
>>>> 2. Native Fastbit support for more aggregate functions like STD(), VAR() 
>>>> would be great.
>>> Yes, it would be nice to have these functions natively supported.
>>> This is mostly a convenience issue.
>>>
>>>> 3. Conditional logic for aggregate functions would be absolutely great to 
>>>> create cross-breaks.
>>>> In some SQL products, this would look like
>>>> SELECT event_id, SUM(IF(event_type = 1, 1, 0)), SUM(IF(event_type = 2, 1, 
>>>> 0))
>>>> FROM table
>>>> GROUP BY event_id
>>> This can potentially be broken into a number of queries as follows
>>>
>>> select event_id, count(*) from table where event_type = 1;
>>> select event_id, count(*) from table where event_type = 2;
>>> select event_id, count(*) from table where event_type = 3;
>>> ...
>>>
>>>
>>>> 4. Finally, HAVING clauses operating on the results of grouped values 
>>>> would be a nice-to-have extension to Fastbit though not absolutely 
>>>> necessary (can be implemented outside of Fastbit).
>>> The HAVING clause is generally a shorthand for nested queries, most of
>>> which can be implemented as nested queries as follows in FastBit.
>>>
>>> select store_name, sum(sales) from store_table group by store_name
>>> having sum(sales) >1500;
>>>
>>> ibis::table* result1 = store_table.select("store_name, sum(sales) as
>>> total", "sales > 0"); // need a dummy where clause here
>>> ibis:;table* result2 = result1->select("store_name, total", "total >
>>> 1500");
>>>
>>> If you do have a chance to try this, please let us know if you
>>> encounters any problems.
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] Fastbit aggregate functions?

Reply via email to