[jira] [Commented] (HBASE-1512) Coprocessors: Support aggregate functions

Himanshu Vashishtha (JIRA) Sun, 27 Mar 2011 00:35:27 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011750#comment-13011750
 ]


Himanshu Vashishtha commented on HBASE-1512:
--------------------------------------------

Stack: Thanks for the review. 

I have revamped the patch and incorporated your suggestions. The previous 
version had a number of discrepancies around the boundary conditions you 
mentioned, because at the region level there was no knowledge of the exact 
start/stop rows given by the user. To fix this, I modified the agg function 
signatures to include the start/stop rows at the region level.

The key aspects of this version are:
a) startRow < endRow is now a required condition (except for a full table 
scan, where startRow and endRow are both empty byte arrays). This helps in 
handling the boundary condition where the start row provided by the user is 
also the start row of a region (the default scanner impl returns null because 
it is a non-get query). It is also consistent with the logic of these 
functions, which compute max, min, row count, etc.

b) For all computations such as avg, sum, max, etc., the cell value is assumed 
to be a long (8 bytes); if it is not, that cell value is skipped in the 
computation.

c) For all functions, the column family is required (if it is null, an 
IOException is returned). 
For max, min, avg, sum, and std, when no column qualifier is provided, I 
aggregate all the values in that family. So a sum in that case is the group 
sum of all the CQs' values for one row key. I think this is the right 
approach. Please advise.

d) In the case of rowcount, one can use FirstKeyValueFilter as an 
optimisation. But it may give a wrong result if the user has also provided a 
column qualifier. In that case, the first value returned by the scanner might 
belong to another qualifier: the FirstKeyValueFilter sets its flag to skip to 
the next row, yet that value is filtered out of the result set. The overall 
effect is that the row is not counted and the scanner moves on to the next 
row. So I use this filter only when there is no column qualifier. (I confirmed 
this during my testing, but it would be good to have some comments here.)

e) As suggested, I have added a bunch of boundary test cases for each of the 
six agg functions. Please let me know if more should be added.

f) Yes, it's the client (here AggregationClient) that performs the "reduce 
phase", where the individual results from all the target regions are received 
and accumulated.
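
To make the points above concrete, here is a minimal, self-contained sketch 
in plain Java. It is not the code from the patch and does not use the HBase 
API: the class AggregateSketch, the Partial holder and all method names are 
hypothetical. It assumes big-endian 8-byte longs, unsigned lexicographic row 
ordering, and that the range check is applied exactly as worded above, so a 
non-empty startRow with an empty endRow is rejected.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical illustration only; none of these names come from the patch.
final class AggregateSketch {

    /** Partial result for one region: running sum and number of values used. */
    static final class Partial {
        final long sum;
        final long count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
    }

    /** Unsigned lexicographic comparison of two row keys. */
    static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    /** Range check: both keys empty means a full table scan, else startRow < endRow. */
    static boolean isValidRange(byte[] startRow, byte[] endRow) {
        if (startRow.length == 0 && endRow.length == 0) return true;
        return compareRows(startRow, endRow) < 0;
    }

    /** Region-side pass: any cell value that is not exactly 8 bytes is skipped. */
    static Partial aggregateRegion(List<byte[]> cellValues) {
        long sum = 0;
        long count = 0;
        for (byte[] v : cellValues) {
            if (v == null || v.length != Long.BYTES) continue; // not a long: skip
            sum += ByteBuffer.wrap(v).getLong(); // ByteBuffer defaults to big-endian
            count++;
        }
        return new Partial(sum, count);
    }

    /** Client-side "reduce phase": accumulate the per-region partial results. */
    static Partial reduce(List<Partial> partials) {
        long sum = 0;
        long count = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new Partial(sum, count);
    }

    /** Average from the accumulated sum and count; NaN if nothing was counted. */
    static double average(Partial total) {
        return total.count == 0 ? Double.NaN : (double) total.sum / total.count;
    }
}
```

Carrying the count next to the sum in each partial result is what lets the 
client compute a correct overall average; averaging the per-region averages 
would weight small and large regions equally.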



> Coprocessors: Support aggregate functions
> -----------------------------------------
>
>                 Key: HBASE-1512
>                 URL: https://issues.apache.org/jira/browse/HBASE-1512
>             Project: HBase
>          Issue Type: Sub-task
>          Components: coprocessors
>            Reporter: stack
>         Attachments: 1512.zip, patch-1512.txt
>
>
> Chatting with jgray and holstad at the kitchen table about counts, sums, and 
> other aggregating facility, facility generally where you want to calculate 
> some meta info on your table, it seems like it wouldn't be too hard making a 
> filter type that could run a function server-side and return the result ONLY 
> of the aggregation or whatever.
> For example, say you just want to count rows, currently you scan, server 
> returns all data to client and count is done by client counting up row keys.  
> A bunch of time and resources have been wasted returning data that we're not 
> interested in.  With this new filter type, the counting would be done 
> server-side and then it would make up a new result that was the count only 
> (kinda like mysql when you ask it to count, it returns a 'table' with a count 
> column whose value is count of rows).   We could have it so the count was 
> just done per region and return that.  Or we could maybe make a small change 
> in scanner too so that it aggregated the per-region counts.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
