[ https://issues.apache.org/jira/browse/HBASE-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181366#comment-13181366 ]
Royston Sellman commented on HBASE-5123:
----------------------------------------
Re: 5123. I have also had some time to think about other aggregation functions.
(Please be aware that I am new to HBase, Coprocessors, and the Aggregation
Protocol, and I have little knowledge of distributed numerical algorithms!) It
seems to me the pattern in the Aggregation Protocol (AP) is to return a SINGLE
value from a SINGLE column (CF:CQ) of a table. In the future one might wish to
extend AP to return MULTIPLE values from MULTIPLE columns, so it is good to
keep this in mind for the SINGLE value/SINGLE column (SVSC) case.
So, common SVSC aggregation functions:
currently supported:
  min
  max
  sum
  count
  avg (arithmetic mean)
  std
not currently supported:
  median
  mode
  quantile/ntile
  mult/product
Ideally these would apply to column values of any numeric type and return
values of that type. Current support is only for the Long type.
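A rough sketch of why the supported functions fit the coprocessor model: each
region can compute a small partial result and the client can merge the
partials. Median, mode and quantiles do not merge this way (the median of
per-region medians is not, in general, the median of the column), which is part
of what makes them harder. The class and method names below are hypothetical
and are not part of the HBase AggregationClient API; std here is assumed to
mean the population standard deviation.

```java
// Hypothetical illustration only; not HBase code.
final class PartialAggregate {
    final long count;
    final long sum;
    final long sumOfSquares; // may overflow for large values; fine for a sketch
    final long min;
    final long max;

    PartialAggregate(long count, long sum, long sumOfSquares, long min, long max) {
        this.count = count;
        this.sum = sum;
        this.sumOfSquares = sumOfSquares;
        this.min = min;
        this.max = max;
    }

    // What a single region would compute over its own slice of the column.
    static PartialAggregate of(long[] values) {
        long sum = 0, squares = 0, min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long v : values) {
            sum += v;
            squares += v * v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new PartialAggregate(values.length, sum, squares, min, max);
    }

    // What the client would do with two per-region results.
    PartialAggregate merge(PartialAggregate other) {
        return new PartialAggregate(
                count + other.count,
                sum + other.sum,
                sumOfSquares + other.sumOfSquares,
                Math.min(min, other.min),
                Math.max(max, other.max));
    }

    double avg() {
        return (double) sum / count;
    }

    // Population standard deviation: sqrt(E[x^2] - E[x]^2).
    double std() {
        double mean = avg();
        return Math.sqrt((double) sumOfSquares / count - mean * mean);
    }

    public static void main(String[] args) {
        PartialAggregate regionA = PartialAggregate.of(new long[] {1, 2, 3});
        PartialAggregate regionB = PartialAggregate.of(new long[] {4, 5});
        PartialAggregate all = regionA.merge(regionB);
        System.out.println(all.sum + " " + all.min + " " + all.max + " " + all.avg());
    }
}
```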
Some thoughts on future possibilities:
An example of a future SINGLE value MULTIPLE column use case could be weighted
versions of the above functions, i.e. a column of weights is applied to the
column of values and the new aggregate is then derived from the weighted values.
(note: there is a very good description of Weighted Median in the R language
documentation:
http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
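To make the weighted case concrete, here is a minimal single-machine sketch of
a weighted median over a values column and a weights column. It assumes the
"lower" weighted median with no interpolation between ties (the first value, in
sorted order, at which the cumulative weight reaches half of the total), which
may differ from the options described in the R documentation. The class and
method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; not HBase code.
final class WeightedMedian {

    static double weightedMedian(double[] values, double[] weights) {
        if (values.length == 0 || values.length != weights.length) {
            throw new IllegalArgumentException("need equal-length, non-empty columns");
        }
        double total = 0.0;
        for (double w : weights) {
            if (w < 0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            total += w;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("total weight must be positive");
        }

        // Sort row indices by value so each weight stays paired with its value.
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        // Walk up the sorted values until half of the total weight is covered.
        double cumulative = 0.0;
        for (int idx : order) {
            cumulative += weights[idx];
            if (cumulative >= total / 2.0) {
                return values[idx];
            }
        }
        // Only reachable through floating-point rounding; fall back to the largest value.
        return values[order[order.length - 1]];
    }

    public static void main(String[] args) {
        double[] values = {30, 10, 20};
        double[] weights = {5, 1, 1};
        System.out.println(weightedMedian(values, weights));
    }
}
```

Note that this needs the whole column sorted in one place, so unlike sum or max
it does not reduce to a simple merge of per-region results.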
An example of a future MULTIPLE value SINGLE column case could be range: return
all rows with a column value between two values. Maybe this is a bad example
because there could be better HBase ways to do it with filters/scans at a
higher level. Perhaps binning is a better example, i.e. return an array
containing the values derived from applying one of the SVSC functions to each
bin of a binned column. For example, with a hypothetical extra "bins" argument
to sum:
    int bins = 100;
    aClient.sum(table, ci, scan, bins); // => {12.3, 14.5...}
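A minimal local sketch of what a binned sum could compute. The comment above
does not say how rows are assigned to bins, so this assumes the simplest
reading: the column, in scan order, is cut into "bins" consecutive runs of
roughly equal length and each run is summed. The class and method names are
hypothetical, and nothing here touches the real AggregationClient.

```java
import java.util.Arrays;

// Hypothetical illustration only; not HBase code.
final class BinnedSum {

    // Sums the column in `bins` consecutive, roughly equal-sized runs of rows.
    static double[] binnedSum(double[] column, int bins) {
        if (bins <= 0) {
            throw new IllegalArgumentException("bins must be positive");
        }
        double[] sums = new double[bins];
        for (int i = 0; i < column.length; i++) {
            // Map row position i in [0, n) to a bin index in [0, bins).
            int bin = (int) ((long) i * bins / column.length);
            sums[bin] += column[i];
        }
        return sums;
    }

    public static void main(String[] args) {
        double[] column = {1, 2, 3, 4, 5, 6};
        System.out.println(Arrays.toString(binnedSum(column, 3)));
    }
}
```

Binning by value range (a histogram) rather than by row position would be an
equally plausible reading; only the bin-index line would change.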
Another example (common in several programming languages) is to map an
arbitrary function over a column and return the new vector. Of course, again
this may be a bad example in the case of long HBase columns but it seems like
an appropriate thing to do with coprocessors.
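As a sketch of the map idea, assuming the column has already been decoded to
doubles: the function is passed in and applied to each value, and the result
has the same length as the input. The names are hypothetical; in a coprocessor
setting one would presumably apply the function per region and concatenate the
per-region results in row order, which this local version does not attempt.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

// Hypothetical illustration only; not HBase code.
final class ColumnMap {

    // Applies f to every value of the column and returns the new vector.
    static double[] map(double[] column, DoubleUnaryOperator f) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = f.applyAsDouble(column[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] squared = map(new double[] {1, 2, 3}, x -> x * x);
        System.out.println(Arrays.toString(squared));
    }
}
```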
MULTIPLE value MULTIPLE column examples are common in spatial data processing
but I see there has been a lot of spatial/GIS discussion around HBase which I
have not read yet. So I'll keep quiet for now.
I hope these thoughts strike a balance between my (special interest) use case
of statistical/spatial functions on tables and general purpose (but coprocessor
enabled/regionserver distributed) HBase.
> Provide more aggregate functions for Aggregations Protocol
> ----------------------------------------------------------
>
> Key: HBASE-5123
> URL: https://issues.apache.org/jira/browse/HBASE-5123
> Project: HBase
> Issue Type: Improvement
> Reporter: Zhihong Yu
>
> Royston requested the following aggregates on top of what we already have:
> Median, Weighted Median, Mult
> See discussion entitled 'AggregateProtocol Help' on user list