[
https://issues.apache.org/jira/browse/HBASE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Himanshu Vashishtha updated HBASE-1512:
---------------------------------------
Attachment: patch-1512.txt
A patch for the initial aggregate functions. It covers max, min, sum, avg,
rowcount, and std. Please suggest further improvements.
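For reference, here is a minimal sketch of how per-region partial results for
these functions could be combined on the client. It is not taken from
patch-1512.txt: the class PartialAgg and all of its members are hypothetical
names, and it assumes each region returns only a count, a sum, a sum of
squares, a min, and a max over numeric cell values.

```java
/** Hypothetical per-region partial aggregate; NOT the API of patch-1512.txt. */
final class PartialAgg {
    long count;
    double sum;
    double sumSq;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;

    /** Server side: fold one cell value into this region's partial result. */
    void add(double v) {
        count++;
        sum += v;
        sumSq += v * v;
        min = Math.min(min, v);
        max = Math.max(max, v);
    }

    /** Client side: combine another region's partial result into this one. */
    void merge(PartialAgg other) {
        count += other.count;
        sum += other.sum;
        sumSq += other.sumSq;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
    }

    /** Average over all merged rows; NaN when no rows were seen. */
    double avg() {
        return count == 0 ? Double.NaN : sum / count;
    }

    /** Population standard deviation derived from the running sums. */
    double std() {
        if (count == 0) {
            return Double.NaN;
        }
        double mean = sum / count;
        // Clamp at zero to guard against tiny negative values from rounding.
        return Math.sqrt(Math.max(0.0, sumSq / count - mean * mean));
    }

    /** Convenience builder that simulates one region's scan over some values. */
    static PartialAgg of(double... values) {
        PartialAgg agg = new PartialAgg();
        for (double v : values) {
            agg.add(v);
        }
        return agg;
    }
}
```

The point of carrying sums rather than finished averages is that avg and std
cannot be merged directly across regions, while count, sum, and sum of squares
can.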
I am now looking at Top K and group-by style queries. Gary suggested a
scanner-like facility for coprocessors for such queries, to reduce IPC, which
seems very relevant.
There are a few questions here. Should the proposed scanner run over a
precomputed result set, with its only purpose being to keep IPC under control
by sending a fixed number of rows per call (a cache limit set by the client),
OR
should it process a fixed number of raw rows from the table (in the default
access order) and send its results on the fly (processing here means executing
the coprocessor code)?
The current scanner functionality that a client uses (RegionScanner) registers
itself at the region server level and keeps its state there. Calling next() from
the client (HTable) results in invoking next() on the registered scanner. So it
uses the second option, as it navigates the table itself.
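To make the second option concrete, here is a rough sketch of a stateful
scanner that consumes a fixed number of raw rows per call and returns only the
partial result for that batch. It is a plain-Java illustration, not HBase code:
BatchedAggScanner, rowsPerCall, and the use of long values in place of table
rows are all assumptions made for the example.

```java
import java.util.Iterator;

/** Hypothetical server-side scanner state; NOT an HBase class. */
final class BatchedAggScanner {
    private final Iterator<Long> rawRows; // stands in for the region's rows
    private final int rowsPerCall;        // raw rows processed per next() call

    BatchedAggScanner(Iterable<Long> rows, int rowsPerCall) {
        if (rowsPerCall <= 0) {
            throw new IllegalArgumentException("rowsPerCall must be positive");
        }
        this.rawRows = rows.iterator();
        this.rowsPerCall = rowsPerCall;
    }

    /** True while there are raw rows left to process. */
    boolean hasMore() {
        return rawRows.hasNext();
    }

    /**
     * Processes up to rowsPerCall raw rows and returns {rowCount, sum} for
     * that batch only. The scanner keeps its position between calls, so the
     * client sees one small result per round trip instead of the rows.
     */
    long[] next() {
        long count = 0;
        long sum = 0;
        while (count < rowsPerCall && rawRows.hasNext()) {
            sum += rawRows.next();
            count++;
        }
        return new long[] {count, sum};
    }
}
```

With five rows holding 1 through 5 and rowsPerCall set to 2, three calls to
next() would return the pairs (2, 3), (2, 7), and (1, 5), which the client
then adds up.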
There are some common links between coprocessors and the current scanner
implementation. For example, with a coprocessor one can intercept the result
after every call to next() (preScannerNext, postScannerNext), and a coprocessor
impl can massage the data accordingly. But this is not the purpose of ROs, as
it would break the abstraction of ROs: they would be invoked on every client
call in that case (input from Gary on IRC today; just mentioning it here for
reference).
Still, the question above stands: should the cp-scanner navigate through a
computed result set, or through raw table rows, invoking the CP impl
(essentially an EndPoint impl) as it goes?
It may be use-case specific. It needs more thought.
> Coprocessors: Support aggregate functions
> -----------------------------------------
>
> Key: HBASE-1512
> URL: https://issues.apache.org/jira/browse/HBASE-1512
> Project: HBase
> Issue Type: Sub-task
> Reporter: stack
> Attachments: 1512.zip, patch-1512.txt
>
>
> Chatting with jgray and holstad at the kitchen table about counts, sums, and
> other aggregating facility, facility generally where you want to calculate
> some meta info on your table, it seems like it wouldn't be too hard making a
> filter type that could run a function server-side and return the result ONLY
> of the aggregation or whatever.
> For example, say you just want to count rows, currently you scan, server
> returns all data to client and count is done by client counting up row keys.
> A bunch of time and resources have been wasted returning data that we're not
> interested in. With this new filter type, the counting would be done
> server-side and then it would make up a new result that was the count only
> (kinda like mysql when you ask it to count, it returns a 'table' with a count
> column whose value is count of rows). We could have it so the count was
> just done per region and return that. Or we could maybe make a small change
> in scanner too so that it aggregated the per-region counts.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.