If you have a table with something like a billion rows and run an aggregate function on it from the shell, you will end up reading all billion rows through a single machine, essentially aggregating the entire dataset locally. This defeats the purpose of having a massively distributed database like HBase. To do this more efficiently, you'd ideally kick off a MapReduce job that performs the aggregation functions on the dataset in parallel, harnessing the power of the distributed dataset, and then returns the results to a central location once they are calculated.
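
To make the parallel approach concrete, here is a minimal sketch in Python rather than an actual Hadoop job. It assumes the table is already split into non-empty partitions of numeric values; the function names (`partial_aggregate`, `merge`, `parallel_aggregate`) are hypothetical, and a thread pool stands in for the machines that would each scan their own region. The point is only that each partition is reduced to a tiny summary where it lives, and just those summaries travel to the central location.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partial_aggregate(values):
    """Map step: summarize one partition as (count, total, minimum, maximum)."""
    return (len(values), sum(values), min(values), max(values))


def merge(a, b):
    """Reduce step: combine two partial summaries into one."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))


def parallel_aggregate(partitions):
    """Summarize each partition concurrently, then merge the small results.

    Assumes at least one partition contains data; empty partitions are skipped.
    """
    non_empty = [p for p in partitions if p]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_aggregate, non_empty))
    count, total, lowest, highest = reduce(merge, partials)
    return {
        "count": count,
        "sum": total,
        "min": lowest,
        "max": highest,
        "avg": total / count,
    }
```

Note that the average is derived from the merged sum and count rather than by averaging per-partition averages, which would give the wrong answer when partitions differ in size.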

I think putting this option into the shell is risky, because it will encourage people to think that the shell is a good way to interact with HBase in general, which it isn't. We want people to understand that HBase is best consumed in parallel, and to discourage solutions that funnel access through a single point. As such, we shouldn't build features that let people inadvertently use the wrong access patterns.

On Dec 5, 2007, at 3:38 PM, Edward Yoon (JIRA) wrote:


[ https://issues.apache.org/jira/browse/HADOOP-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548879 ]

Edward Yoon commented on HADOOP-2006:
-------------------------------------

I don't understand your comment.
Could you please explain it in more detail?

Aggregate Functions in select statement
---------------------------------------

                Key: HADOOP-2006
                URL: https://issues.apache.org/jira/browse/HADOOP-2006
            Project: Hadoop
         Issue Type: Sub-task
         Components: contrib/hbase
   Affects Versions: 0.14.1
           Reporter: Edward Yoon
           Assignee: Edward Yoon
           Priority: Minor
            Fix For: 0.16.0


Aggregation functions on collections of data values: average, minimum, maximum, sum, count. Group rows by the value of a column family and apply the aggregate function independently to each group of rows.
 * <Grouping columnfamilies>  ƒ ~function_list~ (Relation)
{code}
select producer, avg(year) from movieLog_table group by producer
{code}
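
As a rough illustration of what the query above is meant to compute, here is a small Python sketch (not HBase shell code). It assumes rows are plain dictionaries; the helper name `avg_by_group` and the sample data in the usage note are hypothetical.

```python
from collections import defaultdict


def avg_by_group(rows, group_key, value_key):
    """Group rows by group_key and average value_key within each group."""
    totals = defaultdict(lambda: [0, 0])  # group value -> [sum, count]
    for row in rows:
        acc = totals[row[group_key]]
        acc[0] += row[value_key]
        acc[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}
```

For example, calling `avg_by_group(rows, "producer", "year")` on a list of movie rows would return one average year per producer.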
