Hi Zachary,

HBase is rolling out Coprocessors in the 0.92 release, and those could be
used for more real-time computation over smaller files (e.g., HBase rows
are typically a few KB, up to 10 MB in practice). Coprocessors allow you to
associate code with table regions in HBase, so you can scan region data on
startup and receive a stream of all get/put requests to the region to
maintain per-region analytics. Here's a blog post:
http://hbaseblog.com/2010/11/30/hbase-coprocessors/ and associated JIRA:
https://issues.apache.org/jira/browse/HBASE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
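
To make that concrete, here's a rough, untested sketch of a RegionObserver
that keeps per-region get/put counters. It's based on my reading of the
0.92-era coprocessor API, so the exact hook signatures may differ, and the
class and counter names are just placeholders:

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

public class RegionStatsObserver extends BaseRegionObserver {
  private final AtomicLong gets = new AtomicLong();
  private final AtomicLong puts = new AtomicLong();

  @Override
  public void postOpen(ObserverContext<RegionCoprocessorEnvironment> e) {
    // The region has just opened; a one-time scan to seed the
    // per-region state would go here.
  }

  @Override
  public void postGet(ObserverContext<RegionCoprocessorEnvironment> e,
      Get get, List<KeyValue> results) {
    gets.incrementAndGet();  // counts every read served by this region
  }

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> e,
      Put put, WALEdit edit, boolean writeToWAL) {
    puts.incrementAndGet();  // sees every write before it is applied
  }
}

You'd then register the observer per-table via the table descriptor, or for
all regions via hbase.coprocessor.region.classes in hbase-site.xml.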

You can also check out Yahoo!'s S4 project, but that's more about
performing analytics on continuous, unbounded streams of data than about
processing a large number of small files: http://s4.io/

Best,

Mike

On Tue, Feb 1, 2011 at 1:48 PM, Russ Ferriday <russ.ferri...@gmail.com> wrote:

> Hi Zachary,
>
> Have you heard of Cassandra?
> You may be able to write processing nodes accessing data on Cassandra.
> Probably the easiest configuration is to run your processing functions and
> a Cassandra node on each machine.   Then as you expand your computing
> cluster, you also expand your Cassandra bandwidth.
> This is not optimal, but very practical for a small project/small team.
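>
> A rough sketch of that pattern (untested; it assumes the Hector Java
> client, and the cluster/keyspace/column names are just placeholders):
>
> import me.prettyprint.hector.api.Cluster;
> import me.prettyprint.hector.api.Keyspace;
> import me.prettyprint.hector.api.beans.HColumn;
> import me.prettyprint.hector.api.factory.HFactory;
> import me.prettyprint.hector.api.query.ColumnQuery;
>
> public class LocalReader {
>   public static void main(String[] args) {
>     // Talk to the Cassandra node co-located on this machine.
>     Cluster cluster =
>         HFactory.getOrCreateCluster("app-cluster", "localhost:9160");
>     Keyspace keyspace = HFactory.createKeyspace("AppKeyspace", cluster);
>
>     // Fetch one column for one key, then hand it to the local
>     // processing function.
>     ColumnQuery<String, String, String> query =
>         HFactory.createStringColumnQuery(keyspace);
>     query.setColumnFamily("Files").setKey("file-0001").setName("data");
>     HColumn<String, String> column = query.execute().get();
>     if (column != null) {
>       System.out.println("read " + column.getValue().length() + " chars");
>     }
>   }
> }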
> --r
>
>
>
> On Tue, Feb 1, 2011 at 1:21 PM, Zachary Kozick <z...@omniar.com> wrote:
>
>> Hi all,
>>
>> I'm interested in creating a solution that leverages multiple computing
>> nodes in an EC2 or Rackspace cloud environment in order to
>> do massively parallelized processing in the context of serving HTTP
>> requests, meaning I want results to be aggregated within 1-4 seconds.
>>
>> From what I gather, Hadoop is designed for job-oriented tasks, and the
>> minimum job completion time is 30 seconds.  Also, HDFS is meant for
>> storing a small number of large files, as opposed to many small files.
>>
>> My question is: is there a framework similar to Hadoop that is designed
>> more for on-demand parallel computing?  What about a technology similar
>> to HDFS that is better at moving small files around and making them
>> available to slave nodes on demand?
>>
>
>
