Hi Zachary,

HBase is rolling out Coprocessors in the 0.92 release, and those could be
used for more real-time computation with smaller files (e.g., HBase rows are
typically a few KB, up to 10 MB in practice). Coprocessors allow you to
associate code with table regions in HBase, so you can scan region data on
startup and receive a stream of all get/put requests to the region to
maintain per-region analytics.

Here's a blog post: http://hbaseblog.com/2010/11/30/hbase-coprocessors/
and the associated JIRA:
https://issues.apache.org/jira/browse/HBASE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
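To give a concrete feel for it, here's a rough sketch of a region observer
against the proposed 0.92-era API (the API isn't final yet, so hook
signatures may change before release; the class name and counter are made
up for illustration):

    import java.util.concurrent.atomic.AtomicLong;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

    // Hypothetical per-region analytics: count every Put the region sees.
    public class PutCounterObserver extends BaseRegionObserver {
      private final AtomicLong putCount = new AtomicLong();

      @Override
      public void postOpen(ObserverContext<RegionCoprocessorEnvironment> e) {
        // The region just opened; this is also the hook where you could
        // scan existing region data to seed your analytics.
        putCount.set(0);
      }

      @Override
      public void postPut(ObserverContext<RegionCoprocessorEnvironment> e,
                          Put put, WALEdit edit, boolean writeToWAL) {
        // Invoked after each successful Put to this region.
        putCount.incrementAndGet();
      }

      public long getPutCount() {
        return putCount.get();
      }
    }

You'd load something like this per-table via the table descriptor, or
region-server-wide with the hbase.coprocessor.region.classes property in
hbase-site.xml.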
You can also check out Yahoo!'s S4 project, but that's more about performing
analytics on continuous, unbounded streams of data than processing a large
number of small files: http://s4.io/

Best,
Mike

On Tue, Feb 1, 2011 at 1:48 PM, Russ Ferriday <russ.ferri...@gmail.com> wrote:

> Hi Zachary,
>
> Have you heard of Cassandra?
> You may be able to write processing nodes that access data in Cassandra.
> Probably the easiest configuration is to run your processing functions and
> a Cassandra node on each machine. Then as you expand your computing
> cluster, you also expand your Cassandra bandwidth.
> This is not optimal, but very practical for a small project/small team.
> --r
>
> On Tue, Feb 1, 2011 at 1:21 PM, Zachary Kozick <z...@omniar.com> wrote:
>
>> Hi all,
>>
>> I'm interested in creating a solution that leverages multiple computing
>> nodes in an EC2 or Rackspace cloud environment in order to do massively
>> parallelized processing in the context of serving HTTP requests, meaning
>> I want results to be aggregated within 1-4 seconds.
>>
>> From what I gather, Hadoop is designed for job-oriented tasks, and the
>> minimum job completion time is 30 seconds. Also, HDFS is meant for
>> storing a few large files, as opposed to many small files.
>>
>> My question: is there a framework similar to Hadoop that is designed more
>> for on-demand parallel computing? What about a technology similar to HDFS
>> that is better at moving around small files and making them available to
>> slave nodes on demand?
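Re: Russ's Cassandra suggestion above, a minimal sketch of the co-located
pattern might look like this, assuming the Hector client; the keyspace,
column family, and process() function here are hypothetical:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.ColumnQuery;

    public class LocalReader {
      public static void main(String[] args) {
        StringSerializer ss = StringSerializer.get();
        // Connect to the Cassandra node on this same machine, so reads
        // can be served locally whenever the local replica owns the row.
        Cluster cluster =
            HFactory.getOrCreateCluster("app-cluster", "localhost:9160");
        Keyspace ks = HFactory.createKeyspace("AppKeyspace", cluster);

        ColumnQuery<String, String, String> q =
            HFactory.createColumnQuery(ks, ss, ss, ss);
        q.setColumnFamily("Files").setKey("file-0001").setName("payload");
        HColumn<String, String> col = q.execute().get();
        if (col != null) {
          process(col.getValue());
        }
      }

      // Stand-in for whatever processing function runs on this node.
      private static void process(String payload) {
        System.out.println("processing " + payload.length() + " bytes");
      }
    }

Connecting to localhost just means the local node coordinates the read: if
it holds a replica of the row it can answer locally, otherwise it forwards
the request to a node that does.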