Brian (and others):

Great info...thanks!

> I would suggest looking at Cloudera's blog posting about the "small
> files problem":
> 
> http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/

Good link...muchas gracias.

> The simplest thing you could do is to use the Hadoop ARchive format
> (HAR) in a pre-processing step.  The best thing you could do is to have
> a pre-processing step based on sequence file (note: either Oozie or
> Cascading are great workflow systems to help you out).

That doesn't work, since some of our files are updated every night.

> When you say "update" nightly, do you mean "add new files" or "update
> existing files"?  If you really mean changing existing files, HBase
> might be good for you -

We have to change existing files...and add some new ones as well.  So
HAR won't really cut it for us.

> http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture.  HBase natively
> includes the concept of a timestamp, so you will be able to run against
> the "latest version" or be able to specify a fixed version (in case
> you want to repeat a night's analysis).

HBase sounds like the best approach for us, having digested the replies and 
also read the Cloudera link.
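
To check my understanding of the timestamp idea described above, here is
a toy Python model of a single versioned cell. It is only a sketch of the
concept, not the HBase client API: the class name VersionedCell and the
put/get methods with an as_of argument are made up for illustration, and
real HBase limits how many versions it retains.

```python
import bisect


class VersionedCell:
    """Toy model of one cell that keeps every timestamped version."""

    def __init__(self):
        self._timestamps = []  # kept sorted in ascending order
        self._values = []      # values aligned with self._timestamps

    def put(self, timestamp, value):
        """Store a value at a timestamp, overwriting an equal timestamp."""
        i = bisect.bisect_left(self._timestamps, timestamp)
        if i < len(self._timestamps) and self._timestamps[i] == timestamp:
            self._values[i] = value
        else:
            self._timestamps.insert(i, timestamp)
            self._values.insert(i, value)

    def get(self, as_of=None):
        """Return the newest value, or the newest one at or before as_of."""
        if as_of is None:
            i = len(self._timestamps)
        else:
            i = bisect.bisect_right(self._timestamps, as_of)
        return self._values[i - 1] if i > 0 else None
```

With this model, a nightly update is just another put with a newer
timestamp, and repeating an earlier night's analysis means reading with
as_of set to that night's timestamp.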

The question this raises is how to deploy Hadoop and HBase together?

What I would like to do is to have Hadoop data nodes also running HBase
regionservers on the same machine, if this is feasible.  That way, when
a Map/Reduce job runs, the data it accesses should be local to the
machine (e.g. a local HBase region), at least in theory, right?

Is this doable/advisable?  Anyone done this before...that is, having a
Hadoop/HDFS data node running on the same machine as an HBase
regionserver, where a mapred job running on the Hadoop node will access
local data on that machine?  Config advice to do this?
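
To make the locality question concrete, here is a small Python sketch of
what I mean by "local". It is a toy model only, not Hadoop's scheduler or
any HBase API: the function names, the region/host dictionaries, and the
greedy local-first placement are all assumptions for illustration.

```python
def schedule_local_first(region_hosts, free_slots):
    """Place one map task per region, preferring the region's own host.

    region_hosts: region name -> host whose regionserver serves it
    free_slots:   host name -> number of free map slots on that host
    Returns:      region name -> host chosen to run that region's task
    """
    slots = dict(free_slots)  # copy so the caller's dict is untouched
    placement = {}
    pending = []
    # First pass: run each task on the host that serves its region.
    for region, host in sorted(region_hosts.items()):
        if slots.get(host, 0) > 0:
            placement[region] = host
            slots[host] -= 1
        else:
            pending.append(region)
    # Second pass: leftovers go wherever a slot remains (remote reads).
    for region in pending:
        for host in sorted(slots):
            if slots[host] > 0:
                placement[region] = host
                slots[host] -= 1
                break
    return placement


def local_fraction(region_hosts, placement):
    """Fraction of regions whose task runs on the host serving the region."""
    if not region_hosts:
        return 0.0
    local = sum(1 for r, h in region_hosts.items() if placement.get(r) == h)
    return local / len(region_hosts)
```

In this model, co-locating regionservers and task slots on the same
machines is what makes the first pass possible at all; whenever a host
runs out of slots, the remaining tasks read their region remotely.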

Thx!

> If you have separate archival storage for your data, you can also
> consider reloading every night; this might be scalable up to a few TB.

I would rather avoid that...since the data sources are remote in many
cases, so aggregating a whole transmission every night is probably not
workable.


-- 
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com
