If you're running Hadoop-based jobs in Mahout, it certainly makes sense to keep your data on the HDFS cluster that backs the Hadoop cluster; the jobs have to be able to read it from there anyway.
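In case it's useful, here's a minimal sketch of getting a local dataset into HDFS with the stock command-line tools; the paths are just placeholders for illustration:

    # create a target directory in HDFS and copy the local data up
    hadoop fs -mkdir /user/steve/data
    hadoop fs -put /local/path/to/dataset /user/steve/data/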
So are you asking about how much to distribute the data? Replication obviously costs more storage, but it buys not only redundancy but perhaps also performance: if copies of the data are closer to the workers, reads are faster. It sounds like you have a small, local cluster, so that may not be a factor. I can tell you I replicate 1x for testing and debugging, and 3x in production as a rule. That was the norm at Google, FWIW; some key data was replicated more, but 3x was the default. (A quick sketch of setting the replication factor follows the quoted message below.)

On Sat, May 1, 2010 at 7:23 PM, Steven Bourke <steven.bou...@ucd.ie> wrote:
> I'm working with large datasets and have limited hardware resources (like
> everyone else!)
>
> I was wondering what people would recommend for storing my data in when using
> Mahout. I've roughly 100 GB of data right now, which will grow and shrink over
> time. If I distribute the storage, the maximum number of nodes I would have
> access to is three.
>
> I guess this is really a 'how long is a piece of string' question, but I would
> still appreciate people's experiences!
>
> My requirement would be speed!
>
> Steve
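As a rough sketch of the replication settings mentioned above (the paths and the 3x value are just for illustration): the cluster-wide default for new files is the dfs.replication property in hdfs-site.xml, and data already in HDFS can be changed with setrep.

    <!-- hdfs-site.xml: default replication factor for newly written files -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    # change the replication factor of files already in HDFS (recursively)
    hadoop fs -setrep -R 3 /user/steve/data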