If you're running Hadoop-based jobs in Mahout, it certainly makes sense to keep your data on the HDFS cluster that backs the Hadoop cluster; the jobs have to be able to read it from there anyway.
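In case it's useful, here's a minimal sketch of getting a local dataset into HDFS with the stock command-line tools; the paths are just placeholders for illustration:

    # create a target directory in HDFS and copy the local data up
    hadoop fs -mkdir /user/steve/data
    hadoop fs -put /local/path/to/dataset /user/steve/data/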
So are you asking about how much to distribute the data? Replication obviously costs more storage, but it buys not only redundancy but perhaps also performance: if copies of the data are closer to the workers, reads are faster. It sounds like you have a small, local cluster, so that may not be a factor. I can tell you I replicate 1x for testing and debugging, and 3x in production as a rule. That was the norm at Google, FWIW; some key data was replicated more, but 3x was the default. (A quick sketch of setting the replication factor follows the quoted message below.)

On Sat, May 1, 2010 at 7:23 PM, Steven Bourke <steven.bou...@ucd.ie> wrote:
> I'm working with large datasets and have limited hardware resources (like
> everyone else!)
>
> I was wondering what people would recommend for storing my data in when using
> Mahout. I've roughly 100 GB of data right now, which will grow and shrink over
> time. If I distribute the storage, the maximum number of nodes I would have
> access to is three.
>
> I guess this is really a 'how long is a piece of string' question, but I would
> still appreciate people's experiences!
>
> My requirement would be speed!
>
> Steve
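As a rough sketch of the replication settings mentioned above (the paths and the 3x value are just for illustration): the cluster-wide default for new files is the dfs.replication property in hdfs-site.xml, and data already in HDFS can be changed with setrep.

    <!-- hdfs-site.xml: default replication factor for newly written files -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    # change the replication factor of files already in HDFS (recursively)
    hadoop fs -setrep -R 3 /user/steve/data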