Re: Large data sets

Bryan A. P. Pendleton Fri, 02 Feb 2007 12:21:45 -0800

We have a cluster of about 40 nodes, with about 14 TB of aggregate raw
storage. At peak times, I have had 3 or 4 terabytes of data in HDFS,
spread across probably 100-200k files.


To make things work for my tasks, I had to work out a few different
tricks for dealing with large sets of data, since not all of the tools you
might like for combining different sequential streams of data in Hadoop
exist yet. In particular, running MapReduce processes to re-key or variously
mix sequential inputs for further processing can be problematic when your
dataset is already taxing your storage. If you read through the history of
this list, you'll see that I'm often agitating about bugs in handling
low-disk-space conditions, storage balancing, and problems related to
numbers of simultaneous open files.

I haven't generally run into files-per-directory problems, because I
convert my data into SequenceFile or MapFile format as soon as possible,
then do work across segments of that. Storing many individual files that
each hold only a few records in HDFS is definitely a no-no given the
current limits of the system.
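
As a rough illustration of the pack-early idea above, here is a minimal
sketch in Python. It is not the SequenceFile format and does not use the
Hadoop API. The length-prefixed record layout and the function names
(write_records, read_records, pack_small_files) are assumptions made up
for this example. It only shows how many small files can become keyed
records inside one sequential container.

```python
import io
import struct

# Hypothetical record layout for this sketch only (NOT SequenceFile):
# 4-byte big-endian key length, 4-byte big-endian value length, key, value.
HEADER = ">II"
HEADER_SIZE = struct.calcsize(HEADER)


def write_records(stream, records):
    """Append (key, value) byte pairs to stream as length-prefixed records."""
    for key, value in records:
        stream.write(struct.pack(HEADER, len(key), len(value)))
        stream.write(key)
        stream.write(value)


def read_records(stream):
    """Yield (key, value) pairs in the order they were written."""
    while True:
        header = stream.read(HEADER_SIZE)
        if not header:
            return  # clean end of container
        if len(header) < HEADER_SIZE:
            raise ValueError("truncated record header")
        key_len, value_len = struct.unpack(HEADER, header)
        key = stream.read(key_len)
        value = stream.read(value_len)
        if len(key) < key_len or len(value) < value_len:
            raise ValueError("truncated record body")
        yield key, value


def pack_small_files(small_files):
    """Pack (name, contents) byte pairs into one container, returned as bytes."""
    buf = io.BytesIO()
    write_records(buf, small_files)
    return buf.getvalue()
```

With this layout, one container holds what would otherwise be many tiny
files, and a reader scans it sequentially instead of opening each file.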

Feel free to write me off-list if you want to know more particulars of how
I've been using the system.

On 2/2/07, Jim Kellerman <[EMAIL PROTECTED]> wrote:


I am part of a working group that is developing a Bigtable-like structured
storage system for Hadoop HDFS (see
http://wiki.apache.org/lucene-hadoop/Hbase).

I am interested in learning about large HDFS installations:

- How many nodes do you have in a cluster?

- How much data do you store in HDFS?

- How many files do you have in HDFS?

- Have you run into any limitations that have prevented you from growing
  your application?

- Are there limitations in how many files you can put in a single
  directory?

  Google's GFS, for example, does not really implement directories per se,
  so it does not suffer from performance problems related to having too
  many files in a directory as traditional file systems do.

The largest system I know about has about 1.5M files and about 150GB of
data. If anyone has a larger system in use, I'd really like to hear from
you. Were there particular obstacles you had in growing your system to
that size, etc.?

Thanks in advance.
--
Jim Kellerman, Senior Engineer; Powerset                [EMAIL PROTECTED]




--
Bryan A. P. Pendleton
Ph: (877) geek-1-bp
