The URL for the example got mangled by my email client. Here is the correct URL:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture#example

On 2/5/07 2:44 PM, "Jim Kellerman" <[EMAIL PROTECTED]> wrote:

> Bryan,
>
> Storing many low-record-count files is not what I am concerned about so much
> as storing many high-record-count files. In the example
> (http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture#example),
> we are talking about 2M MapFiles to hold the data for one column of a
> particular table. Now, they have to have some HDFS name, so does that mean
> they need to live in an HDFS directory, or can HDFS files live outside the
> HDFS directory space even if they are to be persistent? (This is an area of
> HDFS I haven't explored much, so I don't know if we have to do something to
> make it work.) Certainly you could not put 2M files in a Unix directory and
> expect it to work. I'm just trying to understand whether there are similar
> limitations in Hadoop.
>
> Any light you could shed on the matter would be greatly appreciated.
>
> Thanks.
>
> -Jim
>
>
> On 2/2/07 12:21 PM, "Bryan A. P. Pendleton" <[EMAIL PROTECTED]> wrote:
>
>> We have a cluster of about 40 nodes, with about 14TB of aggregate raw
>> storage. At peak times, I have had up to 3 or 4 terabytes of data stored
>> in HDFS, in probably 100-200k files.
>>
>> To make things work for my tasks, I had to work through a few different
>> tricks for dealing with large sets of data - not all of the tools you might
>> like for combining different sequential streams of data in Hadoop are
>> around. In particular, running MapReduce processes to re-key or variously
>> mix sequential inputs for further processing can be problematic when your
>> dataset is already taxing your storage. If you read through the history of
>> this list, you'll see that I'm often agitating about bugs in handling
>> low-disk-space conditions, storage balancing, and problems related to the
>> number of simultaneously open files.
>>
>> I haven't generally run into files-per-directory problems, because I
>> introduce my data into SequenceFile or MapFile formats as soon as possible,
>> then do work across segments of that. Storing individual low-record-count
>> files in HDFS is definitely a no-no given the current limits of the system.
>>
>> Feel free to write me off-list if you want to know more particulars of how
>> I've been using the system.
>>
>> On 2/2/07, Jim Kellerman <[EMAIL PROTECTED]> wrote:
>>>
>>> I am part of a working group that is developing a Bigtable-like structured
>>> storage system for Hadoop HDFS (see
>>> http://wiki.apache.org/lucene-hadoop/Hbase).
>>>
>>> I am interested in learning about large HDFS installations:
>>>
>>> - How many nodes do you have in a cluster?
>>>
>>> - How much data do you store in HDFS?
>>>
>>> - How many files do you have in HDFS?
>>>
>>> - Have you run into any limitations that have prevented you from growing
>>>   your application?
>>>
>>> - Are there limitations on how many files you can put in a single
>>>   directory?
>>>
>>> Google's GFS, for example, does not really implement directories per se,
>>> so it does not suffer from the performance problems that traditional file
>>> systems have when a directory holds too many files.
>>>
>>> The largest system I know about has about 1.5M files and about 150GB of
>>> data. If anyone has a larger system in use, I'd really like to hear from
>>> you. Were there particular obstacles you had in growing your system to
>>> that size, etc.?
>>>
>>> Thanks in advance.
>>> --
>>> Jim Kellerman, Senior Engineer; Powerset
>>> [EMAIL PROTECTED]

--
Jim Kellerman, Senior Engineer; Powerset
[EMAIL PROTECTED]
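A minimal sketch of the pattern Bryan describes above: rather than storing many small HDFS files, records are packed into a single MapFile and looked up by key. This is not code from the thread; it assumes the org.apache.hadoop.io.MapFile API, and the path and key/value contents are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class PackRecords {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The MapFile lives under a single HDFS path of our choosing
        // (hypothetical path).
        String dirName = "/user/demo/records.map";

        // Keys must be appended in sorted order for a MapFile.
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, dirName, Text.class, Text.class);
        try {
            writer.append(new Text("row-00001"), new Text("value one"));
            writer.append(new Text("row-00002"), new Text("value two"));
        } finally {
            writer.close();
        }

        // Later, a single record can be fetched by key without scanning
        // the whole file.
        MapFile.Reader reader = new MapFile.Reader(fs, dirName, conf);
        try {
            Text value = new Text();
            reader.get(new Text("row-00002"), value);
            System.out.println(value);
        } finally {
            reader.close();
        }
    }
}

Note that a MapFile is itself stored as one HDFS directory containing a "data" and an "index" file, so each one still occupies entries in the HDFS namespace, which is why 2M of them raises the directory-scaling question discussed above.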
