Bryan,

Storing many low-record-count files is not what I am concerned about so much
as storing many high-record-count files. In the example
(http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture#example),
we are talking about 2M MapFiles to hold the data for one column of a
particular table. Now those files have to have some HDFS name, so does that
mean they need to live in an HDFS directory, or can HDFS files live outside
the HDFS directory space even if they are to be persistent? (This is an area
of HDFS I haven't explored much, so I don't know whether we have to do
something to make it work.) Certainly you could not put 2M files in a Unix
directory and expect it to work. I'm just trying to understand whether there
are similar limitations in Hadoop.

Any light you could shed on the matter would be greatly appreciated.

Thanks.

-Jim


On 2/2/07 12:21 PM, "Bryan A. P. Pendleton" <[EMAIL PROTECTED]> wrote:

> We have a cluster of about 40 nodes, with about 14TB of aggregate raw
> storage. At peak times, I have had up to 3 or 4 terabytes of data in HDFS,
> stored in probably 100-200k files.
> 
> To make things work for my tasks, I had to hash through a few different
> tricks for dealing with large sets of data - not all of the tools you might
> like for combining different sequential streams of data are available in
> Hadoop. In particular, running MapReduce processes to re-key or variously
> mix sequential inputs for further processing can be problematic when your
> dataset is already taxing your storage. If you read through the history of
> this list, you'll see that I'm often agitating about bugs in handling
> low-disk-space conditions, storage balancing, and problems related to
> numbers of simultaneous open files.
> 
> I haven't generally run into files-per-directory problems, because I
> introduce my data into SequenceFile or MapFile formats as soon as possible,
> then do work across segments of that. Storing individual low-record-count
> files in HDFS is definitely a no-no given the current limits of the system.
> 
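
(For concreteness, the kind of packing Bryan describes might look roughly like
the sketch below: many small records appended to a single SequenceFile rather
than written out as individual HDFS files. This is not his actual code; the
class name, output path, and record contents are invented, and only the stock
Hadoop SequenceFile API is assumed.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // Illustration only: pack many small records into one SequenceFile instead
  // of writing one HDFS file per record. Path and record contents are made up.
  public class PackRecords {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path out = new Path("/data/packed/part-00000");  // hypothetical location

      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
      try {
        // Each (key, value) pair would otherwise have been its own tiny file.
        for (int i = 0; i < 1000; i++) {
          writer.append(new Text("row-" + i), new Text("payload-" + i));
        }
      } finally {
        writer.close();
      }
    }
  }
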
> Feel free to write me off-list if you want to know more particulars of how
> I've been using the system.
> 
> On 2/2/07, Jim Kellerman <[EMAIL PROTECTED]> wrote:
>> 
>> I am part of a working group that is developing a Bigtable-like structured
>> storage system for Hadoop HDFS (see
>> http://wiki.apache.org/lucene-hadoop/Hbase).
>> 
>> I am interested in learning about large HDFS installations:
>> 
>> - How many nodes do you have in a cluster?
>> 
>> - How much data do you store in HDFS?
>> 
>> - How many files do you have in HDFS?
>> 
>> - Have you run into any limitations that have prevented you from growing
>>   your application?
>> 
>> - Are there limitations in how many files you can put in a single
>> directory?
>> 
>>   Google's GFS, for example, does not really implement directories per se,
>>   so it does not suffer from performance problems related to having too
>>   many files in a directory, as traditional file systems do.
>> 
>> The largest system I know about has about 1.5M files and about 150GB of
>> data. If anyone has a larger system in use, I'd really like to hear from
>> you. Were there particular obstacles you had to overcome in growing your
>> system to that size?
>> 
>> Thanks in advance.
>> --
>> Jim Kellerman, Senior Engineer; Powerset                [EMAIL PROTECTED]
>> 
> 
> 

-- 
Jim Kellerman, Senior Engineer; Powerset                [EMAIL PROTECTED]

