The URL for the example got mangled by my email client. Here is the correct URL:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture#example

On 2/5/07 2:44 PM, "Jim Kellerman" <[EMAIL PROTECTED]> wrote:

> Bryan,
>
> Storing many low-record-count files is not what I am concerned about so much
> as storing many high-record-count files. In the example
> (http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture#example),
> we are talking about 2M MapFiles to hold the data for one column of a
> particular table. Now, they have to have some HDFS name, so does that mean
> they need to live in an HDFS directory, or can HDFS files live outside the
> HDFS directory space even if they are to be persistent? (This is an area of
> HDFS I haven't explored much, so I don't know if we have to do something to
> make it work.) Certainly you could not put 2M files in a Unix directory and
> expect it to work. I'm just trying to understand whether there are similar
> limitations in Hadoop.
>
> Any light you could shed on the matter would be greatly appreciated.
>
> Thanks.
>
> -Jim
>
>
> On 2/2/07 12:21 PM, "Bryan A. P. Pendleton" <[EMAIL PROTECTED]> wrote:
>
>> We have a cluster of about 40 nodes, with about 14TB of aggregate raw
>> storage. At peak times, I have had up to 3 or 4 terabytes of data stored
>> in HDFS, in probably 100-200k files.
>>
>> To make things work for my tasks, I had to work through a few different
>> tricks for dealing with large sets of data - not all of the tools you might
>> like for combining different sequential streams of data in Hadoop are
>> around. In particular, running MapReduce processes to re-key or variously
>> mix sequential inputs for further processing can be problematic when your
>> dataset is already taxing your storage. If you read through the history of
>> this list, you'll see that I'm often agitating about bugs in handling
>> low-disk-space conditions, storage balancing, and problems related to the
>> number of simultaneously open files.
>>
>> I haven't generally run into files-per-directory problems, because I
>> introduce my data into SequenceFile or MapFile formats as soon as possible,
>> then do work across segments of that. Storing individual low-record-count
>> files in HDFS is definitely a no-no given the current limits of the system.
>>
>> Feel free to write me off-list if you want to know more particulars of how
>> I've been using the system.
>>
>> On 2/2/07, Jim Kellerman <[EMAIL PROTECTED]> wrote:
>>>
>>> I am part of a working group that is developing a Bigtable-like structured
>>> storage system for Hadoop HDFS (see
>>> http://wiki.apache.org/lucene-hadoop/Hbase).
>>>
>>> I am interested in learning about large HDFS installations:
>>>
>>> - How many nodes do you have in a cluster?
>>>
>>> - How much data do you store in HDFS?
>>>
>>> - How many files do you have in HDFS?
>>>
>>> - Have you run into any limitations that have prevented you from growing
>>>   your application?
>>>
>>> - Are there limitations on how many files you can put in a single
>>>   directory?
>>>
>>> Google's GFS, for example, does not really implement directories per se,
>>> so it does not suffer from the performance problems that traditional file
>>> systems have when a directory holds too many files.
>>>
>>> The largest system I know about has about 1.5M files and about 150GB of
>>> data. If anyone has a larger system in use, I'd really like to hear from
>>> you. Were there particular obstacles you had in growing your system to
>>> that size, etc.?
>>>
>>> Thanks in advance.
>>> --
>>> Jim Kellerman, Senior Engineer; Powerset
>>> [EMAIL PROTECTED]

--
Jim Kellerman, Senior Engineer; Powerset
[EMAIL PROTECTED]
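A minimal sketch of the pattern Bryan describes above: rather than storing many small HDFS files, records are packed into a single MapFile and looked up by key. This is not code from the thread; it assumes the org.apache.hadoop.io.MapFile API, and the path and key/value contents are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class PackRecords {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The MapFile lives under a single HDFS path of our choosing
        // (hypothetical path).
        String dirName = "/user/demo/records.map";

        // Keys must be appended in sorted order for a MapFile.
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, dirName, Text.class, Text.class);
        try {
            writer.append(new Text("row-00001"), new Text("value one"));
            writer.append(new Text("row-00002"), new Text("value two"));
        } finally {
            writer.close();
        }

        // Later, a single record can be fetched by key without scanning
        // the whole file.
        MapFile.Reader reader = new MapFile.Reader(fs, dirName, conf);
        try {
            Text value = new Text();
            reader.get(new Text("row-00002"), value);
            System.out.println(value);
        } finally {
            reader.close();
        }
    }
}

Note that a MapFile is itself stored as one HDFS directory containing a "data" and an "index" file, so each one still occupies entries in the HDFS namespace, which is why 2M of them raises the directory-scaling question discussed above.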
