On 09/18/2014 05:17 AM, Kerin Millar wrote:
> On 17/09/2014 21:20, Alec Ten Harmsel wrote:
>> As far as HDFS goes, I would only set that up if you will use it for
>> Hadoop or related tools. It's highly specific, and the performance is
>> not good unless you're doing a massively parallel read (what it was
>> designed for). I can elaborate why if anyone is actually interested.
>
> I, for one, am very interested.
>
> --Kerin
>

Alright, here goes:

Rich Freeman wrote:

> FYI - one very big limitation of hdfs is its minimum filesize is
> something huge like 1MB or something like that.  Hadoop was designed
> to take a REALLY big input file and chunk it up.  If you use hdfs to
> store something like /usr/portage it will turn into the sort of
> monstrosity that you'd actually need a cluster to store.

This is exactly correct, except we run with a block size of 128MB, and a large 
cluster will typically have a block size of 256MB or even 512MB.
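
For reference, the block size isn't baked into the filesystem format; it's the
dfs.blocksize setting in hdfs-site.xml, and a client can even override it per
file at create time. A rough sketch with the standard org.apache.hadoop.fs API
(the path and numbers are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            // The cluster-wide default comes from dfs.blocksize in hdfs-site.xml.
            FileSystem fs = FileSystem.get(new Configuration());

            // A client can override it per file at create time:
            // create(path, overwrite, bufferSize, replication, blockSize)
            long blockSize = 256L * 1024 * 1024; // 256MB for this one file
            FSDataOutputStream out = fs.create(new Path("/data/big.dat"),
                    true, 4096, (short) 3, blockSize);
            out.close();
            fs.close();
        }
    }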

HDFS has two main components: a NameNode, which keeps track (in memory) of which 
blocks belong to which file, and the DataNodes, which actually store the blocks. 
No data ever flows through the NameNode; it only brokers transfers between 
clients and DataNodes, and does the same for jobs. Since all of that metadata 
lives in the NameNode's RAM, small files are bad: a tiny file costs about as 
much NameNode memory as a single-block file, so lots of them waste RAM.
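
To make the NameNode/DataNode split concrete, here's a rough sketch of a client
asking where the blocks of a file live, via the standard org.apache.hadoop.fs
API. The NameNode answers this entirely from its in-memory metadata; the actual
bytes would then be streamed straight from the DataNodes. The hostname and path
are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // made-up NameNode
            FileSystem fs = FileSystem.get(conf);

            Path p = new Path("/data/huge.log"); // made-up path
            FileStatus stat = fs.getFileStatus(p);

            // Pure metadata: the NameNode answers this from memory, no file data moves.
            for (BlockLocation b : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " datanodes=" + String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }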

What exactly is Hadoop/HDFS used for? The most common uses are generating 
search indices on data (a batch job), doing non-realtime processing of log 
and/or data streams (another batch job), and letting a large number of analysts 
run disparate queries on the same large dataset (yet another batch job). Batch 
processing - processing the entire dataset - is really where Hadoop shines.

When you put a file into HDFS, it gets split based on the block size. This is 
done so that a parallel read will be really fast - each map task reads in a 
single block and processes it. Ergo, if you put in a 1GB file with a 128MB 
block size and run a MapReduce job, 8 map tasks will be launched; put in a 1TB 
file and 8192 tasks will be launched. Tuning the block size is a trade-off 
between the overhead of launching lots of small tasks and under-utilizing the 
cluster with too few big ones, which is why a cluster with a lot of data 
typically runs a bigger block size.
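
The arithmetic is just one map task per block, i.e. ceil(file size / block
size). A toy sketch that reproduces the 8 and 8192 above (the real split
calculation in FileInputFormat also honors min/max split size settings, so
treat this as the simple case):

    public class SplitCount {
        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024;      // 128MB block size
            long[] fileSizes = {1L << 30, 1L << 40};  // 1GB and 1TB

            for (long size : fileSizes) {
                // One map task per block: ceil(size / blockSize)
                long mapTasks = (size + blockSize - 1) / blockSize;
                System.out.println(size + " bytes -> " + mapTasks + " map tasks");
            }
        }
    }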

The downsides of HDFS:
* Seeking around within a file for reads is not supported afaik, because nobody 
needs that for batch processing.
* Seeking within an existing file for writes is not supported, because either 
new blocks would have to be inserted in the middle of the file and wouldn't be 
128MB, or existing blocks would have to be edited and would end up larger than 
128MB. Both of these scenarios are bad.

Since HDFS users typically do not need to seek around for reads or writes, 
these downsides aren't really a big deal.
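
Concretely, the only write paths the FileSystem API exposes are create() and
append(), and the output stream has no seek(), so overwriting the middle of an
existing file isn't even expressible. A rough sketch (made-up path; append has
to be enabled on older Hadoop versions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendOnly {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // You can create a new file or append to the end of an existing one;
            // there is no API for seeking into the middle of a file and writing.
            FSDataOutputStream out = fs.append(new Path("/logs/app.log"));
            out.write("one more record\n".getBytes("UTF-8"));
            out.close();
            fs.close();
        }
    }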

If something's not clear, let me know.

Alec

