On 18/09/2014 14:12, Alec Ten Harmsel wrote:

On 09/18/2014 05:17 AM, Kerin Millar wrote:
On 17/09/2014 21:20, Alec Ten Harmsel wrote:
As far as HDFS goes, I would only set that up if you will use it for
Hadoop or related tools. It's highly specific, and the performance is
not good unless you're doing a massively parallel read (what it was
designed for). I can elaborate why if anyone is actually interested.

I, for one, am very interested.

--Kerin


Alright, here goes:

Rich Freeman wrote:

FYI - one very big limitation of hdfs is its minimum filesize is
something huge like 1MB or something like that.  Hadoop was designed
to take a REALLY big input file and chunk it up.  If you use hdfs to
store something like /usr/portage it will turn into the sort of
monstrosity that you'd actually need a cluster to store.

This is exactly correct, except we run with a block size of 128MB, and a large 
cluster will typically have a block size of 256MB or even 512MB.
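
As an aside, the block size is just a setting (dfs.blocksize, normally in
hdfs-site.xml) that the client reads when it creates a file. A rough sketch of
overriding it from a Java client - assuming the stock Hadoop
Configuration/FileSystem API and that fs.defaultFS already points at the
cluster - would look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default block size for files created by this client;
        // the cluster-wide default normally lives in hdfs-site.xml.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256MB

        FileSystem fs = FileSystem.get(conf);
        // "/data/example.log" is just a placeholder path for illustration.
        fs.create(new Path("/data/example.log")).close();
        fs.close();
    }
}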

HDFS has two main components: a NameNode, which keeps track of which blocks are 
a part of which file (in memory), and the DataNodes that actually store the 
blocks. No data ever flows through the NameNode; it only negotiates transfers 
between the client and the DataNodes, and does the same for jobs. Since the 
NameNode stores metadata in-memory, small files are bad because RAM gets wasted.
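
To put a rough number on the wasted RAM: the figure that usually gets thrown
around is something on the order of 150 bytes of NameNode heap per file and
per block. This is only a back-of-envelope sketch under that assumption:

public class NameNodeMemoryEstimate {
    // Rough rule of thumb (an assumption, not an exact figure): each file
    // and each block costs on the order of 150 bytes of NameNode heap.
    private static final long BYTES_PER_OBJECT = 150;

    static long estimateHeapBytes(long files, long blocksPerFile) {
        return files * BYTES_PER_OBJECT                   // file objects
             + files * blocksPerFile * BYTES_PER_OBJECT;  // block objects
    }

    public static void main(String[] args) {
        // ~10GB of data as 10 million 1KB files: 10M file + 10M block objects
        System.out.println("10M small files: ~"
                + estimateHeapBytes(10_000_000, 1) / (1024 * 1024) + " MB of heap");
        // The same ~10GB as ten 1GB files (128MB blocks, 8 blocks per file)
        System.out.println("10 x 1GB files : ~"
                + estimateHeapBytes(10, 8) / 1024 + " KB of heap");
    }
}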

What exactly is Hadoop/HDFS used for? The most common uses are generating 
search indices on data (a batch job), doing non-realtime processing of log 
streams and/or data streams (another batch job), and allowing a large number 
of analysts to run disparate queries on the same large dataset (another 
batch job). Batch processing - processing the entire dataset - is really where 
Hadoop shines.

When you put a file into HDFS, it gets split based on the block size. This is 
done so that a parallel read will be really fast - each map task reads in a 
single block and processes it. Ergo, if you put in a 1GB file with a 128MB 
block size and run a MapReduce job, 8 map tasks will be launched. If you put in 
a 1TB file, 8192 tasks will be launched. Tuning the block size is important for 
balancing the overhead of launching tasks against potentially under-utilizing 
the cluster. Typically, a cluster with a lot of data has a bigger block size.
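
The arithmetic is just ceil(file size / block size); the real InputFormat
split logic has a few more corner cases, but this sketch reproduces the
numbers above:

public class SplitCount {
    // Number of map tasks is roughly ceil(fileSize / blockSize).
    static long mapTasks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024;
        long GB = 1024 * MB;
        long TB = 1024 * GB;

        System.out.println(mapTasks(1 * GB, 128 * MB)); // 8
        System.out.println(mapTasks(1 * TB, 128 * MB)); // 8192
        System.out.println(mapTasks(1 * TB, 512 * MB)); // 2048 - fewer, heavier tasks
    }
}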

The downsides of HDFS:
* Seeked reads are not supported afaik because no one needs that for batch 
processing
* Seeked writes into an existing file are not supported because either blocks 
would be added in the middle of a file and wouldn't be 128MB, or existing 
blocks would be edited, resulting in blocks larger than 128MB. Both of these 
scenarios are bad.

Since HDFS users typically do not need seeked reads or seeked writes, these 
downsides aren't really a big deal.
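
For what it's worth, the typical access pattern through the Java FileSystem
API looks like the sketch below (the path and cluster address are just
assumptions for illustration): files are written once, front to back, and
then read back as a stream.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamingAccess {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points at the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/streaming-example.txt"); // placeholder path

        // Write once, front to back; there is no going back to rewrite
        // a byte range in the middle of the file.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("written once, read many times\n");
        }

        // Read it back as a stream, the way a map task consumes its block.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}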

If something's not clear, let me know.

Thank you for taking the time to explain.

--Kerin
