On 18/09/2014 14:12, Alec Ten Harmsel wrote:
On 09/18/2014 05:17 AM, Kerin Millar wrote:
On 17/09/2014 21:20, Alec Ten Harmsel wrote:
As far as HDFS goes, I would only set it up if you will be using it for
Hadoop or related tools. It's highly specific, and the performance is
not good unless you're doing a massively parallel read (what it was
designed for). I can elaborate on why if anyone is actually interested.
I, for one, am very interested.
--Kerin
Alright, here goes:
Rich Freeman wrote:
FYI - one very big limitation of HDFS is that its minimum file size is
something huge, like 1MB or so. Hadoop was designed to take a REALLY big
input file and chunk it up. If you use HDFS to store something like
/usr/portage, it will turn into the sort of monstrosity that you'd
actually need a cluster to store.
This is essentially correct, except that we run with a block size of 128MB, and a large
cluster will typically have a block size of 256MB or even 512MB.
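For illustration, here is a minimal sketch using the standard HDFS Java
client (the path, replication factor and the 256MB figure are made up;
"dfs.blocksize" is the Hadoop 2.x property name). The block size is chosen
per file at write time, either from configuration or as an explicit
argument to create():

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Client-side default; "dfs.blocksize" is the Hadoop 2.x name
            // (older releases used "dfs.block.size").
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

            FileSystem fs = FileSystem.get(conf);
            Path dst = new Path("/data/big-input.log");  // illustrative path

            // Or override per file: create(path, overwrite, bufferSize,
            // replication, blockSize).
            FSDataOutputStream out = fs.create(dst, true, 4096, (short) 3,
                    256L * 1024 * 1024);
            out.writeBytes("example payload\n");
            out.close();
        }
    }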
HDFS has two main components: a NameNode, which keeps track (in memory) of which
blocks are part of which file, and the DataNodes, which actually store the
blocks. No data ever flows through the NameNode; it only negotiates transfers
between clients and DataNodes, both for ordinary reads/writes and for jobs.
Since the NameNode stores all of its metadata in memory, small files are bad:
every file and block costs NameNode RAM no matter how little data it holds, so
lots of small files waste memory.
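As a rough sketch of what the NameNode actually serves (standard FileSystem
API; the path is made up): a client can ask for the block layout of a file
without touching any DataNode - the data itself is only fetched later,
directly from the DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/data/big-input.log"));

            // The NameNode answers this from its in-memory metadata; no
            // block data is transferred.
            BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset()
                        + " length " + b.getLength()
                        + " on " + String.join(",", b.getHosts()));
            }
        }
    }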
What exactly is Hadoop/HDFS used for? The most common uses are generating
search indices on data (a batch job), doing non-realtime processing of log
streams and/or data streams (another batch job), and allowing a large number
of analysts to run disparate queries on the same large dataset (yet another
batch job). Batch processing - processing the entire dataset - is really where
Hadoop shines.
When you put a file into HDFS, it gets split based on the block size. This is
done so that a parallel read will be really fast - each map task reads in a
single block and processes it. Ergo, if you put in a 1GB file with a 128MB
block size and run a MapReduce job, 8 map tasks will be launched; put in a
1TB file and 8192 tasks will be launched. Tuning the block size is about
balancing the overhead of launching many tasks against potentially
under-utilizing the cluster. Typically, a cluster with a lot of data has a
bigger block size.
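If it helps, here is a rough sketch of that arithmetic using the same client
API (made-up path again), assuming the default FileInputFormat behaviour of
one map task per block-sized split:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SplitCount {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/data/big-input.log"));

            long len = st.getLen();          // file length in bytes
            long block = st.getBlockSize();  // block size the file was written with

            // Rounded up: 1GB / 128MB -> 8 tasks, 1TB / 128MB -> 8192 tasks.
            long mapTasks = (len + block - 1) / block;
            System.out.println(len + " bytes at " + block
                    + " bytes/block -> " + mapTasks + " map tasks");
        }
    }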
The downsides of HDFS:
* Seeked reads are not supported, AFAIK, because no one needs them for batch
processing
* Seeked writes into an existing file are not supported because either blocks
would be added in the middle of a file and wouldn't be 128MB, or existing
blocks would be edited, resulting in blocks larger than 128MB. Both of these
scenarios are bad.
Since HDFS users typically do not need seeked reads or seeked writes, these
downsides aren't really a big deal.
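For contrast, the access pattern HDFS is built for is plain open-and-stream.
A rough sketch with the standard client API (path made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path("/data/big-input.log"));

            byte[] buf = new byte[64 * 1024];
            long total = 0;
            int n;
            // Straight sequential scan from the first byte to the last; no
            // seeking around and no writing into the middle of the file.
            while ((n = in.read(buf)) > 0) {
                total += n;
            }
            in.close();
            System.out.println("read " + total + " bytes sequentially");
        }
    }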
If something's not clear, let me know.
Thank you for taking the time to explain.
--Kerin