oleksiy wrote:
> 
> Hello,
> 
> I would suggest you read at least this piece of info: 
> http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.html#NameNode+and+DataNodes
> HDFS Architecture 
> 
> 
> This is the main part of the HDFS architecture. There you can find some
> info on how clients read data from different nodes. 
> Also I would suggest the good book "
> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732 Tom
> White - Hadoop: The Definitive Guide - 2010, 2nd Edition "
> There you will definitely find answers to all your questions. 
> 
> Regards,
> Oleksiy
> 
> 
> panamamike wrote:
>> 
>> I'm new to Hadoop.  I've read a few articles and presentations aimed at
>> explaining what Hadoop is and how it works.  Currently my understanding
>> is that Hadoop is an MPP system which leverages a large block size to
>> quickly find data.  In theory, I understand how a large block size, an
>> MPP architecture, and what I understand to be a massive index scheme via
>> MapReduce can be used to find data.
>> 
>> What I don't understand is how, after you identify the appropriate 64MB
>> block, you find the data you're specifically after.  Does this mean the
>> CPU has to search the entire 64MB block for the data of interest?  If
>> so, how does Hadoop know what data from that block to retrieve?
>> 
>> I'm assuming the block is probably composed of one or more files.  If
>> not, I'm assuming the user isn't looking for the entire 64MB block but
>> rather a portion of it.
>> 
>> Any help indicating documentation, books, articles on the subject would
>> be much appreciated.
>> 
>> Regards,
>> 
>> Mike
>> 
> 
> 

Oleksiy,

Thank you for your input; I've actually already read that section of the
Hadoop documentation.  I think it does a good job of describing the general
architecture of how Hadoop works.  The description reminds me of the
Teradata MPP architecture.  The thing I'm still missing is how Hadoop finds
things.

I see how Hadoop can potentially narrow searches down by using metadata
indexes to find the large 64MB blocks (I'm calling these large since typical
filesystem blocks are measured in kilobytes).  However, once it does find
this block, how does it search within the block?  Does it then come down to
a brute-force search of the 64MB, relying on systems just being fast enough
these days that the search isn't a big deal?
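To make sure I'm picturing the first half of this correctly, here's a rough
sketch of my current mental model (hypothetical names, not real HDFS client
code): as I understand it, the client doesn't search for a block at all; it
gets the file's block list from the NameNode and picks the right block with
simple offset arithmetic.

```python
# Toy sketch of my mental model, NOT actual HDFS client code.
# A file is a sequence of fixed-size blocks, so a byte offset maps
# directly to (block index, offset within block) -- no searching needed.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the HDFS default being discussed

def block_for_offset(file_offset, block_size=BLOCK_SIZE):
    """Return (block_index, offset_within_block) for a byte offset."""
    return file_offset // block_size, file_offset % block_size

# Byte 150,000,000 of a file lands in block 2, at offset 15,782,272.
print(block_for_offset(150_000_000))
```

If that's right, the remaining question is only what happens *inside* the
block once a DataNode starts streaming it back.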

Going back to my comparison to Teradata: Teradata had a weakness in that the
speed of the MPP architecture was dependent on the quality of the
data-distribution index.  Meaning, there had to be a way for the system to
determine how to store data across the commodity hardware in order to have
an even distribution.  If the distribution isn't even (based on the index
defined, most data goes to one node in the system), you get something called
"hot AMPing", where the MPP advantage is lost because the majority of the
work is directed to that one node.

How does Hadoop tackle this particular issue?  Really, when it comes down
to it, how does Hadoop distribute the data, balance the load, and keep up
the parallel performance?  This gets back to my question of how Hadoop finds
things quickly.  I know in Teradata it's based on the design of the main
index.  My assumption is that Hadoop does something similar with the
metadata, but that would mean unstructured data would have to be associated
with some sort of metadata tags.
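To illustrate the contrast I'm asking about, here's a toy sketch (made-up
policy, not Hadoop's actual placement code): my impression is that HDFS
places each block on nodes chosen independently of the data's *content*,
unlike a hash-distribution index, so skewed key values can't pile all the
data onto one node the way a bad Teradata primary index can.

```python
import random

# Toy sketch of content-independent block placement -- hypothetical
# policy for illustration, not HDFS's real replica-placement algorithm.

def place_block(block_id, datanodes, replication=3):
    """Pick `replication` distinct DataNodes for one block.

    The choice ignores the block's contents entirely, so placement
    stays even regardless of skew in the data's key values.
    """
    return random.sample(datanodes, replication)

nodes = [f"dn{i}" for i in range(8)]
placements = {b: place_block(b, nodes) for b in range(10)}
```

If that's roughly right, it would explain why Hadoop avoids hot AMPing
without needing a well-designed distribution index at all.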

Furthermore, that unstructured data could only be found if the correct
metadata key values are searched.  Is this the way it works?
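And if the answer to my earlier question really is a parallel brute-force
scan, I imagine it looks something like this toy sketch (hypothetical helper
names, not the real Hadoop MapReduce API): each node scans only the blocks
stored locally and emits matches, and speed comes from every node scanning
in parallel rather than from any index.

```python
# Toy sketch of a map-side "grep" over one locally stored block --
# hypothetical names, not the real Hadoop MapReduce API.

def map_scan(block_lines, pattern):
    """Brute-force scan one block's records, emitting matching lines."""
    return [line for line in block_lines if pattern in line]

# Each node would run map_scan over its own blocks in parallel; matches
# are then gathered (reduced) across nodes into the final result.
block = ["alice,42", "bob,17", "alice,99"]
print(map_scan(block, "alice"))
```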

Mike
-- 
View this message in context: 
http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32724383.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
