Shreya,

What we described is how the APIs work.  They stream the data to the client and 
hide the details of that from the user.  What you do with those APIs, such as 
caching the data, is up to you.  If you need all of the data in order to do 
your work, as with displaying an image, then you can cache it all; just be 
careful that you don't run out of memory, because HDFS is designed to hold big 
data.  But the HDFS APIs will not cache the data for you so that you can seek 
back to the beginning and run through it again quickly, as in the video 
example.  I am not positive how much data the client keeps in memory at any one 
point in time, but it is not likely to be more than a block.  If you want more 
caching functionality you need to do it in your own code.

Hadoop HDFS and Map/Reduce are designed to work with big data, so they try very 
hard not to hold all of that data in memory at any point in time, and to 
process the data in parallel.  Yes, Map/Reduce does not require HDFS, but it 
does require a file system that implements the proper client-side APIs.  IBM 
and others have put Hadoop Map/Reduce on top of other distributed file systems, 
and yes, it can also run off of the local file system too.  That just means 
that someone wrote a new implementation of the HDFS client APIs and set the 
configs to point to it.
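
To make the video case concrete, here is a rough sketch of what it looks like 
from the client side using the FileSystem API.  The path and buffer size are 
made up; the point is that seek(0) just starts pulling the blocks from the 
datanodes again, because the stream did not cache anything for you:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);            // whatever file system the configs point to
    Path path = new Path("/user/shreya/video.mp4");  // made-up file

    byte[] buf = new byte[64 * 1024];
    FSDataInputStream in = fs.open(path);
    try {
      while (in.read(buf) != -1) {
        // first pass: blocks are streamed from the datanodes one at a time
      }
      in.seek(0);                                    // "watch it again"
      while (in.read(buf) != -1) {
        // second pass: the blocks are fetched again from the datanodes;
        // any caching would have to live in your own code
      }
    } finally {
      in.close();
    }
  }
}

The same FileSystem abstraction is what makes the last point work: point the 
configs at a different implementation (for example, if I remember the property 
name right, fs.default.name=file:/// in the 0.20 line) and the same code, 
including Map/Reduce, runs against the local disk instead of HDFS.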

--Bobby Evans

On 5/4/11 12:38 AM, "Shreya Chakravarty" <shreya_chakrava...@persistent.co.in> 
wrote:

Hi Ayon, Bobby,

Thanks for the response.
You mentioned that we read only one block at a time and keep it in the client's 
memory, but I have a few queries:
*        What if I am reading an image?  The entire content has to be on the 
client to show the image.

*        What if I stream a video?  Even though the video is buffered and 
doesn't arrive all at one go, after it is buffered I can watch the full video 
again.  (Will it start bringing the blocks from the datanodes again, one by 
one?)



Does MapReduce need HDFS as a prerequisite?  After all, it's a Java program, and 
I could actually run a MapReduce job on my local machine without HDFS.


Shreya Chakravarty | Team Lead - IBM
shreya_chakrava...@persistent.co.in <mailto:m...@persistent.co.in> | Cell: 
+91-9766310680 | Tel: +91-20-391-77809
Persistent Systems Ltd. | 20 Glorious Years | www.persistentsys.com 
<http://www.persistentsys.com/>


From: Stuti Awasthi
Sent: Wednesday, May 04, 2011 10:42 AM
To: Shreya Chakravarty
Subject: FW: Merging of files back in hadoop




From: Ayon Sinha [mailto:ayonsi...@yahoo.com]
Sent: Tuesday, April 19, 2011 9:24 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Merging of files back in hadoop


One thing to note is that the HDFS client code fetches the blocks directly from 
the datanodes after obtaining the location info from the namenode.  That way the 
namenode does not become the bottleneck for all data transfers.  The clients 
only get the information about the block sequence and locations from the 
namenode, like Bobby mentioned.
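
If you want to see that split of responsibilities for yourself, something like 
the sketch below (the path is made up) asks the namenode for the block layout 
without moving any of the data:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/shreya/big.dat"));  // made-up file

    // Metadata (block offsets, lengths and which datanodes hold each block)
    // comes from the namenode...
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + " length " + block.getLength()
          + " hosts " + Arrays.toString(block.getHosts()));
    }
    // ...while fs.open() then streams the actual bytes straight from those hosts.
  }
}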

-Ayon
See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/>
Also check out my Blog for answers to commonly asked questions. 
<http://dailyadvisor.blogspot.com>





________________________________

From: Robert Evans <ev...@yahoo-inc.com>
To: "hdfs-user@hadoop.apache.org" <hdfs-user@hadoop.apache.org>
Sent: Tue, April 19, 2011 6:37:13 AM
Subject: Re: Merging of files back in hadoop

Shreya,

The metadata is all stored in the namenode.  It stores where all of the blocks 
are located and the order of the blocks in a file.  Data is merged as needed, 
behind the scenes, when you call methods on the instance of java.io.InputStream 
returned when calling open.  So, when you open a file for reading you are 
making a connection to one of the machines that has a copy of the first block 
of the file.  As you read the data and finish with the first block, the second 
block is then fetched for you from whatever machine has a copy of it, and you 
continue until all blocks are read.  Typically in map/reduce each mapper that 
is reading data will read one block, and possibly a little bit more from the 
start of the next block.  That way you never have all of the file in memory on 
any one machine.  Typically the mappers only process a small part of the block 
at a time, one key/value pair.  However, there is nothing stopping you from 
doing something bad and trying to cache the entire contents of the file in 
memory as you read it from the stream, except that you would eventually get an 
out of memory exception.
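
To put a face on the "one key/value pair at a time" part, a mapper is usually 
as small as the sketch below (the class name and what it computes are just for 
illustration).  The framework hands map() one record at a time from the split 
it is reading, so the task never needs more than the current line in memory:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineLengthMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Only the current line is in memory here.  Stuffing every line into a
    // collection instead is the "something bad" above, and on a big enough
    // file it ends in an out of memory exception.
    context.write(new Text(line), new IntWritable(line.getLength()));
  }
}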

--
Bobby Evans

On 4/19/11 4:19 AM, "Shreya Chakravarty" <shreya_chakrava...@persistent.co.in> 
wrote:
Hi,

I have a query regarding how Hadoop merges back the data that has been split 
into blocks and stored on different nodes.
*       Where is the data merged, given that we say the file can be so huge 
that it doesn't fit onto one machine?

*       Where is the sequence maintained for merging it back?



Thanks and Regards,
Shreya Chakravarty

DISCLAIMER ========== This e-mail may contain privileged and confidential 
information which is the property of Persistent Systems Ltd. It is intended 
only for the use of the individual or entity to which it is addressed. If you 
are not the intended recipient, you are not authorized to read, retain, copy, 
print, distribute or use this message. If you have received this communication 
in error, please notify the sender and delete all copies of this message. 
Persistent Systems Ltd. does not accept any liability for virus infected mails.