Re: Read files from hdfs

Hassen Riahi Thu, 12 May 2011 04:09:04 -0700

Thanks for the reply.

Maybe I was not clear enough when explaining the use case. Sorry for that.


Assuming:

1- we are not running map reduce jobs
2- the read from hdfs is sequential
3- the network is not heavily used

I want to read 1 file remotely from a distributed filesystem. I have 2 alternatives:


1- reading it from HDFS

2- reading it from a usual distributed filesystem (which stores the whole file on one machine, rather than splitting it into blocks and distributing them as hdfs does)

1 could be slower than 2, since 1 introduces more overhead than 2 (for each new hdfs block to read, a connection must be established with the datanode holding that block...)


Is it right?

Hassen

Yes, it could get slower, because the operation would now involve a disk
read AND a network transfer (along with the other small overheads it
carries).
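
As a rough back-of-envelope sketch of the overhead being discussed, the Python below compares a block-by-block read (one connection per block) with a read from a single server (one connection in total). The helper `estimate_read_time` is hypothetical, and every figure in it (block size, per-connection cost, throughput) is an illustrative assumption, not a measurement from any cluster.

```python
import math

def estimate_read_time(file_size, block_size, connect_cost, throughput):
    """Estimate seconds to read a file sequentially, one block at a time.

    file_size, block_size: sizes in bytes
    connect_cost: assumed seconds to set up one connection (hypothetical)
    throughput: assumed sustained transfer rate in bytes per second
    """
    # One connection per block; the last block may be partial.
    n_connections = max(1, math.ceil(file_size / block_size))
    return n_connections * connect_cost + file_size / throughput

MiB = 1024 * 1024
file_size = 256 * MiB
throughput = 64 * MiB      # assumed 64 MiB/s for both cases
connect_cost = 0.01        # assumed 10 ms per connection

# HDFS-like layout: the file is split into 64 MiB blocks.
hdfs_like = estimate_read_time(file_size, 64 * MiB, connect_cost, throughput)
# Whole file on one server: a single connection covers the whole read.
single = estimate_read_time(file_size, file_size, connect_cost, throughput)

print(f"block-by-block: {hdfs_like:.2f} s, single server: {single:.2f} s")
```

Under these assumed numbers the extra connections add about 0.03 s to a read of roughly 4 s, so in this simple model the per-block overhead only matters when the connection cost is large relative to the time spent transferring one block.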

2011/5/12 Hassen Riahi <hassen.ri...@cern.ch>:
Thank you Elton and Stanley for your reply.
Given that we are not running map reduce jobs (at least until now), and
assuming that the read is sequential and the network is not heavily used,
I would expect to see in general a degradation of performance when
reading 1 file from hdfs (hdfs blocks will be read sequentially from
different datanodes) compared to reading it from a usual filesystem (which
stores the file without splitting it). Is that right?
Thanks,
Hassen

Hassen,
Read in hdfs is sequential, i.e. one block is read after another. Each time,
the client connects to one data node to read a block, then connects to
another (or the same) data node to read the next block.
The reason for this sequential design, I guess, is to avoid a network traffic
explosion in a heavy map reduce job.
-Elton
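
The block-after-block behaviour described above can be sketched as a small in-memory simulation. This is not the Hadoop client API: `read_file`, the datanode names, and the block ids are all hypothetical, and picking the first listed replica is a simplifying assumption.

```python
def read_file(block_locations, datanodes):
    """Read blocks in order, contacting one simulated datanode per block.

    block_locations: list of (block_id, [names of datanodes holding a replica])
    datanodes: dict mapping datanode name -> {block_id: bytes}
    Returns (file contents, list of datanodes contacted, in order).
    """
    data = bytearray()
    contacted = []
    for block_id, replicas in block_locations:
        node = replicas[0]          # assumption: naively take the first replica
        contacted.append(node)      # stands in for opening a connection
        data += datanodes[node][block_id]
    return bytes(data), contacted

# Hypothetical cluster: two datanodes holding a three-block file.
datanodes = {
    "dn1": {"blk_0": b"hello ", "blk_2": b"world"},
    "dn2": {"blk_1": b"hdfs "},
}
block_locations = [
    ("blk_0", ["dn1"]),
    ("blk_1", ["dn2"]),
    ("blk_2", ["dn1"]),
]

contents, contacted = read_file(block_locations, datanodes)
print(contents, contacted)
```

The loop never has more than one block in flight, which is the property the message above describes: the next datanode is contacted only after the previous block has been fully read.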

2011/5/8 <stanley....@emc.com>
To my understanding, the reader reads file blocks in parallel.

-----Original Message-----
From: Hassen Riahi [mailto:hassen.ri...@cern.ch]
Sent: May 7, 2011 23:50
To: hdfs-user@hadoop.apache.org
Subject: Read files from hdfs

Hi all,

Is the read operation of 1 file stored in hdfs done in parallel?
I mean, let's say that I have 1 file split into 2 blocks (hdfs blocks) and
each block is stored in 1 rack.
When reading this file, are both blocks read in parallel? Or is the first block read and then, once that is done, the read of the second block begins?
If the latter is right, the read of files in hdfs is then sequential.
Is that right, or am I missing something?

Thanks,
Hassen
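
To make the two alternatives in this question concrete, here is a minimal Python sketch of the difference in elapsed time for the two-block case. The per-block times are made-up values, and the parallel case assumes the client's own network link is not the bottleneck; the sketch illustrates the question only and does not show which strategy the HDFS client uses.

```python
def sequential_time(block_times):
    """Elapsed time when each block is read only after the previous one."""
    return sum(block_times)

def parallel_time(block_times):
    """Elapsed time if all blocks were fetched at once (idealised:
    assumes the client can absorb every stream at full speed)."""
    return max(block_times)

# Assumed seconds to fetch each of the 2 blocks from its rack.
block_times = [1.0, 1.2]

print("sequential:", sequential_time(block_times))
print("parallel:  ", parallel_time(block_times))
```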
--
Harsh J
