Thanks for the reply.
Maybe I was not clear enough when explaining the use case. Sorry for
that.
Assuming:
1- we are not running map reduce jobs
2- the read from hdfs is sequential
3- the network is not heavily used
I want to read 1 file remotely from a distributed filesystem, and I
have 2 alternatives:
1- reading it from HDFS
2- reading it from a usual distributed filesystem (which stores the
whole file on a single machine, rather than splitting it into blocks
and distributing them as hdfs does)
Alternative 1 could be slower than 2, since 1 introduces more overhead
than 2 (for each new hdfs block to read, a connection must be
established with the datanode containing that block...)
Is that right?
Hassen
Yes, it could get slower, because the operation would now involve a
disk read AND a network transfer (along with the other small overheads
it carries).
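As a rough illustration of the per-block overhead discussed above, here
is a back-of-the-envelope sketch. It is only a toy cost model, not a
measurement: the function name estimated_read_seconds, the 64 MiB block
size, the 100 MiB/s throughput and the 5 ms per-block setup cost are all
assumptions chosen for illustration.

```python
def estimated_read_seconds(file_bytes, block_bytes,
                           throughput_bytes_per_s, per_block_setup_s):
    """Estimate the time to read a file one block after another.

    Hypothetical model: total time = pure transfer time plus a fixed
    setup cost (e.g. opening a connection) paid once per block.
    """
    if file_bytes <= 0:
        return 0.0
    # Number of blocks, rounding up for a partial final block.
    num_blocks = -(-file_bytes // block_bytes)
    transfer_s = file_bytes / throughput_bytes_per_s
    setup_s = num_blocks * per_block_setup_s
    return transfer_s + setup_s


MB = 1024 * 1024
file_size = 1024 * MB   # 1 GiB file (illustrative)
throughput = 100 * MB   # 100 MiB/s sustained read (illustrative)
setup = 0.005           # 5 ms setup cost per block (illustrative)

# Alternative 1: file split into 64 MiB blocks, one setup per block.
hdfs_like = estimated_read_seconds(file_size, 64 * MB, throughput, setup)

# Alternative 2: whole file served from one machine, one setup in total.
single_server = estimated_read_seconds(file_size, file_size, throughput, setup)

print(f"block by block: {hdfs_like:.3f} s")
print(f"single server:  {single_server:.3f} s")
```

With these made-up numbers the block-by-block read is slower, but the
added setup time is under 1% of the transfer time. How large the gap is
in practice depends on the real block size, connection cost and
throughput, so it is worth measuring on the actual cluster.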
2011/5/12 Hassen Riahi <hassen.ri...@cern.ch>:
Thank you Elton and Stanley for your reply.
Given that we are not running map reduce jobs (at least until now),
assuming that the read is sequential, and in the case where the network
is not heavily used, I would expect to see in general a degradation of
performance when reading 1 file from hdfs (hdfs blocks will be read
sequentially from different datanodes) compared to reading it from a
usual filesystem (which stores the file without splitting it). Is that
right?
Thanks,
Hassen
Hassen,
Read in hdfs is sequential, i.e. one block is read after another. Each
time, the client connects to one data node to read a block, then
connects to another (or the same) data node to read the next block.
The reason for this sequential design, I guess, is to avoid a network
traffic explosion in a heavy map reduce job.
-Elton
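The block-after-block read described above can be sketched with a toy
in-memory model. This is not the HDFS client API: the function
read_file_sequentially, the datanode names dn1/dn2 and the block ids
blk_1/blk_2 are hypothetical, and a "connection" is just a dictionary
lookup.

```python
def read_file_sequentially(block_locations, datanodes):
    """Read a file's blocks in order, contacting one datanode per block.

    block_locations: ordered list of (block_id, datanode_name) pairs.
    datanodes: dict mapping datanode_name -> {block_id: bytes}.
    Returns the file content and the list of datanodes contacted.
    """
    data = bytearray()
    contacted = []
    for block_id, node_name in block_locations:
        # "Connect" to the datanode that holds this block.
        node = datanodes[node_name]
        contacted.append(node_name)
        # Read the whole block before moving on to the next one.
        data.extend(node[block_id])
    return bytes(data), contacted


# Toy cluster: 1 file split into 2 blocks stored on 2 datanodes.
datanodes = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_2": b"world"},
}
block_locations = [("blk_1", "dn1"), ("blk_2", "dn2")]

content, contacted = read_file_sequentially(block_locations, datanodes)
print(content, contacted)
```

The second block is not requested until the first one has been fully
read, so at any moment the client is talking to a single datanode.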
2011/5/8 <stanley....@emc.com>
To my understanding, the reader reads file blocks in parallel.
-----Original Message-----
From: Hassen Riahi [mailto:hassen.ri...@cern.ch]
Sent: May 7, 2011 23:50
To: hdfs-user@hadoop.apache.org
Subject: Read files from hdfs
Hi all,
Is the read operation of 1 file stored in hdfs done in parallel?
I mean, let's say that I have 1 file split into 2 blocks (hdfs blocks)
and each block is stored in 1 rack.
When reading this file, are both blocks read in parallel? Or is the
first block read and then, once done, the read of the second block
begins?
If the latter is right, the read of files in hdfs is then sequential.
Is that right, or am I missing something?
Thanks,
Hassen
--
Harsh J