While scp will copy data to the namenode machine, it does *not* store
the data in DFS; it simply copies the data to that machine's local disk.
This is the same as copying data to any other machine. The data isn't in
DFS and is not accessible from DFS. If the box running the namenode fails,
you lose your data.
The reason put is slower is that the data is actually being stored in
the DFS: it is split into blocks and written to multiple machines. It is
then accessible to programs that read from the DFS, such as MapReduce jobs.
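To make the difference concrete, here is a rough sketch in Python. It is a
toy model only, not Hadoop's actual block placement logic: the block size,
replication factor, node names, and round-robin placement are all made-up
assumptions for illustration. It just counts how many bytes end up written
for a plain copy versus a blocked, replicated store.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list, datanodes: list, replication: int) -> dict:
    """Toy placement: put each block on `replication` distinct nodes, round-robin.

    This is an illustrative assumption, not how a real DFS chooses nodes.
    """
    if replication > len(datanodes):
        raise ValueError("replication cannot exceed the number of datanodes")
    placement = {node: [] for node in datanodes}
    for idx, block in enumerate(blocks):
        for r in range(replication):
            node = datanodes[(idx + r) % len(datanodes)]
            placement[node].append((idx, len(block)))
    return placement


def bytes_written(placement: dict) -> int:
    """Total bytes stored across all nodes, counting every replica."""
    return sum(size for replicas in placement.values() for _, size in replicas)


# Hypothetical numbers, chosen small so the arithmetic is easy to follow.
data = b"x" * 1000
datanodes = ["dn1", "dn2", "dn3", "dn4"]

blocks = split_into_blocks(data, block_size=256)
placement = place_blocks(blocks, datanodes, replication=3)

scp_bytes = len(data)                 # one copy on one machine's local disk
put_bytes = bytes_written(placement)  # every block stored `replication` times

print("plain copy:", scp_bytes, "bytes on 1 machine")
print("toy DFS put:", put_bytes, "bytes across", len(datanodes), "machines")
```

In this sketch the put ends up writing three times as many bytes as the
plain copy, spread over several machines. That extra work is what buys you
a copy that survives the loss of a single box.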
Dennis
Prasad Pingali wrote:
Hello,
I observe that scp of data to the namenode is faster than actually putting
it into dfs (all nodes are on the same switch and have the same ethernet
cards, homogeneous nodes). I understand that "dfs -put" breaks the data into
blocks and then copies them to datanodes, but shouldn't that be at least as
fast as copying data to the namenode from a single machine, if not faster?
thanks and regards,
Prasad Pingali,
IIIT Hyderabad.