Hi,
I just wanted to share a test we conducted on our small cluster of 3
datanodes and one namenode. Basically, we have lots of data to process, and
we run a parsing script outside Hadoop that creates the key,value pairs.
This output, which consists of plain text files, is then imported into
Hadoop using the put/get etc. commands.
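To make that concrete, here is a minimal sketch of the kind of thing our
parsing step produces. The log format, field meanings, and paths below are
made up for illustration; they are not our actual script:

```shell
# Hypothetical parsing step run outside the cluster: turn raw log lines
# into tab-separated key,value pairs, which the later reduce-only job
# will aggregate.
printf 'alice GET /index\nbob GET /index\nalice POST /login\n' > raw.log

# Emit one "user<TAB>1" pair per request line.
awk '{ print $1 "\t" 1 }' raw.log > parsed.txt

cat parsed.txt

# On any machine with the Hadoop client configured against the cluster,
# the import would then be something like:
#   hadoop fs -put parsed.txt /user/usman/parsed/
```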
In order to speed things up, we run the parsing jobs in parallel on
multiple machines that are not part of our cluster (3 datanodes +
namenode), but they do have the same version of Hadoop installed as the
cluster, which we use to perform the puts. This workflow has significantly
improved our time to import the data into Hadoop, after which we run the
reduce-only step to aggregate.
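For reference, a reduce-only job can be expressed with Hadoop Streaming by
making the map step an identity. This is only a sketch under assumptions:
the streaming jar path, the HDFS paths, and the sum_by_key.sh aggregation
script are all hypothetical, not our actual setup:

```shell
# Hypothetical reduce-only aggregation via Hadoop Streaming.
# "-mapper cat" passes the already-parsed key,value pairs through
# unchanged, so only the reducer does real work.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
    -input   /user/usman/parsed \
    -output  /user/usman/aggregated \
    -mapper  cat \
    -file    sum_by_key.sh \
    -reducer sum_by_key.sh
```

Here sum_by_key.sh stands in for whatever aggregation logic sums the
values for each key.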
Currently, data is inserted through our namenode, which all the machines
outside the cluster (call them HDFS clients) connect to; these machines
are not part of the master/slave setup. I haven't tried it, but maybe we
could perform these puts via the datanodes themselves and not just through
the namenode? Right now the namenode is the single point through which the
HDFS client machines insert the parsed data.
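One thing we could check is where the uploaded blocks actually land on the
datanodes. If it's useful, fsck reports that; the path below is
hypothetical:

```shell
# Lists the files under the given HDFS path, their blocks, and which
# datanodes hold each replica.
hadoop fsck /user/usman/parsed -files -blocks -locations
```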
Secondly, I would assume that this is a safe way to import parsed data
into Hadoop before we aggregate, and that it will most likely not cause
any data corruption in HDFS. Granted, anything can happen :).
It would be interesting to import our raw logs and perform the mapping
step inside Hadoop versus doing it outside. I wonder whether the
performance would be better, worse, or the same. Yes, this depends on many
factors, such as the number of datanodes, the amount of data to process,
the hardware, etc., but we are limited. We are trying to utilize idle
machines outside the cluster that can process the data and then insert the
output into HDFS via puts.
Your thoughts, comments, suggestions are welcome.
Thanks,
Usman
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/