Re: Seeking Someone to Review Hadoop Article

2008-10-23 Thread Mafish Liu
I'm interested in it.

On Fri, Oct 24, 2008 at 6:31 AM, Tom Wheeler [EMAIL PROTECTED] wrote:

 Each month the developers at my company write a short article about a
 Java technology we find exciting. I've just finished one about Hadoop
 for November and am seeking a volunteer knowledgeable about Hadoop to
 look it over to help ensure it's both clear and technically accurate.

 If you're interested in helping me, please contact me offlist and I
 will send you the draft.  Meanwhile, you can get a feel for the length
 and general style of the articles from our archives:

   http://www.ociweb.com/articles/publications/jnb.html

 Thanks in advance,

 Tom Wheeler




-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: Why did I only get 2 live datanodes?

2008-10-16 Thread Mafish Liu
Hi, wei:
   Do all of your nodes have the same directory structure, the same
configuration files, and the same Hadoop data directory, and do you have
access to those data directories?

   You may also have connection problems: make sure your slave nodes can ssh
to the master without a password, and turn off the firewall on your master node.
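For reference, a quick way to check both from the master (the slave host name is a placeholder, and the firewall command varies by distribution; shown here for Red Hat-style systems):

   # this should log in and print the hostname without prompting for a password
   ssh slave1 hostname
   # temporarily stop the firewall on the master while testing
   /etc/init.d/iptables stop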

-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: Need help in hdfs configuration fully distributed way in Mac OSX...

2008-09-16 Thread Mafish Liu
Hi:
  You need to configure your nodes so that node 1 can connect to node 2
without a password.
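A minimal sketch of the passwordless-ssh setup, assuming OpenSSH (user and host names are placeholders; on Mac OS X ssh-copy-id may not be available, so the cat pipeline is used here):

   # on machine 1 (the namenode), generate a key pair if you do not have one
   ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
   # append the public key to machine 2's authorized_keys
   cat ~/.ssh/id_rsa.pub | ssh user@machine2 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
   # verify: this should print machine 2's hostname without asking for a password
   ssh user@machine2 hostname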

On Tue, Sep 16, 2008 at 2:04 PM, souravm [EMAIL PROTECTED] wrote:

 Hi All,

 I'm facing a problem in configuring hdfs in a fully distributed way in Mac
 OSX.

 Here is the topology -

 1. The namenode is in machine 1
 2. There is 1 datanode in machine 2

 Now when I execute start-dfs.sh from machine 1, it connects to machine 2
 (after it asks for password for connecting to machine 2) and starts datanode
 in machine 2 (as the console message says).

 However -
 1. When I go to http://machine1:50070 - it does not show the data node at
 all. It says 0 data node configured
 2. In the log file in machine 2 what I see is -
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = rc0902b-dhcp169.apple.com/17.229.22.169
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.17.2.1
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r
 684969; compiled by 'oom' on Wed Aug 20 22:29:32 UTC 2008
 /
 2008-09-15 18:54:44,626 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /17.229.23.77:9000. Already tried 1 time(s).
 2008-09-15 18:54:45,627 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /17.229.23.77:9000. Already tried 2 time(s).
 2008-09-15 18:54:46,628 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /17.229.23.77:9000. Already tried 3 time(s).
 2008-09-15 18:54:47,629 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /17.229.23.77:9000. Already tried 4 time(s).
 2008-09-15 18:54:48,630 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /17.229.23.77:9000. Already tried 5 time(s).
 2008-09-15 18:54:49,631 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /17.229.23.77:9000. Already tried 6 time(s).
 2008-09-15 18:54:50,632 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /17.229.23.77:9000. Already tried 7 time(s).
 2008-09-15 18:54:51,633 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /17.229.23.77:9000. Already tried 8 time(s).
 2008-09-15 18:54:52,635 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /17.229.23.77:9000. Already tried 9 time(s).
 2008-09-15 18:54:53,640 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /17.229.23.77:9000. Already tried 10 time(s).
 2008-09-15 18:54:54,641 INFO org.apache.hadoop.ipc.RPC: Server at /
 17.229.23.77:9000 not available yet, Z...

 ... and this retrying keeps repeating


 The hadoop-site.xml files are like this -

 1. In machine 1
 -
 <configuration>

   <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:9000</value>
   </property>

   <property>
     <name>dfs.name.dir</name>
     <value>/Users/souravm/hdpn</value>
   </property>

   <property>
     <name>mapred.job.tracker</name>
     <value>localhost:9001</value>
   </property>

   <property>
     <name>dfs.replication</name>
     <value>1</value>
   </property>
 </configuration>


 2. In machine 2

 <configuration>

   <property>
     <name>fs.default.name</name>
     <value>hdfs://machine1 ip:9000</value>
   </property>
   <property>
     <name>dfs.data.dir</name>
     <value>/Users/nirdosh/hdfsd1</value>
   </property>
   <property>
     <name>dfs.replication</name>
     <value>1</value>
   </property>
 </configuration>

 The slaves file in machine 1 has a single entry - user name@ip of machine 2

 The exact steps I did -

 1. Reformat the namenode in machine 1
 2. execute start-dfs.sh in machine 1
 3. Then I try to see whether the datanode is created through http://machine
 1 ip:50070

 Any pointer to resolve this issue would be appreciated.

 Regards,
 Sourav







-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: Small Filesizes

2008-09-15 Thread Mafish Liu
Hi,
  I'm currently working on exactly the situation you describe, with millions
of small files of around 10 KB each.
  My approach is to compact these small files into larger ones and build
indexes over them. It is effectively a file system layered on top of the file
system, and it supports append updates and lazy deletion.
  I hope this helps.

-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: specifying number of nodes for job

2008-09-07 Thread Mafish Liu
On Mon, Sep 8, 2008 at 2:25 AM, Sandy [EMAIL PROTECTED] wrote:

 Hi,

 This may be a silly question, but I'm strangely having trouble finding an
 answer for it (perhaps I'm looking in the wrong places?).

 Suppose I have a cluster with n nodes each with m processors.

 I wish to test the performance of, say,  the wordcount program on k
 processors, where k is varied from k = 1 ... nm.


You can specify the number of concurrent tasks per node in your hadoop-site.xml
file, so you can vary k as k = n, 2n, ..., m*n rather than k = 1 ... nm.
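As a rough sketch, the per-node task limits go into hadoop-site.xml like this (property names as found in the 0.17/0.18 hadoop-default.xml; check the defaults file shipped with your release):

   <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>2</value>
   </property>
   <property>
     <name>mapred.tasktracker.reduce.tasks.maximum</name>
     <value>2</value>
   </property>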


 How would I do this? I'm having trouble finding the proper command line
 option in the commands manual (
 http://hadoop.apache.org/core/docs/current/commands_manual.html)



 Thank you very much for your time.

 -SM




-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: Output directory already exists

2008-09-02 Thread Mafish Liu
On Wed, Sep 3, 2008 at 1:24 AM, Shirley Cohen [EMAIL PROTECTED] wrote:

 Hi,

 I'm trying to write the output of two different map-reduce jobs into the
 same output directory. I'm using MultipleOutputFormats to set the filename
 dynamically, so there is no filename collision between the two jobs.
 However, I'm getting the error output directory already exists.

 Does the framework support this functionality? It seems silly to have to
 create a temp directory to store the output files from the second job and
 then have to copy them to the first job's output directory after the second
 job completes.


Map/Reduce creates the output directory every time a job runs and fails if
the directory already exists. There seems to be no way to do what you describe
other than modifying the source code.
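If you do go the temp-directory route, the copy step can stay on the command line; a sketch (the paths are placeholders, and the part-* pattern should be adjusted to whatever names your MultipleOutputFormat produces):

   # run the second job into its own directory, then move its files over
   bin/hadoop dfs -mv /tmp/job2-output/part-* /user/shirley/final-output/
   bin/hadoop dfs -rmr /tmp/job2-output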



 Thanks,

 Shirley




-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: basic questions about Hadoop!

2008-09-01 Thread Mafish Liu
On Sat, Aug 30, 2008 at 10:12 AM, Gerardo Velez [EMAIL PROTECTED]wrote:

 Hi Victor!

 I got a problem with remote writing as well, so I tried to go further on this
 and I would like to share what I did; maybe you will have more luck than me.

 1) As I'm working with user gvelez on the remote host, I had to give write
 access to all, like this:

    bin/hadoop dfs -chmod -R a+w input

 2) After that, there is no more connection refused error, but instead I got
 the following exception



 $ bin/hadoop dfs -copyFromLocal README.txt /user/hadoop/input/README.txt
 cygpath: cannot create short name of d:\hadoop\hadoop-0.17.2\logs
 08/08/29 19:06:51 INFO dfs.DFSClient:
 org.apache.hadoop.ipc.RemoteException:
 jav
 a.io.IOException: File /user/hadoop/input/README.txt could only be
 replicated to
  0 nodes, instead of 1
at
 org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.ja
 va:1145)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
 sorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

 How many datanodes do you have? Only one, I guess.
Modify your $HADOOP_HOME/conf/hadoop-site.xml: look up

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

and make sure the value does not exceed the number of datanodes you actually
have (it must be at least 1; the "replicated to 0 nodes" error above means no
datanode was available to accept even a single replica).
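To confirm how many datanodes are actually up before touching dfs.replication, something like this helps (run the first command on the master; jps is part of the JDK):

   # lists the live datanodes as seen by the namenode
   bin/hadoop dfsadmin -report
   # on each datanode machine, DataNode should appear in the process list
   jps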



 On Fri, Aug 29, 2008 at 9:53 AM, Victor Samoylov 
 [EMAIL PROTECTED]
  wrote:

  Jeff,
 
  Thanks for detailed instructions, but on machine that is not hadoop
 server
  I
  got error:
  ~/hadoop-0.17.2$ ./bin/hadoop dfs -copyFromLocal NOTICE.txt test
  08/08/29 19:33:07 INFO dfs.DFSClient: Exception in
 createBlockOutputStream
  java.net.ConnectException: Connection refused
  08/08/29 19:33:07 INFO dfs.DFSClient: Abandoning block
  blk_-7622891475776838399
  The thing is that file was created, but with zero size.
 
  Do you have ideas why this happened?
 
  Thanks,
  Victor
 
  On Fri, Aug 29, 2008 at 4:10 AM, Jeff Payne [EMAIL PROTECTED] wrote:
 
   You can use the hadoop command line on machines that aren't hadoop
  servers.
   If you copy the hadoop configuration from one of your master servers or
   data
   node to the client machine and run the command line dfs tools, it will
  copy
   the files directly to the data node.
  
   Or, you could use one of the client libraries.  The java client, for
   example, allows you to open up an output stream and start dumping bytes
  on
   it.
  
   On Thu, Aug 28, 2008 at 5:05 PM, Gerardo Velez 
 [EMAIL PROTECTED]
   wrote:
  
Hi Jeff, thank you for answering!
   
 What about remote writing on HDFS? Let's suppose I have an application
 server on a Linux server A, and I have a Hadoop cluster on servers B
 (master), C (slave), and D (slave).

 What I would like is to send some files from server A to be processed by
 Hadoop. So, in order to do that, do I need to send those files to the
 master server first and then copy them to HDFS?

 Or can I pass those files to any slave server?

 Basically I'm looking for remote writing, because the files to be
 processed are not generated on any Hadoop server.
   
Thanks again!
   
-- Gerardo
   
   
   
Regarding
   
On Thu, Aug 28, 2008 at 4:04 PM, Jeff Payne [EMAIL PROTECTED]
  wrote:
   
 Gerardo:

 I can't really speak to all of your questions, but the master/slave
   issue
 is
 a common concern with hadoop.  A cluster has a single namenode and
 therefore
 a single point of failure.  There is also a secondary name node
  process
 which runs on the same machine as the name node in most default
 configurations.  You can make it a different machine by adjusting
 the
 master
 file.  One of the more experienced lurkers should feel free to
  correct
me,
 but my understanding is that the secondary name node keeps track of
  all
the
 same index information used by the primary name node.  So, if the
namenode
 fails, there is no automatic recovery, but you can always tweak
 your
 cluster
 configuration to make the secondary namenode the primary and safely
restart
 the cluster.

 As for the storage of files, the name node is really just the
 traffic
   cop
 for HDFS.  No HDFS files are actually stored on that machine.  It's
 basically used as a directory and lock manager, etc.  The files are
stored
 on multiple datanodes and I'm pretty sure all the actual file I/O
   happens
 directly between the client and the respective datanodes.

 Perhaps one of the more hardcore hadoop people on here will point
 it
   out
if
 I'm giving bad advice.


 On Thu, Aug 28, 2008 at 2:28 PM, Gerardo Velez 
   [EMAIL PROTECTED]
 wrote:

 

Re: Help: how to check the active datanodes?

2008-07-03 Thread Mafish Liu
Hi, zhang:
   Once you start Hadoop with the start-all.sh script, a status page can be
accessed at http://namenode-ip:port/dfshealth. The port is specified by

  <name>dfs.http.address</name>

in your hadoop-default.xml.
If the datanode status is not what you expect, check the log files; they show
the details of the failure.
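For reference, the full property looks like this (the value shown is the usual default; override it in hadoop-site.xml if you need a different address or port):

   <property>
     <name>dfs.http.address</name>
     <value>0.0.0.0:50070</value>
   </property>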

On Fri, Jul 4, 2008 at 4:17 AM, Richard Zhang [EMAIL PROTECTED]
wrote:

 Hi guys:
 I am running hadoop on an 8-node cluster. I use start-all.sh to boot
 hadoop, and it shows that all 8 data nodes are started. However, when I use
 bin/hadoop dfsadmin -report to check the status of the data nodes, it
 shows that only one data node (the one on the same host as the name node) is
 active. How can we know precisely whether all the data nodes are active? Has
 anyone dealt with this before?
 Thanks.
 Richard




-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: reduce task hanging or just slow?

2008-03-31 Thread Mafish Liu
Hi:
    I have run into a similar problem. I eventually found that it was caused
by hostname resolution, because Hadoop uses hostnames to access other nodes.
    To check this, open your jobtracker log file (it usually resides at
$HADOOP_HOME/logs/hadoop--jobtracker-.log) and see whether there is an error
like:
FATAL org.apache.hadoop.mapred.JobTracker: java.net.UnknownHostException:
Invalid hostname for server: local
    If there is, adding IP-hostname pairs to the /etc/hosts files on all of
your nodes may fix the problem.
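A sketch of the kind of /etc/hosts entries that help (the addresses and host names here are made up; the same lines should go on every node):

   192.168.1.10   master
   192.168.1.11   slave1
   192.168.1.12   slave2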

Good luck and best regards.

Mafish

-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: Reduce Hangs

2008-03-30 Thread Mafish Liu
All of the ports are listed in conf/hadoop-default.xml and conf/hadoop-site.xml.
Also, if you are using HBase, you need to look at hbase-default.xml and
hbase-site.xml, located in the HBase directory.
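As an illustration, the best-known ports can be pinned in hadoop-site.xml like this (host names are placeholders; the datanode and tasktracker ports have their own properties whose names vary by release, so check your conf/hadoop-default.xml for the full list):

   <property>
     <name>fs.default.name</name>
     <value>hdfs://master:9000</value>
   </property>
   <property>
     <name>mapred.job.tracker</name>
     <value>master:9001</value>
   </property>
   <property>
     <name>dfs.http.address</name>
     <value>0.0.0.0:50070</value>
   </property>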

2008/3/29 Natarajan, Senthil [EMAIL PROTECTED]:

 Hi,
 Thanks for your suggestions.

 It looks like the problem is with the firewall. I created a firewall rule to
 allow ports 5 to 50100 (I found hadoop listening in this port range).

 It looks like I am missing some ports and those get blocked by the firewall.

 Could anyone please let me know how to configure hadoop to use only
 certain specified ports, so that those ports can be allowed in the firewall?

 Thanks,
 Senthil




-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: Reduce Hangs

2008-03-27 Thread Mafish Liu
On Fri, Mar 28, 2008 at 12:31 AM, 朱盛凯 [EMAIL PROTECTED] wrote:

 Hi,

 I met this problem in my cluster before, so I think I can share some of my
 experience with you, though it may not apply in your case.

 The job in my cluster always hung at 16% of reduce. It occurred because the
 reduce task could not fetch the map output from other nodes.

 In my case, two factors could cause this failure of communication between
 two task trackers.

 One is that the firewall blocks the trackers from communicating. I solved
 this by disabling the firewall.
 The other is that the trackers refer to other nodes by host name only, not
 by IP address. I solved this by editing the /etc/hosts file with mappings
 from hostname to IP address for all nodes in the cluster.


I ran into this problem for the same reason.
Try adding the host names of all nodes to the /etc/hosts files on every node.



 I hope my experience will be helpful for you.

 On 3/27/08, Natarajan, Senthil [EMAIL PROTECTED] wrote:
 
  Hi,
  I have small Hadoop cluster, one master and three slaves.
  When I try the example wordcount on one of our log files (size ~350 MB),
  map runs fine but reduce always hangs (sometimes around 19%, 60%, ...);
  after a very long time it finishes.
  I am seeing this error:
  Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
  In the log I am seeing this:
  INFO org.apache.hadoop.mapred.TaskTracker:
  task_200803261535_0001_r_00_0 0.1834% reduce  copy (11 of 20 at
  0.02 MB/s) 
 
  Do you know what might be the problem?
  Thanks,
  Senthil
 
 




-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: dynamically adding slaves to hadoop cluster

2008-03-09 Thread Mafish Liu
You should do the following steps:
1. Deploy hadoop on the new node with the same directory structure and
configuration as the existing nodes.
2. Run $HADOOP_HOME/bin/hadoop datanode and $HADOOP_HOME/bin/hadoop
tasktracker on it.

The datanode and tasktracker will automatically contact the namenode and
jobtracker specified in the hadoop configuration files and finish adding the
new node to the cluster.
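A sketch of the commands on the new node (bin/hadoop datanode runs in the foreground; the hadoop-daemon.sh wrapper shipped with the release backgrounds the process and writes a log file):

   cd $HADOOP_HOME
   bin/hadoop-daemon.sh start datanode
   bin/hadoop-daemon.sh start tasktracker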

On Mon, Mar 10, 2008 at 4:56 AM, Aaron Kimball [EMAIL PROTECTED] wrote:

 Yes. You should have the same hadoop-site across all your slaves. They
 will need to know the DNS name for the namenode and jobtracker.

 - Aaron

 tjohn wrote:
 
  Mahadev Konar wrote:
 
  I believe (as far as I remember) you should be able to add the node by
  bringing up the datanode or tasktracker on the remote machine. The
  namenode or the jobtracker (I think) does not check for the nodes in the
  slaves file. The slaves file is just used to start up all the daemons by
  sshing to all the nodes in the slaves file during startup. So you
  should just be able to start up the datanode pointing to the correct
  namenode and it should work.
 
  Regards
  Mahadev
 
 
 
 
  Sorry for my ignorance... To make a datanode/tasktracker point to the
  namenode, what should I do? Do I have to edit hadoop-site.xml? Thanks
 
  John
 
 




-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: dynamically adding slaves to hadoop cluster

2008-03-09 Thread Mafish Liu
On Mon, Mar 10, 2008 at 9:47 AM, Mafish Liu [EMAIL PROTECTED] wrote:

 You should do the following steps:
 1. Deploy hadoop on the new node with the same directory structure and
 configuration as the existing nodes.
 2. Run $HADOOP_HOME/bin/hadoop datanode and $HADOOP_HOME/bin/hadoop
 tasktracker on it.

Addition: do not run bin/hadoop namenode -format before you start the
datanode, or you will get an error like "Incompatible namespaceIDs ...".



 The datanode and tasktracker will automatically contact the namenode and
 jobtracker specified in the hadoop configuration files and finish adding
 the new node to the cluster.


 On Mon, Mar 10, 2008 at 4:56 AM, Aaron Kimball [EMAIL PROTECTED]
 wrote:

  Yes. You should have the same hadoop-site across all your slaves. They
  will need to know the DNS name for the namenode and jobtracker.
 
  - Aaron
 
  tjohn wrote:
  
   Mahadev Konar wrote:
  
   I believe (as far as I remember) you should be able to add the node by
   bringing up the datanode or tasktracker on the remote machine. The
   namenode or the jobtracker (I think) does not check for the nodes in the
   slaves file. The slaves file is just used to start up all the daemons by
   sshing to all the nodes in the slaves file during startup. So you
   should just be able to start up the datanode pointing to the correct
   namenode and it should work.
  
   Regards
   Mahadev
  
  
  
  
   Sorry for my ignorance... To make a datanode/tasktracker point to the
   namenode, what should I do? Do I have to edit hadoop-site.xml? Thanks
  
   John
  
  
 



 --
 [EMAIL PROTECTED]
 Institute of Computing Technology, Chinese Academy of Sciences, Beijing.




-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.