Hi,

I'm running Nutch in a pseudo-distributed cluster, i.e. all daemons are running on the same server. I'm writing to the Hadoop list, as it looks like a problem related to Hadoop.

Some of my jobs partially fail, and in the error log I get output like:
2011-06-24 08:45:05,765 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201106231520_0190_r_000000_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
2011-06-24 08:45:05,771 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201106231520_0190_r_000000_0 copy failed: attempt_201106231520_0190_m_000000_0 from worker1
2011-06-24 08:45:05,772 WARN org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException: worker1
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
    at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1458)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1452)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1106)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
Caused by: java.net.UnknownHostException: worker1
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:175)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
    at java.net.Socket.connect(Socket.java:546)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
    at sun.net.www.http.HttpClient.New(HttpClient.java:321)
    at sun.net.www.http.HttpClient.New(HttpClient.java:338)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
    ... 4 more
2011-06-24 08:45:05,772 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201106231520_0190_r_000000_0: Failed fetch #1 from attempt_201106231520_0190_m_000000_0
The above basically says that my worker is unknown, but I can't make any sense of it. Other jobs running before, at the same time, or afterwards complete fine, without any error messages and without any changes on the server. Other reduce tasks in the same run have also succeeded. So it looks like my worker sometimes 'disappears' and can't be reached.

My current theory is that it only happens when a couple of jobs are running at the same time. Is that a plausible explanation?
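Since the exception is an UnknownHostException, the failure seems to be plain hostname resolution rather than anything Hadoop-specific. One thing I could do is exercise the lookup outside Hadoop while jobs are running, e.g. with a small Python sketch like this ('worker1' is just the hostname from my log; the helper is my own):

```python
import socket

def can_resolve(host):
    """Return True if `host` resolves to an IP address, False otherwise."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False

# 'worker1' is the hostname from the failing fetch in the log above.
print("worker1:", can_resolve("worker1"))
print("localhost:", can_resolve("localhost"))
```

Running that in a loop during a busy period should show whether the name only fails to resolve intermittently, which would match the pattern I'm seeing.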
Would anybody have some suggestions for how I could get more information from the system, or point me in a direction where I should look? (I'm also quite new to Hadoop.)
Best Regards
Niels
--
BinaryConstructors ApS
Vestergade 10a, 4th
1456 Kbh K
Denmark
phone: +4529722259
web: http://www.binaryconstructors.dk
mail: [email protected]
skype: nielsboldt