Hello,

I have been looking at Hadoop for a while now and have been trying to get
0.4.0 working with Nutch to do a small distributed crawl. The problem is
that whenever the fetch task nears completion, the job fails.

I am running 2 datanodes and 5 tasktracker nodes. One of the tasktracker
nodes has lots of entries in the log such as:

2006-07-25 00:02:56,690 INFO mapred.TaskRunner (ReduceTaskRunner.java:copyOutput(240)) - task_0001_r_000024_2 done copying task_0001_m_000001_0 output from fox10.nameprotect.com.
2006-07-25 00:02:59,656 WARN mapred.TaskRunner (ReduceTaskRunner.java:copyOutput(246)) - task_0001_r_000038_2 failed to copy task_0001_m_000001_0 output from fox10.nameprotect.com.
2006-07-25 00:02:59,657 WARN mapred.TaskRunner (ReduceTaskRunner.java:run(210)) - task_0001_r_000038_2 copy failed: task_0001_m_000001_0 from fox10.nameprotect.com
2006-07-25 00:02:59,657 WARN mapred.TaskRunner (ReduceTaskRunner.java:run(212)) - java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.Socket.connect(Socket.java:507)
    at java.net.Socket.connect(Socket.java:457)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:157)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:365)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:477)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:214)
    at sun.net.www.http.HttpClient.New(HttpClient.java:287)
    at sun.net.www.http.HttpClient.New(HttpClient.java:299)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:784)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:736)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:661)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:905)
    at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:108)
    at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:237)
    at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:207)
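
The ConnectException suggests the copier can't even open a connection to
fox10's HTTP server. To rule out a basic networking problem, I've been
poking the tasktracker HTTP port from one of the failing nodes with a
throwaway class like the one below. Note the port 50060 is just my guess
at the tasktracker HTTP port, and the 10-second timeout is arbitrary;
substitute whatever your cluster actually uses.

    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Quick probe: can this node open a TCP connection to the tasktracker
    // HTTP server on fox10? Port 50060 is an assumption; adjust as needed.
    public class ProbeFox10 {
        public static void main(String[] args) throws Exception {
            Socket s = new Socket();
            try {
                s.connect(new InetSocketAddress("fox10.nameprotect.com", 50060), 10000);
                System.out.println("connected OK");
            } finally {
                s.close();
            }
        }
    }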

So, looking at fox10, I see this in the log:

2006-07-25 00:03:15,821 WARN mapred.TaskTracker - Unknown child with bad map output: task_0001_m_000001_0. Ignored.
2006-07-25 00:03:15,990 WARN mapred.TaskTracker - Http server (getMapOutput.jsp): java.io.FileNotFoundException: /index2/nutch/filesystem/mapreduce/local/task_0001_m_000001_0/part-98.out
    at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:121)
    at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:47)
    at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:229)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:158)
    at org.apache.hadoop.mapred.getMapOutput_jsp._jspService(getMapOutput_jsp.java:64)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
    at org.mortbay.http.HttpServer.service(HttpServer.java:954)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
2006-07-25 00:03:15,990 WARN mapred.TaskTracker - Unknown child with bad map output: task_0001_m_000001_0. Ignored.
2006-07-25 00:03:16,486 WARN mapred.TaskRunner - task_0001_r_000080_2 failed to copy task_0001_m_000001_0 output from fox10.nameprotect.com.
2006-07-25 00:03:16,486 WARN mapred.TaskRunner - task_0001_r_000080_2 copy failed: task_0001_m_000001_0 from fox10.nameprotect.com
2006-07-25 00:03:16,489 WARN mapred.TaskRunner - java.net.ConnectException: Connection timed out
    [stack trace identical to the ConnectException trace above]


The funny thing is that the tasktrackers are all still up and running.
The datanodes seem fine, and so does the jobtracker. A small insert into
the crawldb works, as does the generate job that produced the segments
I'm trying to fetch. But something goes wrong in the fetch job itself.
Does anyone have any ideas what could be wrong?
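
For what it's worth, a quick way to check whether fox10 ever wrote the
output the reducers are asking for is to list the task directory straight
out of the FileNotFoundException above. Something like this (the path is
taken verbatim from the log; part-98.out is just the partition that
happened to fail for me):

    import java.io.File;

    // Run on fox10: list the map output directory named in the
    // FileNotFoundException and show what partitions actually exist.
    public class CheckMapOutput {
        public static void main(String[] args) {
            File dir = new File(
                "/index2/nutch/filesystem/mapreduce/local/task_0001_m_000001_0");
            File[] parts = dir.listFiles();
            if (parts == null) {
                System.out.println(dir + " is missing or unreadable");
                return;
            }
            for (File f : parts) {
                System.out.println(f.getName() + "\t" + f.length() + " bytes");
            }
        }
    }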

My hadoop-site.xml, for reference:





<configuration>

<property>
  <name>mapred.map.tasks</name>
  <value>25</value>
  <description>
    Define mapred.map.tasks to be the number of slave hosts.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>25</value>
  <description>
    Define mapred.reduce.tasks to be the number of slave hosts.
  </description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>5</value>
  <description>The maximum number of tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/usr/local/nutch/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/index2/nutch/filesystem/data</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/index2/nutch/filesystem/mapreduce/system</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/index2/nutch/filesystem/mapreduce/local</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
  <description>Java opts for the task tracker child processes. Subsumes
  'mapred.child.heap.size' (If a mapred.child.heap.size value is found
  in a configuration, its maximum heap size will be used and a warning
  emitted that heap.size has been deprecated). Also, the following
  symbols, if present, will be interpolated: @taskid@ is replaced by
  current TaskID; and @port@ will be replaced by
  mapred.task.tracker.report.port + 1 (A second child will fail with a
  port-in-use if mapred.tasktracker.tasks.maximum is greater than one).
  Any other occurrences of '@' will go unchanged. For example, to enable
  verbose gc logging to a file named for the taskid in /tmp and to set
  the heap maximum to be a gigabyte, pass a 'value' of:
        -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
  </description>
</property>

<!-- i/o properties -->

<property>
  <name>io.sort.factor</name>
  <value>100</value>
  <description>The number of streams to merge at once while sorting
  files. This determines the number of open file handles.</description>
</property>

<property>
  <name>io.sort.mb</name>
  <value>500</value>
  <description>The total amount of buffer memory to use while sorting
  files, in megabytes. By default, gives each merge stream 1MB, which
  should minimize seeks.</description>
</property>

</configuration>


Thanks for the great work on Hadoop!

Greg
