Nutch Buddies,
Our NDFS system appears to be sick. Has anyone else encountered any
of the following symptoms:
1) We started noticing that during start-all.sh, the JobTracker
wasn't coming up properly (at least not right away). It would dump 6
of these errors in its log:
060207 230322 parsing file:/home/crawler/nutch/conf/nutch-default.xml
060207 230322 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060207 230322 Starting tracker
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:507)
at java.net.Socket.connect(Socket.java:457)
at java.net.Socket.<init>(Socket.java:365)
at java.net.Socket.<init>(Socket.java:207)
at org.apache.nutch.ipc.Client$Connection.<init>(Client.java:110)
at org.apache.nutch.ipc.Client.getConnection(Client.java:343)
at org.apache.nutch.ipc.Client.call(Client.java:281)
at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
at $Proxy0.isDir(Unknown Source)
at org.apache.nutch.ndfs.NDFSClient.isDirectory(NDFSClient.java:111)
at org.apache.nutch.fs.NDFSFileSystem.isDirectory(NDFSFileSystem.java:97)
at org.apache.nutch.fs.NutchFileSystem.delete(NutchFileSystem.java:245)
at org.apache.nutch.fs.FileUtil.fullyDelete(FileUtil.java:39)
at org.apache.nutch.mapred.JobTracker.<init>(JobTracker.java:221)
at org.apache.nutch.mapred.JobTracker.startTracker(JobTracker.java:45)
at org.apache.nutch.mapred.JobTracker.main(JobTracker.java:1070)
Eventually, we figured out that this was due to the NameNode taking
longer than expected to come up. Increasing the sleep parameter in
nutch-daemon.sh from 1 second to 10 seconds resolved this problem.
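In case it helps anyone else, the fix is literally a one-line edit to
nutch-daemon.sh (the exact surrounding context in your copy may differ;
this is just the line we changed):

    # give the NameNode more time to come up before the daemons started
    # after it (e.g., the JobTracker) try to connect to it
    sleep 10    # was: sleep 1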
2) We're able to use bin/nutch ndfs -ls to poke around inside of
NDFS, and all seems well. We never get any errors, but we're
obviously not hitting it very hard, nor are we launching any
MapReduce jobs in this case.
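For what it's worth, the poking around is nothing fancier than listings
like these (the paths here are only illustrative):

    bin/nutch ndfs -ls /
    bin/nutch ndfs -ls /user/crawler/crawl/crawldb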
3) Whenever we do launch a MapReduce job (e.g., readdb <crawldb>
-stats), it starts out OK, but then...
a) A few of the map tasks choke with an ArrayIndexOutOfBoundsException:
060209 215223 task_m_daelb1 Array bounds problem in NFSDataInputStream.Checker.Read: summed=2512, read=4096, goal=-511, bytesPerSum=1, inSum=512, inBuf=1584, toSum=-511, b.length=4096, off=0
060209 215223 task_m_daelb1 Error running child
060209 215223 task_m_daelb1 java.lang.ArrayIndexOutOfBoundsException
060209 215223 task_m_daelb1 at java.util.zip.CRC32.update(CRC32.java:43)
060209 215223 task_m_daelb1 at org.apache.nutch.fs.NFSDataInputStream$Checker.read(NFSDataInputStream.java:93)
060209 215223 task_m_daelb1 at org.apache.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:173)
060209 215223 task_m_daelb1 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
060209 215223 task_m_daelb1 at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
060209 215223 task_m_daelb1 at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
060209 215223 task_m_daelb1 at java.io.DataInputStream.readFully(DataInputStream.java:176)
060209 215223 task_m_daelb1 at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060209 215223 task_m_daelb1 at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060209 215223 task_m_daelb1 at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:378)
060209 215223 task_m_daelb1 at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:301)
060209 215223 task_m_daelb1 at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:323)
060209 215223 task_m_daelb1 at org.apache.nutch.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:60)
060209 215223 task_m_daelb1 at org.apache.nutch.mapred.MapTask$2.next(MapTask.java:106)
060209 215223 task_m_daelb1 at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48)
060209 215223 task_m_daelb1 at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060209 215223 task_m_daelb1 at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060209 215223 Server connection on port 50050 from 127.0.0.1: exiting
060209 215226 task_m_daelb1 done; removing files.
As you can see, I added some diagnostic output to
NFSDataInputStream.Checker.read() to dump out a bunch of variables.
I'm still looking into what might have caused bytesPerSum to remain
at 1 (the value it's initially set to), what that means, why the code
didn't anticipate inSum being so much larger, and so on.
b) The reduce phase begins even though it doesn't seem like the map
phase really finished successfully.
c) Nearly all of the reduce tasks seem to choke with timeouts before
getting even 10% finished.
d) The longer I let this go, the more map tasks that previously
claimed to be complete change their status to 0.0 percent complete,
report ArrayIndexOutOfBoundsExceptions, or hit read timeouts:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.read(DataInputStream.java:134)
at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.read(NDFSClient.java:440)
at org.apache.nutch.fs.NFSDataInputStream$Checker.read(NFSDataInputStream.java:82)
at org.apache.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:173)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.readFully(DataInputStream.java:176)
at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:378)
at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:301)
at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:323)
at org.apache.nutch.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:60)
at org.apache.nutch.mapred.MapTask$2.next(MapTask.java:106)
at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
e) Eventually, TaskTrackers stop sending heartbeats and drop off the
list. However, these TaskTrackers, their child processes, and the
DataNodes on the same machines are all still running.
More importantly, once they get into this state, these processes
refuse to go away. I'm not that surprised that they don't respond to
stop-all.sh, but they also ignore kill -9 and prevent a soft reboot.
Our only option at this point is to have someone in the room power
cycle the machines.
Finally, we were already having problems with our crawl (socket
timeouts, etc.), but one change we did make was to move the tmp
directory on each of the slaves to a different (larger) hard drive.
We were using a symbolic link from /home/crawler/tmp already, so we
just copied the contents from each old drive to each new drive, then
pointed each link at the location on the new drive. I know the
symbolic links work, because I've checked them all by hand. It makes
me wonder, though, whether NDFS might have cached a resolved pathname
somewhere, which our disk swap has rendered invalid.
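Roughly speaking, the swap on each slave looked like the following
(the mount points here are only illustrative; ours differ from machine
to machine):

    # copy the old tmp contents onto the new (larger) drive
    cp -a /mnt/old_disk/nutch-tmp/. /mnt/new_disk/nutch-tmp/
    # repoint the existing symlink at the new location
    rm /home/crawler/tmp
    ln -s /mnt/new_disk/nutch-tmp /home/crawler/tmp
    # sanity check: the link should now resolve to the new drive
    ls -ld /home/crawler/tmp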
Any help/guidance would be greatly appreciated.
Thanks,
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------