Gang,

We have apparently solved our NDFS problem, and we think it stemmed from the open file limit (ulimit -n) being too low. This is the second time we've increased this on the master machine, and it is now 16K.
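
For anyone who wants to try the same thing, this is roughly how the limit gets raised on Linux (a sketch, not an exact transcript of what we did; the "crawler" user below is just an example):

    # check the per-process limit in the shell that launches the Nutch daemons
    ulimit -n

    # raise it just for the current shell before running start-all.sh
    ulimit -n 16384

    # or raise it persistently by adding lines like these to /etc/security/limits.conf
    # (takes effect at that user's next login):
    #   crawler  soft  nofile  16384
    #   crawler  hard  nofile  16384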

To verify that this was indeed the problem, I tried setting it back to 8K. Unfortunately, the readdb <crawldb> -stats command still completed normally. However, I've noticed that readdb seems to run more quickly the second time than the first; I theorize that it caches some intermediate results and therefore doesn't need as many open files on the second run.
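
One way to sanity-check how close we actually get to the limit (assuming lsof is installed on the master) would be to count the NameNode's open file descriptors while readdb is running; the class name in the grep below is my guess at what shows up in ps output:

    # find the NameNode pid and count its open file descriptors
    NN_PID=`ps ax | grep org.apache.nutch.ndfs.NameNode | grep -v grep | awk '{print $1}'`
    lsof -p $NN_PID | wc -l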

However, it could be that the fix was actually something else we tried yesterday. Here's our system administrator's report on what he changed (I've sketched the corresponding commands right after his note):

At 5:24 PM -0800 2/10/06, [EMAIL PROTECTED] wrote:
1) Disabled some unnecessary junk: autofs; the default runlevel is now 3 instead of 5.

2) Disabled SELinux (an over-reaching security enhancement that prevents a number of things from operating "normally"; for example, it was preventing Apache from opening a socket to talk to Resin). I did this server by server, trying to observe differences in the behaviour.

3) Finally, increased the open file limit on m1 because the S servers were reporting "connection refused" errors. If the open file limit had actually been reached, it should have been logged somewhere on m1 (it wasn't).
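
A rough reconstruction of the commands behind steps 1 and 2 (RHEL-style administration; my sketch, not a transcript of what he actually ran). Step 3 is the same limits.conf change described at the top of this message:

    # 1) stop autofs from starting at boot, and make runlevel 3 the default
    chkconfig autofs off
    # edit /etc/inittab so the default line reads:
    #   id:3:initdefault:

    # 2) disable SELinux enforcement immediately...
    setenforce 0
    # ...and persistently, by setting SELINUX=disabled in /etc/selinux/config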

Bottom line: When you have problems with Nutch (particularly if it was working before), try the following:

1) Increase ipc.client.timeout (we currently have this set to 1800000, or 30 minutes).

2) Increase the open file limit (we currently have this set to 16K).
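
For concreteness, ipc.client.timeout is just a property override in conf/nutch-site.xml; a sketch of the entry (inside the file's usual wrapper element) looks like this:

    <property>
      <name>ipc.client.timeout</name>
      <value>1800000</value> <!-- 30 minutes, in milliseconds -->
    </property>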

Happy Nutching!

- Chris

At 7:05 AM -0800 2/10/06, Chris Schneider wrote:
Nutch Buddies,

Our NDFS system appears to be sick. Has anyone else encountered any of the following symptoms:

1) We started noticing that during start-all.sh, the JobTracker wasn't coming up properly (at least not right away). It would dump 6 of these errors in its log:

060207 230322 parsing file:/home/crawler/nutch/conf/nutch-default.xml
060207 230322 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060207 230322 Starting tracker
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:507)
        at java.net.Socket.connect(Socket.java:457)
        at java.net.Socket.<init>(Socket.java:365)
        at java.net.Socket.<init>(Socket.java:207)
        at org.apache.nutch.ipc.Client$Connection.<init>(Client.java:110)
        at org.apache.nutch.ipc.Client.getConnection(Client.java:343)
        at org.apache.nutch.ipc.Client.call(Client.java:281)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy0.isDir(Unknown Source)
        at org.apache.nutch.ndfs.NDFSClient.isDirectory(NDFSClient.java:111)
        at org.apache.nutch.fs.NDFSFileSystem.isDirectory(NDFSFileSystem.java:97)
        at org.apache.nutch.fs.NutchFileSystem.delete(NutchFileSystem.java:245)
        at org.apache.nutch.fs.FileUtil.fullyDelete(FileUtil.java:39)
        at org.apache.nutch.mapred.JobTracker.<init>(JobTracker.java:221)
        at org.apache.nutch.mapred.JobTracker.startTracker(JobTracker.java:45)
        at org.apache.nutch.mapred.JobTracker.main(JobTracker.java:1070)

Eventually, we figured out that this was due to the NameNode taking longer than expected to come up. Increasing the sleep parameter in nutch-daemon.sh from 1 second to 10 seconds resolved this problem.
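
In case it helps anyone else, the change amounts to a one-line edit in bin/nutch-daemon.sh (the exact location of the line varies between releases, so treat this as a sketch):

    # before: the script only waits one second after launching a daemon
    sleep 1
    # after: give the NameNode more time to come up before anything tries to talk to it
    sleep 10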

2) We're able to use bin/nutch ndfs -ls to poke around inside of NDFS, and all seems well. We never get any errors, but we're obviously not hitting it very hard, nor are we launching any MapReduce jobs in this case.

3) Whenever we do launch a MapReduce job (e.g., readdb <crawldb> -stats), it starts out OK, but then...

a) A few of the map tasks choke with an ArrayIndexOutOfBoundsException:

060209 215223 task_m_daelb1 Array bounds problem in NFSDataInputStream.Checker.Read: summed=2512, read=4096, goal=-511, bytesPerSum=1, inSum=512, inBuf=1584, toSum=-511, b.length=4096, off=0
060209 215223 task_m_daelb1  Error running child
060209 215223 task_m_daelb1 java.lang.ArrayIndexOutOfBoundsException
060209 215223 task_m_daelb1     at java.util.zip.CRC32.update(CRC32.java:43)
060209 215223 task_m_daelb1     at org.apache.nutch.fs.NFSDataInputStream$Checker.read(NFSDataInputStream.java:93)
060209 215223 task_m_daelb1     at org.apache.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:173)
060209 215223 task_m_daelb1     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
060209 215223 task_m_daelb1     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
060209 215223 task_m_daelb1     at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
060209 215223 task_m_daelb1     at java.io.DataInputStream.readFully(DataInputStream.java:176)
060209 215223 task_m_daelb1     at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060209 215223 task_m_daelb1     at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060209 215223 task_m_daelb1     at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:378)
060209 215223 task_m_daelb1     at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:301)
060209 215223 task_m_daelb1     at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:323)
060209 215223 task_m_daelb1     at org.apache.nutch.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:60)
060209 215223 task_m_daelb1     at org.apache.nutch.mapred.MapTask$2.next(MapTask.java:106)
060209 215223 task_m_daelb1     at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48)
060209 215223 task_m_daelb1     at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060209 215223 task_m_daelb1     at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060209 215223 Server connection on port 50050 from 127.0.0.1: exiting
060209 215226 task_m_daelb1 done; removing files.

As you can see, I added some diagnostic output to NFSDataInputStream.Checker.read() to dump out a bunch of variables. I'm still looking into what might have caused bytesPerSum to remain only 1 (what it's originally initialized to), what this means, why the code didn't anticipate inSum being so much larger, etc.

b) The reduce phase begins even though it doesn't seem like the map phase actually finished successfully.

c) Nearly all of the reduce tasks seem to choke with timeouts before getting even 10% finished.

d) The longer I let this go, the more map tasks that had previously claimed to be complete change their status back to 0.0 percent complete, hit ArrayIndexOutOfBoundsExceptions, or fail with read timeouts:

java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.read(DataInputStream.java:134)
        at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.read(NDFSClient.java:440)
        at org.apache.nutch.fs.NFSDataInputStream$Checker.read(NFSDataInputStream.java:82)
        at org.apache.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:173)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
        at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
        at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:378)
        at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:301)
        at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:323)
        at org.apache.nutch.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:60)
        at org.apache.nutch.mapred.MapTask$2.next(MapTask.java:106)
        at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
        at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)

e) Eventually, the TaskTrackers stop sending heartbeats and drop off the list. However, these TaskTrackers, their child processes, and the DataNodes on the same machines are all still running.

More importantly, once they get into this state, these processes refuse to go away. I'm not that surprised that they don't respond to stop-all.sh, but they also ignore kill -9 and prevent a soft reboot. Our only option at this point is to have someone in the room power cycle the machines.

Finally, we were already having problems with our crawl (socket timeouts, etc.), but one change we did make was to move the tmp directory on each of the slaves to a different (larger) hard drive. We were already using a symbolic link from /home/crawler/tmp, so we just copied the contents from each old drive to each new drive, then pointed each link at the location on the new drive. I know the symbolic links work, because I've checked them all by hand. It makes me wonder, though, whether NDFS might have cached a resolved pathname somewhere that our disk swap has rendered invalid.
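
For completeness, the per-slave move was essentially the following (the /mnt/* mount points are placeholders, not our actual paths):

    # copy the existing tmp contents from the old drive to the new, larger one
    cp -a /mnt/old_disk/tmp /mnt/new_disk/tmp

    # repoint the existing symlink at the new location
    rm /home/crawler/tmp
    ln -s /mnt/new_disk/tmp /home/crawler/tmp

    # verify that the link resolves where we expect
    ls -l /home/crawler/tmp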

Any help/guidance would be greatly appreciated.

Thanks,

- Chris

--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------
