[Nutch-dev] RE: Fetcher Hung

Nutch Crawler Sun, 14 Nov 2004 21:54:00 -0800

> When it slows, can you get a stack dump by sending SIGQUIT?  My hunch is


Here is a sample of the SIGQUIT output. Each fetcher thread is in one of
these two states:

"fetcher12" prio=1 tid=0x082f8448 nid=0x18cb runnable [0x6de56000..0x6de56600]
       at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
       at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:838)
       at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1176)
       at java.net.InetAddress.getAllByName0(InetAddress.java:1126)
       at java.net.InetAddress.getAllByName0(InetAddress.java:1098)
       at java.net.InetAddress.getAllByName(InetAddress.java:1061)
       at java.net.InetAddress.getByName(InetAddress.java:958)
       at java.net.InetSocketAddress.<init>(InetSocketAddress.java:124)
       at net.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:94)
       at net.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:53)
       at net.nutch.protocol.http.RobotRulesParser.isAllowed(RobotRulesParser.java:364)
       at net.nutch.protocol.http.Http.getContent(Http.java:145)
       at net.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:106)

"fetcher13" prio=1 tid=0x082f96b0 nid=0x18cc in Object.wait() [0x6ddd5000..0x6ddd5480]
       at java.lang.Object.wait(Native Method)
       - waiting on <0x763a7238> (a java.util.HashMap)
       at java.lang.Object.wait(Object.java:474)
       at java.net.InetAddress.checkLookupTable(InetAddress.java:1226)
       - locked <0x763a7238> (a java.util.HashMap)
       at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1165)
       at java.net.InetAddress.getAllByName0(InetAddress.java:1126)
       at java.net.InetAddress.getAllByName0(InetAddress.java:1098)
       at java.net.InetAddress.getAllByName(InetAddress.java:1061)
       at java.net.InetAddress.getByName(InetAddress.java:958)
       at java.net.InetSocketAddress.<init>(InetSocketAddress.java:124)
       at net.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:94)
       at net.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:53)
       at net.nutch.protocol.http.RobotRulesParser.isAllowed(RobotRulesParser.java:364)
       at net.nutch.protocol.http.Http.getContent(Http.java:145)
       at net.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:106)
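
To see at a glance how many fetcher threads sit in each of these states, the
dump can be tallied by the top stack frame of each thread. This is only a
minimal sketch: DumpTally and tallyTopFrames are made-up names (not part of
Nutch), and the sample dump in main is abbreviated from the traces above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DumpTally {
    // Counts threads by their top stack frame (method name only).
    // Assumption: each thread header line starts with a double quote and
    // the first following line starting with "at " is the top frame.
    static Map<String, Integer> tallyTopFrames(String dump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        boolean awaitingFrame = false;
        for (String raw : dump.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // New thread header: the next "at ..." line is its top frame.
                awaitingFrame = true;
            } else if (awaitingFrame && line.startsWith("at ")) {
                String frame = line.substring(3);
                int paren = frame.indexOf('(');
                if (paren >= 0) {
                    frame = frame.substring(0, paren); // drop "(File.java:NN)"
                }
                counts.merge(frame, 1, Integer::sum);
                awaitingFrame = false;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Abbreviated sample, not a full dump.
        String sample =
            "\"fetcher12\" prio=1 runnable\n"
            + "       at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)\n"
            + "       at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:838)\n"
            + "\n"
            + "\"fetcher13\" prio=1 in Object.wait()\n"
            + "       at java.lang.Object.wait(Native Method)\n"
            + "       - waiting on <0x763a7238> (a java.util.HashMap)\n";
        for (Map.Entry<String, Integer> e : tallyTopFrames(sample).entrySet()) {
            System.out.println(e.getValue() + "  " + e.getKey());
        }
    }
}
```

If nearly all threads fall under Object.wait with checkLookupTable just
below it, and only a few under lookupAllHostAddr, that would suggest the
fetchers are queued behind a small number of slow DNS lookups.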

>  Are the pdf and word parsers enabled, as they are by default?
>   These have been known to hang before.
...
> So one workaround would be to disable these, using something like:
> <property>
>    <name>plugin.excludes</name>
>    <value>(protocol-(?!http).*)|(parse-(?!html).*)</value>
> </property>

I have turned off all parsers except the HTML parser by setting this property.
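
As a quick check of what that expression covers, the pattern can be run
against a few plugin ids. This is a rough sketch, assuming the pattern has to
match the whole plugin id for the plugin to be excluded; PluginExcludesDemo
and isExcluded are made-up names, and the ids in main are just examples.

```java
import java.util.regex.Pattern;

class PluginExcludesDemo {
    // The plugin.excludes value quoted above.
    static final Pattern EXCLUDES =
        Pattern.compile("(protocol-(?!http).*)|(parse-(?!html).*)");

    // Assumption: a plugin is excluded when the whole id matches the pattern.
    static boolean isExcluded(String pluginId) {
        return EXCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Example ids only; check them against the plugins directory in use.
        String[] ids = {
            "protocol-http", "protocol-ftp",
            "parse-html", "parse-pdf", "parse-msword", "parse-text"
        };
        for (String id : ids) {
            System.out.println(id + " -> " + (isExcluded(id) ? "excluded" : "kept"));
        }
    }
}
```

The negative lookaheads (?!http) and (?!html) are what keep protocol-http
and parse-html enabled while every other protocol-* and parse-* id is
excluded.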

> Another workaround would be to use the whole-web method, and break your
> fetchlists into smaller chunks that take less than 12 hours to fetch,
> using, e.g., the -numFetchers parameter when generating fetchlists.  But
> this is substantially more complicated if you're currently using the
> crawl command.

I am using the whole web method.

Intel(R) Pentium(R) 4 CPU 2.80GHz
WhiteBox Linux 2.4.21-20.EL
java version "1.5.0"
nutch-2004-11-12.tar.gz

Let me know if you need any other info.
Any suggestions would be helpful.
Thanks,
Ralph


