On Fri, Nov 12, 2004 at 10:41:49AM +0100, Doug Cutting wrote:
> Bill Goffe wrote:
> >I'm having the same problem on a Xeon machine with a gig of RAM running
> >Debian and Sun's Java. I'm using the intranet method with about 10,000
> >sites. After 12-24 hours of running, it really slows down (can be several
> >minutes between fetches). It finally dies with
> >    SEVERE error writing output:java.lang.OutOfMemoryError
> 
> When it slows, can you get a stack dump by sending SIGQUIT?  My hunch is 
> that, slowly, more and more fetcher threads are somehow hung.  A stack 
> dump of all threads would show whether this is the case, and where they 
> are hung.  Are the pdf and word parsers enabled, as they are by default? 
>  These have been known to hang before.
> 
> Perhaps we should, by default, disable all parsers but html and all 
> protocols but http, in order to make the default configuration more 
> bulletproof.  Then folks can selectively enable more plugins and we'll 
> have a better idea of which are problematic.  What do folks think?
> 
> So one workaround would be to disable these, using something like:
> 
> <property>
>   <name>plugin.excludes</name>
>   <value>(protocol-(?!http).*)|(parse-(?!html).*)</value>
> </property>

This makes sense. A more conservative default would keep new users from
getting frustrated.
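
To see what the plugin.excludes pattern quoted above would switch off, here
is a small standalone sketch. It is not Nutch code: the class name, the
isExcluded helper and the list of plugin ids are made up for illustration,
and it assumes the pattern has to match the whole plugin id.

```java
import java.util.regex.Pattern;

public class PluginExcludesCheck {
    // The value proposed for plugin.excludes in the quoted message.
    static final Pattern EXCLUDES =
        Pattern.compile("(protocol-(?!http).*)|(parse-(?!html).*)");

    // Assumption: a plugin is excluded only if the pattern matches its whole id.
    static boolean isExcluded(String pluginId) {
        return EXCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Illustrative ids only; check the plugins directory for the real list.
        String[] ids = {
            "protocol-http", "protocol-ftp", "protocol-file",
            "parse-html", "parse-pdf", "parse-msword", "parse-text"
        };
        for (String id : ids) {
            System.out.println(id + " -> " + (isExcluded(id) ? "excluded" : "kept"));
        }
    }
}
```

Note that the negative lookahead keeps any id that merely starts with
protocol-http or parse-html, so a plugin named, say, protocol-httpclient
would stay enabled too.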

> 
> Another workaround would be to use the whole-web method, and break your 
> fetchlists into smaller chunks that take less than 12 hours to fetch, 
> using, e.g., the -numFetchers parameter when generating fetchlists.  But 
> this is substantially more complicated if you're currently using the 
> crawl command.

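There is one more way to debug: run the fetcher with the -noParsing option,
then run ./bin/nutch parse as a separate step. That splits fetching from
parsing, so you can tell which of the two is hanging.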

John


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers