Bill Goffe wrote:
I'm having the same problem on a Xeon machine with a gig of RAM running
Debian and Sun's Java. I'm using the intranet method with about 10,000
sites. After 12-24 hours of running, it really slows down (can be several
minutes between fetches). It finally dies with:

  SEVERE error writing output: java.lang.OutOfMemoryError

When it slows down, can you get a stack dump by sending SIGQUIT? My hunch is that, over time, more and more fetcher threads are getting hung. A stack dump of all threads would show whether this is the case, and where they are hung. Are the PDF and Word parsers enabled, as they are by default? These have been known to hang before.
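For example, once you've found the fetcher's JVM process id (with ps, for instance):

  kill -QUIT <pid>

Sun's JVM responds by printing a stack trace of every thread to its stdout without terminating the process, so you can grab a dump each time it slows down and compare them.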


Perhaps we should, by default, disable all parsers but html and all protocols but http, in order to make the default configuration more bulletproof. Then folks can selectively enable more plugins and we'll have a better idea of which are problematic. What do folks think?

So one workaround would be to disable these, using something like:

<property>
  <name>plugin.excludes</name>
  <value>(protocol-(?!http).*)|(parse-(?!html).*)</value>
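  <description>Exclude all protocol plugins except protocol-http and all
  parse plugins except parse-html; the (?!...) groups are negative
  lookaheads.</description>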
</property>
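(Put this override in your nutch-site.xml so it takes precedence over nutch-default.xml, then restart the crawl for it to take effect.)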

Another workaround would be to use the whole-web method, and break your fetchlists into smaller chunks that take less than 12 hours to fetch, using, e.g., the -numFetchers parameter when generating fetchlists. But this is substantially more complicated if you're currently using the crawl command.
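For example (a sketch; I'm quoting the command from memory, so check the usage that bin/nutch generate prints for your version):

  bin/nutch generate db segments -numFetchers 4

This splits the next round's fetchlist into four segments, each of which can then be fetched in its own, shorter-running pass.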

Doug


