On Fri, Nov 12, 2004 at 10:41:49AM +0100, Doug Cutting wrote:
> Bill Goffe wrote:
> >I'm having the same problem on a Xeon machine with a gig of RAM running
> >Debian and Sun's Java. I'm using the intranet method with about 10,000
> >sites. After 12-24 hours of running, it really slows down (can be several
> >minutes between fetches). It finally dies with
> >  SEVERE error writing output:java.lang.OutOfMemoryError
>
> When it slows, can you get a stack dump by sending SIGQUIT? My hunch is
> that, slowly, more and more fetcher threads are somehow hung. A stack
> dump of all threads would show whether this is the case, and where they
> are hung. Are the pdf and word parsers enabled, as they are by default?
> These have been known to hang before.
>
> Perhaps we should, by default, disable all parsers but html and all
> protocols but http, in order to make the default configuration more
> bulletproof. Then folks can selectively enable more plugins and we'll
> have a better idea of which are problematic. What do folks think?
>
> So one workaround would be to disable these, using something like:
>
> <property>
>   <name>plugin.excludes</name>
>   <value>(protocol-(?!http).*)|(parse-(?!html).*)</value>
> </property>
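
To see what the plugin.excludes expression quoted above would filter out, here is a minimal sketch in Java. It assumes the value is matched against the whole plugin id; I have not checked how the plugin loader actually applies it. The class name, the helper method, and the sample plugin ids are made up for illustration.

```java
import java.util.regex.Pattern;

public class PluginExcludesDemo {

    // The expression from the plugin.excludes property quoted above.
    private static final Pattern EXCLUDES =
            Pattern.compile("(protocol-(?!http).*)|(parse-(?!html).*)");

    // Hypothetical helper: true if the plugin id would be excluded,
    // ASSUMING the expression must match the entire id.
    static boolean isExcluded(String pluginId) {
        return EXCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Sample ids chosen for illustration only.
        String[] ids = {
            "protocol-http", "protocol-ftp", "protocol-file",
            "parse-html", "parse-pdf", "parse-msword", "index-basic"
        };
        for (String id : ids) {
            System.out.println(id + " -> " + (isExcluded(id) ? "excluded" : "kept"));
        }
    }
}
```

Under that assumption, protocol-http and parse-html are kept because the negative lookaheads reject them, every other protocol-* and parse-* id is excluded, and ids with other prefixes are left alone. Note that the lookahead only checks the prefix, so any id that merely starts with protocol-http or parse-html would be kept as well.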

This makes sense, so that new users won't get frustrated.

> Another workaround would be to use the whole-web method, and break your
> fetchlists into smaller chunks that take less than 12 hours to fetch,
> using, e.g., the -numFetchers parameter when generating fetchlists. But
> this is substantially more complicated if you're currently using the
> crawl command.

There is one more way to debug: run the fetcher with the -noParsing option,
then run ./bin/nutch parse. That should show whether the hang is in fetching
or in parsing.

John

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
