I tried a number of things to deal with the hung crawl I reported
below (yes, some time back). I was never able to get any output by sending
a SIGQUIT, though I tried several variations. I was parsing PDF files
(there's a lot of useful information in them for my application).
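(For anyone else trying this: the standard approach is to send SIGQUIT to
the fetcher's JVM, which should make it print a stack dump of every thread
to its stdout, e.g.

  kill -QUIT <pid-of-the-nutch-jvm>

where the pid placeholder is whatever your process id is. In my case
nothing was ever printed.)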
FWIW, I later increased the maximum size the parsers would accept, on the
theory that a thread was hanging whenever a download hit the relatively
small default http.content.limit (65536 bytes). I was also getting a
number of files that had been cut off at that limit and so were too
incomplete to parse.
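For reference, the override goes in conf/nutch-site.xml and looks
something like this (1048576 is just an example value; if I read the code
right, a negative value disables truncation entirely):

  <property>
    <name>http.content.limit</name>
    <value>1048576</value>
  </property>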
I ended up using the whole-web method, driven by a Perl script I wrote to
make running it more manageable. (I'd be glad to share it, but it is a
real hack.) This seems to work fine for me.
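The loop the script drives is just the standard generate/fetch/update
cycle from the tutorial, roughly like the following (exact arguments vary
by version, so check the tutorial for yours):

  bin/nutch generate db segments
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb db $s

repeated until the db stops producing new fetchlists.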
- Bill
Doug wrote:
> Bill Goffe wrote:
> >I'm having the same problem on a Xeon machine with a gig of RAM running
> >Debian and Sun's Java. I'm using the intranet method with about 10,000
> >sites. After 12-24 hours of running, it really slows down (can be several
> >minutes between fetches). It finally dies with
> > SEVERE error writing output:java.lang.OutOfMemoryError
>
> When it slows, can you get a stack dump by sending SIGQUIT? My hunch is
> that, slowly, more and more fetcher threads are somehow hung. A stack
> dump of all threads would show whether this is the case, and where they
> are hung. Are the pdf and word parsers enabled, as they are by default?
> These have been known to hang before.
>
> Perhaps we should, by default, disable all parsers but html and all
> protocols but http, in order to make the default configuration more
> bulletproof. Then folks can selectively enable more plugins and we'll
> have a better idea of which are problematic. What do folks think?
>
> So one workaround would be to disable these, using something like:
>
> <property>
> <name>plugin.excludes</name>
> <value>(protocol-(?!http).*)|(parse-(?!html).*)</value>
> </property>
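>
> (That would go in your conf/nutch-site.xml so it overrides the default;
> a plugin is only loaded if it also matches plugin.includes, so make sure
> that regex doesn't conflict.)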
>
> Another workaround would be to use the whole-web method, and break your
> fetchlists into smaller chunks that take less than 12 hours to fetch,
> using, e.g., the -numFetchers parameter when generating fetchlists. But
> this is substantially more complicated if you're currently using the
> crawl command.
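>
> For example (assuming the whole-web db/segments layout from the
> tutorial), something like:
>
>   bin/nutch generate db segments -numFetchers 10
>
> would split the next round into ten fetchlists, each of which can be
> fetched as a separate, shorter run.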
>
> Doug
--
Bill Goffe                         [EMAIL PROTECTED]
Department of Economics            voice: (315) 312-3444
SUNY Oswego                        fax:   (315) 312-5444
416 Mahar Hall                     <wuecon.wustl.edu/~goffe>
Oswego, NY 13126