I tried a number of things to deal with the hung crawl that I reported
below (yes, some time back). I wasn't able to get any output by sending
SIGQUIT, though I tried several different approaches. I was parsing PDF
files (there is a lot of useful info in them for my application).

FWIW, I later increased the maximum content size that would be parsed, on
the theory that a thread might hang when a file hit the relatively small
default http.content.limit (65536 bytes). I was getting a number of files
that were too small to parse, presumably because they were cut off at
that limit.
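
As a rough sketch, raising http.content.limit would look something like
the property override below, placed in the site configuration alongside
any other overrides. The value is only an illustration (it is not the one
used in the crawl described above), so pick one that suits your documents:

<!-- illustrative value (1 MiB); an assumption, not from the original report -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>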

I ended up using the whole-web method, driven by a Perl script I wrote to
make running it more manageable. (I'd be glad to share it, but it is a
real hack.) This seems to work fine for me.

      - Bill


Doug wrote:

> Bill Goffe wrote:
> >I'm having the same problem on a Xeon machine with a gig of RAM running
> >Debian and Sun's Java. I'm using the intranet method with about 10,000
> >sites. After 12-24 hours of running, it really slows down (can be several
> >minutes between fetches). It finally dies with
> >    SEVERE error writing output:java.lang.OutOfMemoryError
> 
> When it slows, can you get a stack dump by sending SIGQUIT?  My hunch is 
> that, slowly, more and more fetcher threads are somehow hung.  A stack 
> dump of all threads would show whether this is the case, and where they 
> are hung.  Are the pdf and word parsers enabled, as they are by default?
> These have been known to hang before.
> 
> Perhaps we should, by default, disable all parsers but html and all 
> protocols but http, in order to make the default configuration more 
> bulletproof.  Then folks can selectively enable more plugins and we'll 
> have a better idea of which are problematic.  What do folks think?
> 
> So one workaround would be to disable these, using something like:
> 
> <property>
>   <name>plugin.excludes</name>
>   <value>(protocol-(?!http).*)|(parse-(?!html).*)</value>
> </property>
> 
> Another workaround would be to use the whole-web method, and break your 
> fetchlists into smaller chunks that take less than 12 hours to fetch, 
> using, e.g., the -numFetchers parameter when generating fetchlists.  But 
> this is substantially more complicated if you're currently using the 
> crawl command.
> 
> Doug
> 
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers

-- 
         *------------------------------------------------------*
         | Bill Goffe                 [EMAIL PROTECTED]          |
         | Department of Economics    voice: (315) 312-3444     |
         | SUNY Oswego                fax:   (315) 312-5444     |
         | 416 Mahar Hall             <wuecon.wustl.edu/~goffe> |          
         | Oswego, NY  13126                                    |
*--------*------------------------------------------------------*-----------*
| "It's a scholarly activity that has nothing to do with his professional   |
|  activity."                                                               |
|    -- Samuel Landau, the lawyer for Roger Shepherd, who used to be a      |
|       professor at New School University until he left after admitting he |
|       plagiarized part of a book. Landau argues that his book was         |
|       "totally unrelated" to his university work and he is suing to get   |
|       his job back. "Professor Who Acknowledged Plagiarism Accuses New    |
|       School U. of Firing Him Unfairly," Chronicle of Higher Education,   |
|       November 17, 2004.                                                  |
*---------------------------------------------------------------------------*


