I tried a number of things to deal with the hung crawl that I reported
below (yes, some time back). I wasn't able to get any output by sending a
SIGQUIT (I tried several different approaches). I was parsing PDF files
(there is a lot of useful info in them for my application).
In some cases you may not be able to get a thread dump from the JVM. One situation is when the process is swapping heavily and it's almost permanently in disk I/O wait state. Another situation is when a JVM executes a blocking syscall, which for some reason takes a long time to complete. There is also a (remote) possibility that this particular code triggers a JVM bug, but in that case you should see the JVM crashing...
You should be able to see the process state using tools like 'ps' or 'top'. If it's in the RUN state, then you should be able to get something...
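
As a rough illustration of the check described above, here is a minimal Python sketch that asks 'ps' for the state of one process and translates the first letter of the state code. It assumes a Unix-like system where 'ps -o state= -p PID' is available; the function names are made up for this example and are not part of Nutch.

```python
import subprocess

# Common first-letter process state codes as reported by ps on Linux.
STATE_MEANINGS = {
    "R": "running or runnable",
    "S": "interruptible sleep (waiting for an event)",
    "D": "uninterruptible sleep (usually disk I/O)",
    "T": "stopped by a job-control signal or a tracer",
    "Z": "zombie (terminated, not yet reaped)",
}


def describe_state(code: str) -> str:
    """Translate a ps state string such as 'Sl' or 'D+' into words."""
    first = code.strip()[:1]
    return STATE_MEANINGS.get(first, "unknown state")


def process_state(pid: int) -> str:
    """Ask ps for the state column of one process ('' if no such process)."""
    result = subprocess.run(
        ["ps", "-o", "state=", "-p", str(pid)],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Calling describe_state(process_state(pid)) with the JVM's pid would then show whether the process is mostly in 'D' (consistent with the heavy swapping or blocked syscall cases) or in 'R'.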
FWIW, I later increased the maximum content size that would be fetched and parsed, on the theory that a thread was perhaps hanging when a file hit the relatively small default value of http.content.limit (65536). I was getting a number of files that were too small to parse.
Most if not all PDF parsers need complete files. Any file that is truncated will be skipped.
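
To make the truncation point concrete, here is a small Python sketch of one way to guess whether fetched PDF bytes are complete. It relies on the fact that a well-formed PDF ends with a '%%EOF' marker; the 1024-byte tail window, the function name, and the status strings are assumptions chosen for this example, and this is not how the Nutch PDF parser itself decides.

```python
def pdf_status(data: bytes, content_limit: int = 65536) -> str:
    """Heuristically classify fetched PDF bytes.

    content_limit mirrors the default http.content.limit mentioned above.
    """
    # A complete PDF carries its end-of-file marker near the very end.
    if b"%%EOF" in data[-1024:]:
        return "complete"
    # No marker and at least content_limit bytes: probably cut by the fetcher.
    if len(data) >= content_limit:
        return "truncated at content limit"
    # No marker and shorter than the limit: incomplete for some other reason.
    return "incomplete"
```

A file reported as "truncated at content limit" is the kind that would be skipped by a parser that needs the whole document.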
I ended up using the whole-web method via a Perl script I wrote to make running it more manageable. (I'd be glad to share it, but it is a real hack.) This seems to work fine for me.
If you are sure that it's the PDF parser that's causing you trouble, you could try fetching with the -noParsing option, and then running the ParseSegment tool. But instead of using the default parse-pdf plugin you could use the parse-ext plugin together with e.g. pdftohtml or xpdf.
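
The appeal of an external parser is that a stuck conversion can be killed from outside instead of hanging a thread. The following Python sketch shows that general idea only; it is not the parse-ext plugin, and the function name and the 30-second default are assumptions for illustration.

```python
import subprocess
from typing import Optional, Sequence


def run_external_parser(
    command: Sequence[str], timeout_seconds: float = 30.0
) -> Optional[str]:
    """Run an external converter and return its stdout, or None on failure.

    The child process is killed if it exceeds timeout_seconds, so one bad
    document cannot stall the caller indefinitely.
    """
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            encoding="utf-8",
            errors="replace",  # tolerate odd bytes in the converter output
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None  # converter hung; subprocess.run has already killed it
    except FileNotFoundError:
        return None  # converter is not installed or not on PATH
    if result.returncode != 0:
        return None  # converter reported an error
    return result.stdout
```

For example, assuming the pdftotext tool from xpdf is installed, run_external_parser(["pdftotext", "some.pdf", "-"]) would return the extracted text, where "some.pdf" is a placeholder file name.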
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
