I'm seeing a problem where pages are fetched but not indexed. I've pared the crawl down to a very small example using the plain Nutch crawl tool. It fails consistently on the same URL (among others): http://new.marketwire.com/2.0/rel.jsp?id=710360. The URL redirects, so a -depth option is required on the nutch command, and I have modified the crawl-urlfilter.txt file to allow this URL. It is definitely fetched, as is the page to which Nutch is redirected.
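For reference, the line I added to conf/crawl-urlfilter.txt follows the domain pattern in the stock template (the exact regex below is my best guess at the right form, not something I've verified against the filter's behavior):

```
# accept anything on *.marketwire.com, so the redirect target is fetchable too
+^http://([a-z0-9]*\.)*marketwire.com/
```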
I've used "nutch readseg -dump ...", and as far as I can tell the proper document was fetched: the raw content is there, the parsed content is there, and the crawl datum looks OK too (to my naive eye). What is going on here? Is there any further debugging I can turn on to try to track this down? Thanks, C
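P.S. The only logging knob I know of is conf/log4j.properties; I'm guessing something like the following would raise the verbosity of the indexing code, but I'm not sure it's the right logger name or the right place to look:

```
# hypothetical: turn on DEBUG output for the whole Nutch package tree
log4j.logger.org.apache.nutch=DEBUG
```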
