I'm seeing a problem where pages are fetched but not indexed. I've pared the crawl down to a very small example using the plain Nutch crawl tool. It fails consistently on the same URL (among others): http://new.marketwire.com/2.0/rel.jsp?id=710360. The URL redirects, so a -depth option is required on the nutch command, and I have modified the crawl-urlfilter.txt file to allow this URL. It is definitely fetched, as is the page to which Nutch is redirected.
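For reference, the line I added to conf/crawl-urlfilter.txt follows the domain pattern in the stock template (the exact regex below is my best guess at the right form, not something I've verified against the filter's behavior):

```
# accept anything on *.marketwire.com, so the redirect target is fetchable too
+^http://([a-z0-9]*\.)*marketwire.com/
```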
I've used "nutch readseg -dump ...", and as far as I can tell the proper document was fetched: the raw content is there, the parsed content is there, and the crawl datum looks OK too (to my naive eye). What is going on here? Is there any further debugging I can turn on to try to track this down? Thanks, C
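P.S. The only logging knob I know of is conf/log4j.properties; I'm guessing something like the following would raise the verbosity of the indexing code, but I'm not sure it's the right logger name or the right place to look:

```
# hypothetical: turn on DEBUG output for the whole Nutch package tree
log4j.logger.org.apache.nutch=DEBUG
```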
