Look in nutch-default.xml.

The properties db.max.outlinks.per.page and http.content.limit may need larger values.
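
As a rough illustration: overrides normally go in conf/nutch-site.xml rather than in nutch-default.xml itself. The fragment below is only a sketch. It assumes a Nutch version in which a negative value switches each limit off, so check the property descriptions in your own nutch-default.xml first. The values are placeholders, not recommendations.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: a negative value means every outlink on a page is processed. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <!-- Assumption: a negative value means fetched content is never truncated. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

If unlimited values are a concern for crawl size or memory, a large positive number in place of -1 should work the same way.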

Cheers,
Carl.


Jeff Maki wrote:
Hello everyone,

I'm not going to post my config files, so as not to spam you all, but I
have a general question: I'm trying to index the pages of a website
(obviously), and I've created a special page with a link to all the
pages I want to index. I then pointed nutch to this special link page.
I set max_outlinks appropriately, and I do see all the page URLs I
expect go by in the log for the fetching stage.

When nutch gets to indexing, however, not all the documents appear in
the log--it looks as if not all of the fetched pages are being
indexed. Searching for terms I know are on the missing pages also
turns up nothing--they're not in the index!?

Can anybody tell me what factors affect the indexing stage? I want to
have nutch index *all* documents it fetches. How can I do this?

Any tips/ideas/things to configure?

Thanks in advance,

-Jeff
