Hi,

There is also a setting for the maximum number of bytes to fetch per page.
If your main index page is large, the content may simply be getting cut off
at that limit, so any links past the cutoff are never seen.  The property
has "content" in the name, I believe, so look for that in nutch-default.xml.
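
For illustration, here is a sketch of what the override might look like.
It assumes the property in question is http.content.limit (which, as far
as I recall, defaults to 65536 bytes, with a negative value disabling
truncation); please check the name and default against your own
nutch-default.xml.  Overrides normally go in conf/nutch-site.xml rather
than in nutch-default.xml itself.

```xml
<!-- conf/nutch-site.xml: assumed override, verify against nutch-default.xml -->
<configuration>
  <property>
    <!-- Assumed property name; look for "content" in nutch-default.xml -->
    <name>http.content.limit</name>
    <!-- Negative value is assumed to mean "do not truncate" -->
    <value>-1</value>
  </property>
</configuration>
```

If that is the cause, refetching the index page after the change should
pick up the links beyond the A's and B's.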

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: wynz lo <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Tuesday, June 17, 2008 6:18:26 PM
> Subject: problems with link limits
> 
> Hi everyone,
> 
> I've spent hours searching around trying to solve this and it's starting to
> drive me a little nuts. You all might be my last hope in staying out of a
> padded room.
> 
> I have one small site I'm trying to crawl. The site is a handful of
> different JSPs that are essentially templates for people's profiles. The
> different profile pages are generated by passing a uri parameter. Nutch is
> actually doing a fine job of crawling the smaller pages, but the main index
> is causing trouble.
> 
> The main index has a single list of 772 links in alphabetical order like
> this:
> 
> http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2111&name=Adams+Rebecca
> http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual4421&name=Decker+Alice
> http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual5602&name=Lincoln+Robert
> http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2452&name=Small+Harry
> http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2431&name=Whittaker+Bob
> ...and so on...
> 
> Nutch fetches about the first 90-110 (usually all the A's and B's) but
> that's it. I got real excited when I found the db.max.outlinks.per.page
> setting was at a default of 100. However, changing that to -1 or a high
> value doesn't fix the problem. When I change it to a small value, like 15,
> the fetcher grabs even fewer links, so it is definitely working.
> 
> Any suggestions? Thanks so much.
> 
> Wynz
