Otis,

Thank you so much, that fixed the problem immediately! The default for http.content.limit was 64 KB and my index page was over 400 KB (those long URIs really beef up the document size).
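
For anyone who lands on this thread with the same symptom, the override looks roughly like the sketch below. It assumes the usual layout where local settings go in conf/nutch-site.xml rather than editing nutch-default.xml directly. The value 524288 is only an illustration, picked because it is larger than my ~400 KB page; it is not a recommended setting.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of the defaults in nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Assumed example value: 524288 bytes (512 KB), enough to cover a ~400 KB page.
         The default of 65536 bytes (64 KB) truncated the page partway through the link list. -->
    <value>524288</value>
  </property>
</configuration>
```

If I read the property description in nutch-default.xml correctly, a negative value turns truncation off entirely, but check the description in your own version before relying on that.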
-Wynz

On Wed, Jun 18, 2008 at 12:45 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> Hi,
>
> There is also a setting for the maximal number of bytes to fetch. If your
> main index page is large, maybe it's just getting cut off because of that.
> The property has "content" in the name, I believe, so look for that in
> nutch-default.xml.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
> > From: wynz lo <[EMAIL PROTECTED]>
> > To: [email protected]
> > Sent: Tuesday, June 17, 2008 6:18:26 PM
> > Subject: problems with link limits
> >
> > Hi everyone,
> >
> > I've spent hours searching around trying to solve this and it's starting to
> > drive me a little nuts. You all might be my last hope in staying out of a
> > padded room.
> >
> > I have one small site I'm trying to crawl. The site is a handful of
> > different JSPs that are essentially templates for people's profiles. The
> > different profile pages are generated by passing a uri parameter. Nutch is
> > actually doing a fine job of crawling the smaller pages, but the main index
> > is causing trouble.
> >
> > The main index has a single list of 772 links in alphabetical order like
> > this:
> >
> > http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2111&name=Adams+Rebecca
> > http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual4421&name=Decker+Alice
> > http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual5602&name=Lincoln+Robert
> > http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2452&name=Small+Harry
> > http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2431&name=Whittaker+Bob
> > ...and so on...
> >
> > Nutch fetches about the first 90-110 (usually all the A's and B's) but
> > that's it. I got real excited when I found the db.max.outlinks.per.page
> > setting was at a default of 100. However, changing that to -1 or a high
> > value doesn't fix the problem. When I change it to a small value, like 15,
> > the fetcher grabs even fewer links, so it is definitely working.
> >
> > Any suggestions? Thanks so much.
> >
> > Wynz
