Hi everyone, I've spent hours searching around trying to solve this and it's starting to drive me a little nuts. You all might be my last hope in staying out of a padded room.

I have one small site I'm trying to crawl. The site is a handful of different JSPs that are essentially templates for people's profiles. The different profile pages are generated by passing a uri parameter. Nutch is actually doing a fine job of crawling the smaller pages, but the main index is causing trouble.

The main index has a single list of 772 links in alphabetical order, like this:

http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2111&name=Adams+Rebecca
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual4421&name=Decker+Alice
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual5602&name=Lincoln+Robert
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2452&name=Small+Harry
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2431&name=Whittaker+Bob
...and so on...

Nutch fetches roughly the first 90-110 links (usually all the A's and B's), but that's it.

I got really excited when I found that the db.max.outlinks.per.page setting has a default of 100. However, changing it to -1 or to a high value doesn't fix the problem. When I change it to a small value, like 15, the fetcher grabs even fewer links, so the setting is definitely taking effect.

Any suggestions?

Thanks so much.
Wynz
