I think that in crawldb and linkdb, the URL without the query string serves as the primary key (i.e. a URL is judged unique by looking only at the part before the "?"). If so, after your first page is fetched and you run updatedb, Nutch would not think it needs to fetch "details.pa?id=456", because it already sees an entry for "details.pa" in the database.
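To make that guess concrete, here is a minimal Python sketch. It is not Nutch code and says nothing about how Nutch actually stores its keys. The names key_without_query and simulate_updatedb are made up for illustration, and the two URLs are assumed examples built from the pattern in your filter. It only shows what would happen if the key dropped the query string, compared with keying on the full URL.

```python
from urllib.parse import urlsplit, urlunsplit


def key_without_query(url: str) -> str:
    """Hypothetical database key: the URL with query string and fragment dropped."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def simulate_updatedb(db: set, urls, key=key_without_query):
    """Toy stand-in for an update step: return the URLs whose key is not yet
    in db, and record those keys. This is an illustration, not Nutch's logic."""
    scheduled = []
    for url in urls:
        k = key(url)
        if k not in db:
            db.add(k)
            scheduled.append(url)
    return scheduled


# Assumed example URLs following the details.pa?id= pattern from the filter.
first = "http://www.visitpa.com/visitpa/details.pa?id=123"
second = "http://www.visitpa.com/visitpa/details.pa?id=456"

# Under the guess: both URLs share one key, so only the first is scheduled.
scheduled = simulate_updatedb(set(), [first, second])

# Keying on the full URL instead: both are treated as distinct pages.
scheduled_full = simulate_updatedb(set(), [first, second], key=lambda u: u)

print(scheduled)
print(scheduled_full)
```

If the guess is right, the behaviour you see would match the first case. If Nutch keys on the full URL, as in the second case, the cause must lie elsewhere.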
I am also new to Nutch, so I don't know if there is a solution to your problem.

Yoni

On Wed, 2006-12-06 at 10:05 -0500, spamsucks wrote:
> My subject is a pretty good summary. I see the first "details.pa?id=123" in
> my results, but can't search or find any "details.pa?id=456" links that are
> in that 1st page that was a hit.
>
> Backgrounder:
> I have a site that includes a lot of dynamic pages. I edited the
> crawl-urlfilter.txt and added the following regex and did
> a crawl (bin/nutch crawl urls -dir crawl -depth 30 -topN 30000):
>
> +^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id=
>
> Now the search will return hits on the dynamic details page. For example,
> here is a search that returns hits on my dynamic pages.
> http://prhodes.r-effects.com/nutch/search.jsp?query=sunnyledge&hitsPerPage=10&lang=en
>
> If you look at the details.pa page that nutch had a hit on, it contains
> several links of the same format ( details.pa )
> My problem is that these other detail links are not being crawled/indexed.
>
> I set the depth to "30" so that should not be a limiting factor. I also set
> a "topN" of 30000, because we have around 16K details.pa pages
>
> Any clues on how to proceed and figure out what I need to do to get Nutch to
> crawl these missing "details.pa" links

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general