Hi Philip,

You have www.visitpa.com in your crawl-urlfilter regexp. If some of your other pages have <something else>.visitpa.com as the host name, they will be filtered out. You may want to have just (....)visitpa.com in the regexp in that case. Just a thought.
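Nitin's point can be checked with a quick regex experiment. This is a sketch in Python for illustration only; Nutch's actual RegexURLFilter is Java, and both patterns below (and the events.visitpa.com host) are assumptions, not taken from Phillip's config:

```python
import re

# A pattern pinned to the "www" host matches only that one subdomain:
narrow = re.compile(r"^http://www\.visitpa\.com/")

# A pattern with an optional subdomain group matches any visitpa.com host:
broad = re.compile(r"^http://([a-z0-9]+\.)*visitpa\.com/")

for url in [
    "http://www.visitpa.com/visitpa/details.pa?id=65851",
    "http://events.visitpa.com/visitpa/details.pa?id=1",  # hypothetical host
]:
    print(url, "narrow:", bool(narrow.match(url)),
          "broad:", bool(broad.match(url)))
```

The narrow pattern silently drops any non-www host, which is exactly the filtering Nitin is warning about.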
Nitin Borwankar
http://tagschema.com

spamsucks wrote:
> Hi Yoni,
>
> That was a good thought; however, according to the logging output of
> the crawl, I see the following:
>
> fetching http://www.visitpa.com/visitpa/details.pa?id=65851
> fetching http://www.visitpa.com/visitpa/details.pa?id=246139
> fetching http://www.visitpa.com/visitpa/details.pa?id=8427
>
> There are at least 100+ of these (too many to count), so it appears
> that Nutch is fetching these URLs even though the URL is not unique
> without the query string.
>
> Building on your thought, perhaps the other "details.pa" links are
> being discovered from other pages that are indexed, while only one
> "details.pa" page is actually crawled. That could be what is
> happening here, and your point is correct.
>
> I appreciate your response!
> Phillip
>
> ----- Original Message -----
> From: "Yoni Amir" <[EMAIL PROTECTED]>
> To: <nutch-user@lucene.apache.org>
> Sent: Wednesday, December 06, 2006 10:47 AM
> Subject: Re: page1 is crawled, but not pages in page1
>
> I think that in the crawldb and linkdb, the URL without the query
> string serves as the primary key (i.e. a URL is treated as unique
> just by looking at the URL, without the query string). Thus, after
> your first page is fetched and you run updatedb, Nutch doesn't think
> it needs to fetch it again, because it already sees an entry for it
> in the database.
>
> I am also new to Nutch, so I don't know if there is a solution to
> your problem.
>
> Yoni
>
> On Wed, 2006-12-06 at 10:05 -0500, spamsucks wrote:
>> My subject is a pretty good summary. I see the first
>> "details.pa?id=123" in my results, but can't search for or find any
>> of the "details.pa?id=456" links that are in that first page that
>> was a hit.
>>
>> Background:
>> I have a site that includes a lot of dynamic pages. I edited
>> crawl-urlfilter.txt, added the following regex, and ran a crawl
>> (bin/nutch crawl urls -dir crawl -depth 30 -topN 30000):
>>
>> +^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id=
>>
>> Now the search will return hits on the dynamic details pages. For
>> example, here is a search that returns hits on my dynamic pages:
>> http://prhodes.r-effects.com/nutch/search.jsp?query=sunnyledge&hitsPerPage=10&lang=en
>>
>> If you look at the details.pa page that Nutch had a hit on, it
>> contains several links of the same format (details.pa).
>> My problem is that these other detail links are not being
>> crawled/indexed.
>>
>> I set the depth to 30, so that should not be a limiting factor. I
>> also set a topN of 30000, because we have around 16K details.pa
>> pages.
>>
>> Any clues on how to proceed and figure out what I need to do to get
>> Nutch to crawl these missing "details.pa" links?

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
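Putting Nitin's host suggestion together with the filter Phillip posted, a hedged sketch of what the crawl-urlfilter.txt entry could look like (the host and path come from Phillip's mail; escaping the literal dots and the note on rule ordering are assumptions based on Nutch's first-match-wins RegexURLFilter):

```
# Accept details.pa URLs on any visitpa.com host. Dots are escaped so
# "." matches a literal dot rather than any character.
+^http://([a-z0-9]+\.)*visitpa\.com/visitpa/details\.pa\?id=

# The stock crawl-urlfilter.txt rejects URLs containing query strings
# with a rule like "-[?*!@=]". Since the first matching rule wins, the
# accept rule above must appear before it (or that rule must be removed).
```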