The subject line sums it up: I see the first "details.pa?id=123" page in my search results, but I can't find any of the "details.pa?id=456" links that appear on that first page.

Background:
I have a site with a lot of dynamic pages. I edited crawl-urlfilter.txt, added the following regex, and ran
a crawl (bin/nutch crawl urls -dir crawl -depth 30 -topN 30000):

+^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id=
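
To rule out the pattern itself, here is a minimal Python sketch (not Nutch code) that applies the rule above to a few URLs. It assumes the filter does an unanchored regex search and that a URL matching no "+" rule is rejected. The helper name accepts() is made up, the ids are the placeholder ones used in this post, and the https and non-www variants are hypothetical forms the links might take.

```python
import re

# The accept rule from crawl-urlfilter.txt, without the leading "+".
ACCEPT_RULE = r"^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id="


def accepts(url: str) -> bool:
    """Hypothetical helper: True if the URL matches the accept rule.

    Assumption: the filter searches for the pattern anywhere in the URL
    (the pattern anchors itself with "^"), like re.search does.
    """
    return re.search(ACCEPT_RULE, url) is not None


if __name__ == "__main__":
    samples = [
        "http://www.visitpa.com/visitpa/details.pa?id=123",
        "http://www.visitpa.com/visitpa/details.pa?id=456",
        # Hypothetical variants: links that differ in scheme or host.
        "https://www.visitpa.com/visitpa/details.pa?id=456",
        "http://visitpa.com/visitpa/details.pa?id=456",
    ]
    for url in samples:
        print(url, "->", "accept" if accepts(url) else "reject")
```

If both id=123 and id=456 are accepted here, the pattern itself is probably not what excludes the links, and the other rules in the file, as well as the exact form of the links on the page (scheme, host), are worth checking.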

Search now returns hits on the dynamic details pages. For example,
this query returns a hit on one of them:
http://prhodes.r-effects.com/nutch/search.jsp?query=sunnyledge&hitsPerPage=10&lang=en

If you look at the details.pa page that Nutch returned as a hit, it contains several links of the same format (details.pa).
My problem is that these other details.pa links are not being crawled or indexed.

I set the depth to 30, so that should not be a limiting factor. I also set topN to 30000, because we have around 16K details.pa pages.

Any clues on how to figure out what I need to do to get Nutch to crawl these missing "details.pa" links?




