My subject is a pretty good summary.  The first "details.pa?id=123" page shows
up in my results, but I can't search for or find any of the
"details.pa?id=456" pages that are linked from that first page.

Background:
I have a site that includes a lot of dynamic pages.  I edited
crawl-urlfilter.txt to add the following regex, then ran a crawl
(bin/nutch crawl urls -dir crawl -depth 30 -topN 30000):

+^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id=
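
To rule out the pattern itself, here is a minimal sketch that checks it
against a few URLs outside of Nutch.  It is plain Python, not part of Nutch,
and it assumes the filter applies the pattern as a prefix search and that
Python's regex engine treats this particular pattern the same way Java's
does.  The sample URLs and the ids 123 and 456 are placeholders, and
is_included is a hypothetical helper name.

```python
import re

# Pattern copied from the crawl-urlfilter.txt rule, without the leading "+"
# (in the filter file, "+" only marks the rule as an include rule).
PATTERN = re.compile(
    r"^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id="
)


def is_included(url: str) -> bool:
    """Return True if the URL matches the include pattern from its start."""
    return PATTERN.search(url) is not None


# Placeholder URLs: two detail pages and one page the rule should not match.
samples = [
    "http://www.visitpa.com/visitpa/details.pa?id=123",
    "http://www.visitpa.com/visitpa/details.pa?id=456",
    "http://www.visitpa.com/visitpa/other.pa?id=1",
]

for url in samples:
    print(url, is_included(url))
```

If both details.pa URLs are reported as included, the pattern on its own is
probably not what keeps the linked pages out of the crawl.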

Now the search returns hits on the dynamic details pages.  For example,
here is a search that returns a hit on one of them:
http://prhodes.r-effects.com/nutch/search.jsp?query=sunnyledge&hitsPerPage=10&lang=en

If you look at the details.pa page that Nutch had a hit on, it contains
several links of the same format (details.pa).
My problem is that these other details.pa links are not being crawled or
indexed.

I set the depth to 30, so that should not be a limiting factor.  I also set
topN to 30000, because we have around 16K details.pa pages.

Any clues on how to proceed and figure out what I need to do to get Nutch to
crawl these missing "details.pa" links?






_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
