Amazon.com, as a common example, has pages with relative links that do
not include the www.amazon.com prefix.  The prefix is prepended when
the link is followed, and the resulting absolute URL resolves
successfully.

I have observed that Nutch can crawl these pages if the
crawl-urlfilter.txt patterns are loosened so that amazon.com is no
longer required in the URL, but then the crawl wanders outside the
amazon.com site.
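For reference, the relevant lines in my crawl-urlfilter.txt look
roughly like the following (reconstructed from memory, so treat the
exact patterns as an approximation of my configuration rather than a
verified copy):

    # accept only URLs on amazon.com hosts
    +^http://([a-z0-9-]+\.)*amazon\.com/

    # the loosened variant that lets the crawl escape the site:
    # +^http://

    # reject everything else
    -.

With the first pattern in place, the relative links are skipped; with
the loosened one, the crawl leaves amazon.com.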

Does anyone have a suggestion for a crawl-urlfilter pattern that
achieves my goal, or for another mechanism that does?  Or perhaps I am
misunderstanding something, in which case an explanation would be
appreciated.

Thank you, in advance.
Jim Van Sciver
