Amazon.com, to take a common example, has pages whose links do not include the www.amazon.com prefix. The prefix is prepended automatically when the link is followed, and the resulting composite link resolves successfully. You can perform the same composition by hand: prepend www.amazon.com to such a link and it resolves to an Amazon page.
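To make the composition concrete, here is a minimal sketch in Python. It is not Nutch code; it only illustrates the idea using the standard library. The page URL, the link path, and the helper names resolve_link and is_on_site are hypothetical, chosen for illustration.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(page_url, href):
    """Compose an absolute URL from the page's URL and a link found on it."""
    return urljoin(page_url, href)


def is_on_site(url, domain="amazon.com"):
    """Return True if the URL's host is the domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


# Hypothetical page and link, for illustration only.
page_url = "http://www.amazon.com/some/page.html"
href = "/other/item.html"  # link without the www.amazon.com prefix

absolute = resolve_link(page_url, href)
print(absolute)
print(is_on_site(absolute))
print(is_on_site("http://www.example.com/"))
```

In this sketch the site check is applied to the composed URL rather than to the link text as it appears in the page, so a prefix-less link still counts as on-site once it has been resolved.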
I think I am observing that Nutch can crawl these pages if the crawl-urlfilter.txt patterns are weakened so that they no longer require amazon.com in the URL, but then the crawl begins to leave the amazon.com site.

So my question: does anyone have a suggestion for a crawl-urlfilter pattern that traverses only a site like amazon.com without leaving it, or is there another mechanism for doing this? If I am misunderstanding the situation, an explanation of the misunderstanding would be appreciated.

Thank you,
Jim Van Sciver
