Amazon.com, as a common example, has pages with links that omit the www.amazon.com prefix (i.e., relative links). The prefix is prepended automatically when the link is followed, and the resulting absolute URL resolves correctly.
I believe I am seeing that Nutch can crawl these pages if the crawl-urlfilter.txt patterns are weakened so that amazon.com is no longer required in the URL, but then the crawl wanders off the amazon.com site. Does anyone have a suggestion for a crawl-urlfilter pattern that achieves my goal, or for another mechanism that does? Or perhaps I am misunderstanding something, in which case an explanation would be appreciated.

Thank you in advance,
Jim Van Sciver
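For reference, the kind of restriction I am after might look something like the sketch below. This assumes the usual crawl-urlfilter.txt format, where each line is a Java regex prefixed with + (accept) or - (reject) and the first matching rule wins; the pattern itself is only my guess at what should work, not something I have verified:

```
# accept any URL on amazon.com or one of its subdomains (sketch)
+^http://([a-z0-9-]+\.)*amazon\.com/
# reject everything else
-.
```

My understanding is that Nutch resolves relative links against the page URL before applying these filters, so if that is correct, links without the www.amazon.com prefix should still match the accept rule once resolved.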
