Newbie question: crawling sites like amazon.com without leaving site

Jim Van Sciver Mon, 06 Oct 2008 13:57:32 -0700

Amazon.com, as a common example, has pages whose links do not
include the www.amazon.com prefix.  The prefix is automatically
prepended when the link is followed, and the resulting composite
link resolves successfully.  You can perform the same composition manually,
prepending www.amazon.com to such a link, and arrive at an Amazon page.
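
To make the composition described above concrete, here is a minimal Python
sketch of how a relative link is resolved against the URL of the page it
appears on.  The page URL and the relative links are hypothetical, chosen
only for illustration; nothing here is taken from Nutch itself.

```python
from urllib.parse import urljoin

# Hypothetical page URL and relative links, for illustration only.
page_url = "http://www.amazon.com/books/index.html"
relative_links = ["/gp/help", "detail.html", "../music/"]

def resolve_links(base, links):
    """Resolve each link against the URL of the page it was found on."""
    return [urljoin(base, link) for link in links]

resolved = resolve_links(page_url, relative_links)
for url in resolved:
    print(url)
```

Each resolved link comes out as an absolute URL on the same host as the page,
which is the same result as prepending the site prefix by hand.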


I think I am observing that Nutch can crawl these pages if the
crawl-urlfilter.txt patterns are weakened so that they no longer require
amazon.com in the URL, but then the crawl begins leaving the amazon.com site.

So my question: does anyone have a suggestion for a crawl-urlfilter pattern that
achieves the goal of traversing only a site like amazon.com without leaving the
site, or is there another mechanism for doing this?  If I am misunderstanding
the situation, then an explanation of the misunderstanding would be appreciated.
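
For reference, the kind of host-restricted pattern in question can be sketched
in Python as below.  This is only an illustration under assumptions: the
regular expression is modeled on the "+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/"
style of line commonly shown for crawl-urlfilter.txt, the helper name
stays_on_site is hypothetical, and it assumes the filter is applied to
absolute URLs after relative links have been resolved, which I have not
confirmed for Nutch.

```python
import re

# Assumed pattern, modeled on a crawl-urlfilter.txt accept line such as
#   +^http://([a-z0-9]*\.)*amazon.com/
# with the dots escaped so they match literally.
SITE_PATTERN = re.compile(r"^http://([a-z0-9]*\.)*amazon\.com/")

def stays_on_site(url):
    """Return True if the absolute URL is on amazon.com or a subdomain."""
    return SITE_PATTERN.match(url) is not None

# Hypothetical URLs, for illustration only.
candidates = [
    "http://www.amazon.com/gp/help",
    "http://amazon.com/",
    "http://www.example.com/amazon.com/",
    "/gp/help",
]
accepted = [url for url in candidates if stays_on_site(url)]
print(accepted)
```

Under these assumptions the unresolved relative link "/gp/help" is rejected,
while the same link resolved to an absolute amazon.com URL is accepted.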

Thank you,
Jim Van Sciver
