[Nutch-general] Nutch crawler ignores sites without default page

termopro Tue, 03 Oct 2006 07:10:32 -0700

I'm doing an intranet crawl. The URLs in my URL files include the
filenames, e.g.


http://somedomain.com/page1.htm
http://otherdomain.com/page2.htm

Neither site has an index.htm page. After crawling, when I use the
CrawlDbReader tool to view the list of crawled pages, one of the
pages is marked as fetched and the other is marked as gone.

I guess this may depend on the status code the server returns when
connecting to http://somedomain.com or http://otherdomain.com,
i.e. whether it is 403 or 404.
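
One way to test this guess outside Nutch is to request both the site
root and the page itself and compare the status codes. The following is
only a minimal sketch in Python: the helper names (root_url,
http_status, compare) are made up for illustration, and it does not
reproduce what the Nutch fetcher does.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def root_url(page_url):
    """Return the site root (scheme + host + '/') for a page URL."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))


def http_status(url, timeout=10):
    """Return the HTTP status code for a GET on url (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses such as 403 or 404;
        # the code is still available on the exception.
        return err.code


def compare(page_urls):
    """Print the status of each page next to the status of its site root."""
    for page in page_urls:
        root = root_url(page)
        print(page, http_status(page), "|", root, http_status(root))


# Example call (performs real network requests):
# compare(["http://somedomain.com/page1.htm",
#          "http://otherdomain.com/page2.htm"])
```

If the two roots answer differently (say 403 on one and 404 on the
other) while both pages answer 200, that would at least be consistent
with the guess above.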

But shouldn't Nutch just ignore the main page and request only
page1.htm or page2.htm?



