I'm doing an intranet crawl. The URLs in the URLs file include the filenames, e.g.
http://somedomain.com/page1.htm
http://otherdomain.com/page2.htm

Neither site has an index.htm page. After crawling, when I use the CrawlDbReader tool to view the list of crawled pages, one of the pages is shown as fetched and the other is marked as gone. I guess this may depend on the status the server returns when connecting to http://somedomain.com or http://otherdomain.com, i.e. whether it is 403 or 404. But shouldn't Nutch just ignore the main page and request only page1.htm or page2.htm?

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
