Hello,

I am using Nutch 0.9 to index HTML files that reside on the same server (Server1) as Nutch itself. I therefore use the protocol-file plugin for fetching and subsequent indexing, and this works well.
My problem is this: I place the HTML files in the public folder of my Apache server (Server1, the same server used for crawling the local files) so that they can be accessed at http://mysite.com/page1.html. When I run a query on the Nutch JSP search page, the search results show a URL that is a local filesystem path, such as file:/home/htmlfiles/page1.html.

Is it possible to give Nutch the local filesystem path (/home/htmlfiles/page1.html) in the urls folder for crawling and indexing, but at query time have the Nutch JSP present the web URL (http://mysite.com/page1.html) to the search user? Would this involve some kind of URL normalization in Nutch?

One obvious solution is to have Nutch fetch the HTML pages from the mysite.com root folder over HTTP, so that the URL shows up correctly as mysite.com/page.html in the search results. I have tried this and it works well, but fetching is very slow, even though in both cases the files are on the same physical server. Crawling with the file protocol appears to be much faster, so ideally I would prefer to crawl the files from the local filesystem rather than from the website.

Thank you for any advice and help.

Regards,
taknev
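
P.S. To make the mapping I have in mind concrete, here is a minimal sketch in plain Python. It is not Nutch code and does not use any Nutch API; the function name `file_url_to_web_url` and the two constants are only illustrative, with the values taken from my example above. It just shows the rewrite I would like applied to each result URL before it is displayed.

```python
import posixpath
from urllib.parse import quote, unquote, urlsplit

# Assumed values, taken from the example paths in this post.
LOCAL_ROOT = "/home/htmlfiles"
WEB_ROOT = "http://mysite.com"


def file_url_to_web_url(url, local_root=LOCAL_ROOT, web_root=WEB_ROOT):
    """Rewrite a file: URL under local_root to the matching URL under web_root.

    URLs that are not file: URLs, or that point outside local_root,
    are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme != "file":
        return url

    # Decode percent-escapes and collapse "." and ".." segments.
    path = posixpath.normpath(unquote(parts.path))
    root = posixpath.normpath(local_root)

    # Only rewrite paths that really live under the local root.
    if path != root and not path.startswith(root + "/"):
        return url

    relative = path[len(root):].lstrip("/")
    return web_root.rstrip("/") + "/" + quote(relative)


if __name__ == "__main__":
    print(file_url_to_web_url("file:/home/htmlfiles/page1.html"))
```

Whether this rewrite belongs at index time or in the search JSP is exactly what I am unsure about.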
