Hello,

I am using nutch-0.9 to index HTML files that reside on the same server
(Server1) as Nutch itself. I am therefore using the protocol-file plugin for
fetching and subsequent indexing. This is working well.

My problem is this:

I place the HTML files in the public folder of my Apache server (Server1, the
same server used for crawling the local files) so that they can be accessed
at http://mysite.com/page1.html

When I run a search query on the Nutch JSP search page, the search results
have a URL that is a local filesystem path, such as
file:/home/htmlfiles/page1.html


Is it possible to give Nutch the local filesystem path
(/home/htmlfiles/page1.html) in the urls folder for crawling and indexing,
but at query time have the Nutch JSP present the web URL
(http://mysite.com/page1.html) to the search user?
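
To make the mapping concrete, here is a rough sketch of the rewrite I have in
mind. It is plain Java and does not use any Nutch API; the class name, the
method name and the two prefix constants are my own assumptions based on the
paths above. I do not know where in Nutch (the search JSP or some plugin) such
a step would belong.

```java
// Hypothetical helper, not part of the Nutch API: maps a file: URL stored in
// the index to the public web URL under which the same file is served.
class UrlRewriteSketch {

    // Assumed local document root and public base URL, taken from my setup.
    static final String LOCAL_PREFIX = "file:/home/htmlfiles/";
    static final String WEB_PREFIX = "http://mysite.com/";

    static String toWebUrl(String indexedUrl) {
        if (indexedUrl.startsWith(LOCAL_PREFIX)) {
            // Swap the local prefix for the web prefix, keep the relative path.
            return WEB_PREFIX + indexedUrl.substring(LOCAL_PREFIX.length());
        }
        // Leave anything outside the assumed document root unchanged.
        return indexedUrl;
    }

    public static void main(String[] args) {
        // prints http://mysite.com/page1.html
        System.out.println(toWebUrl("file:/home/htmlfiles/page1.html"));
    }
}
```

Files in subfolders would keep their relative path, so
file:/home/htmlfiles/docs/a.html would become http://mysite.com/docs/a.html.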

Would this involve some kind of URL normalization in Nutch?


Ideally I would prefer to crawl the files from the local filesystem rather
than have them crawled from the website root folder. I have noticed that
crawling is much faster (since the files are local to Nutch) than when I crawl
from mysite.com, even though in both cases the files are on the same physical
server.

One obvious solution is to have Nutch fetch the HTML pages from the
mysite.com root folder, so that the URL shows up correctly as
mysite.com/page.html when a search is performed. I have tried this and it
works well, but fetching is very slow and I would prefer to crawl the files
using the file protocol, which appears to be much faster.

Thank you for any advice and help.

Regards

taknev

-- 
View this message in context: 
http://www.nabble.com/Crawling-local-filesystem-to-provide-search-access-from-web-tp17040516p17040516.html
Sent from the Nutch - User mailing list archive at Nabble.com.
