Re: Crawling local filesystem to provide search access from web

ogjunk-nutch Sat, 03 May 2008 18:57:27 -0700

Hi,

There is no automated way to switch from file:/... to http://..., but I imagine 
you could easily change the JSP that handles the search results display and add a 
little JSP scriptlet that does a url = url.replace("file:/....", 
"http://mysite.com/") type of thing.
 
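Something along these lines might do as a starting point. It is only a rough
sketch: the class and method names are made up for illustration, and the two
prefixes are assumptions taken from the example paths in your message
(file:/home/htmlfiles/ and http://mysite.com/), so adjust them to your setup.

```java
// Hypothetical helper, not part of Nutch: maps a crawled file: URL to the
// public http: URL served by Apache.
class UrlRewriter {
    // Assumed prefixes, based on the example paths quoted in this thread.
    static final String FILE_PREFIX = "file:/home/htmlfiles/";
    static final String WEB_PREFIX = "http://mysite.com/";

    // Swap the local prefix for the web prefix; leave any other URL untouched.
    static String toWebUrl(String url) {
        if (url != null && url.startsWith(FILE_PREFIX)) {
            return WEB_PREFIX + url.substring(FILE_PREFIX.length());
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(toWebUrl("file:/home/htmlfiles/page1.html"));
    }
}
```

In the JSP you would apply the same logic to the hit URL just before it is
written into the result link. Matching on the prefix with startsWith, rather
than a plain replace, avoids touching a URL that merely contains the same
text somewhere in the middle.
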
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: ivrokv <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Sunday, May 4, 2008 12:24:12 AM
> Subject: Crawling local filesystem to provide search access from web
> 
> 
> Hello,
> 
> I am using nutch-0.9 for indexing html files which are present on the same 
> server (Server1) as nutch. Thus, I am using protocol-file for
> fetching and subsequently indexing. This is working out just great.
> 
> My problem is this:
> 
> I place the html files in the public folder of my apache server (Server1, the
> same server used for crawling the local files) so that they can be accessed
> at http://mysite.com/page1.html
> 
> When I run a search query on nutch jsp search page, the search results have
> a url which is a local filesystem path such as
> file:/home/htmlfiles/page1.html
> 
> 
> Is it possible to provide nutch with the local filesystem path in the urls
> folder for crawling and indexing files (a local filesystem path such as
> /home/htmlfiles/page1.html), but at query time have the nutch jsp
> present the web url (http://mysite.com/page1.html) to the search user?
> 
> Would this involve some kind of URL normalization in nutch?
> 
> 
> Ideally I would prefer to crawl the files from the local FS rather than have
> them crawled from the website root folder. I have noticed that crawling is
> much faster (since the files are local to nutch) than when I crawl from
> mysite.com, even though in both cases the files are on the same physical
> server.
> 
> One obvious solution is to have nutch fetch the html pages from the
> mysite.com root folder, so that the url shows up correctly as
> mysite.com/page.html when a search is performed on nutch. I have tried this
> and it works well, but the fetching speed is very slow and I would prefer to
> crawl the files using the file protocol, which appears to be much faster.
> 
> Thank you for any advice and help.
> 
> Regards
> 
> taknev 
> 
> 
> 
> 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/Crawling-local-filesystem-to-provide-search-access-from-web-tp17040516p17040516.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
>
