Hi,

It is definitely possible. Feed the URLs in the form
file:/<absolute directory path> in the 'urls' seed file.
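For example, the 'urls' file could contain a single line like this
(the path below is just an illustration; use the directory where your
HTML files actually live):

  file:/var/www/html/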

Be sure to change the crawl-urlfilter to accept 'file:' URLs (by
default, file: links are filtered out). You may also want to stop
accepting http:// links if you know that everything you want is on
your local machine; that way the crawler will not go out and start
fetching internet content.
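For illustration, the relevant lines in conf/crawl-urlfilter.txt
might end up looking roughly like this (the stock file differs
between Nutch versions, so treat these patterns as a sketch rather
than a drop-in replacement):

  # accept file: urls (the default config skips these)
  +^file:

  # skip http:, ftp: and mailto: urls so the crawl stays local
  -^(http|ftp|mailto):

  # skip everything else
  -.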

The only issue with this is that the URLs returned as part of the
search results will be of the same form, i.e. file:/<path>. Hence you
might want to add a transformation from these to the corresponding
http URLs as they would be served by the web server.
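A minimal sketch of such a mapping, assuming a hypothetical document
root and site base (both constants below are assumptions; adjust them
to match your own layout):

  // Hypothetical helper: map a crawled file: URL back to the http URL
  // that the web server would serve for the same document.
  public class FileUrlMapper {
      private static final String DOC_ROOT = "file:/var/www/html";
      private static final String SITE_BASE = "http://www.example.com";

      public static String toHttpUrl(String fileUrl) {
          if (fileUrl.startsWith(DOC_ROOT)) {
              // swap the filesystem prefix for the site prefix
              return SITE_BASE + fileUrl.substring(DOC_ROOT.length());
          }
          return fileUrl; // leave non-matching URLs unchanged
      }
  }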

HTH,
Praveen.

On 7/27/05, Vacuum Joe <[EMAIL PROTECTED]> wrote:
> I have a web server with about 4gb of static HTML
> files on it.  Is there a way to get Nutch to crawl
> those files directly from the filesystem, without
> going through the web server?  Obviously, I could have
> it go through the web server to do this, but the crawl
> is going to be much faster if it could just read the
> files directly from the disk.  Is it possible?
> 
> Thanks
> 
> 
