Hello, I tried what Winton said. I generated a file with all the file:///x/y/z URLs, but Nutch won't inject any of them into the crawldb, even though I set crawl-urlfilter.txt to allow everything with "+.". It looks like ./bin/nutch crawl is reading the file, but it finds 0 URLs to fetch. I tested the same thing with http:// links and those get injected fine. Is there a plugin or setting I can modify to allow file: URLs to be injected into the crawldb? Thank you. -Ryan
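For anyone hitting the same thing, here is a hedged sketch of the two places that usually have to change for file: URLs in a stock Nutch 0.x install (file names and default values differ between versions, so treat this as a starting point rather than the definitive fix). The regex URL filter applies the first rule that matches, so adding "+." at the bottom of the file does nothing if the default "skip file:, ftp:, mailto:" rule above it already rejects the URL, and fetching additionally needs the protocol-file plugin enabled:

    # conf/crawl-urlfilter.txt (used by the crawl command) and conf/regex-urlfilter.txt:
    # comment out the default skip rule, or put an accept rule above it.
    # -^(file|ftp|mailto):
    +^file:///x/y/z/

    <!-- conf/nutch-site.xml: add protocol-file to plugin.includes.         -->
    <!-- Copy the default value from nutch-default.xml and append           -->
    <!-- protocol-file; the value below is only illustrative.               -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>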
On Thu, Jul 3, 2008 at 6:03 PM, Winton Davies <[EMAIL PROTECTED]> wrote:
> Ryan,
>
> You can generate a file of FILE urls, e.g.
>
>   file:///x/y/z/file1.html
>   file:///x/y/z/file2.html
>
> Use find and awk accordingly to generate this. Put it in the url directory,
> set depth to 1, and change crawl-urlfilter.txt to admit file:///x/y/z/
> (note: if you don't head-qualify it, it will apparently try to index
> directories above the base one by using ../ notation; I only read this,
> haven't tried it).
>
> Then just do the intranet crawl example.
>
> NOTE: this will NOT (as far as I can see, no matter how much tweaking) use
> ANCHOR TEXT or PageRank (the OPIC version) for any links in these files.
> The ONLY way to get those is to use a webserver, as far as I can tell. I
> don't understand the logic, but there you are. Also note that if you use a
> webserver, you will have to disable the ignore-internal-links setting in
> nutch-site.xml (you'll be messing around a lot in there).
>
> Cheers,
> Winton
>
> At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
>> Is there a simple way to have nutch index a folder full of other folders
>> and html files?
>>
>> I was hoping to avoid having to run apache to serve the html files, and
>> then have nutch crawl the site on apache.
>>
>> Thank you,
>> -Ryan
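As a concrete sketch of the find/awk step Winton mentions, assuming the HTML files live under /x/y/z and the seed directory is named urls/ (both are placeholders, adjust to your layout):

    # Generate a seed file of file:// URLs from a local directory tree.
    # find prints absolute paths like /x/y/z/a.html, so prefixing "file://"
    # yields file:///x/y/z/a.html.
    find /x/y/z -type f -name '*.html' | awk '{ print "file://" $0 }' > urls/seeds.txt

    # Then run the intranet crawl with depth 1:
    bin/nutch crawl urls -dir crawl -depth 1

Paths containing spaces or other special characters would need URL-encoding before injection, which the one-liner above does not handle.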
