Winton, I added the override property to nutch-site.xml (I saw the one in nutch-default.xml after your email), but still no URLs are being added to the crawldb. Can you verify this by trying to inject file URLs into a test crawldb? Any other ideas?
-Ryan

On Sat, Jul 5, 2008 at 5:47 PM, Winton Davies <[EMAIL PROTECTED]> wrote:

> Hey Ryan,
>
> There's something else that needs to be set as well - sorry I forgot about it.
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> Hope this helps!
>
> W
>
>> Hello,
>> I tried what Winton said. I generated a file with all the file:///x/y/z
>> URLs, but Nutch won't inject any into the crawldb.
>> I even set crawl-urlfilter.txt to allow everything:
>> +.
>> It seems like ./bin/nutch crawl is reading the file, but it's finding 0
>> URLs to fetch. I tested this on http:// links and they do get injected.
>> Is there a plugin or something I can modify to allow file URLs to be
>> injected into the crawldb?
>> Thank you.
>> -Ryan
>>
>> On Thu, Jul 3, 2008 at 6:03 PM, Winton Davies <[EMAIL PROTECTED]>
>> wrote:
>>
>>> Ryan,
>>>
>>> You can generate a file of FILE URLs, e.g.:
>>>
>>> file:///x/y/z/file1.html
>>> file:///x/y/z/file2.html
>>>
>>> Use find and AWK accordingly to generate this. Put it in the url
>>> directory and just set depth to 1, and change crawl-urlfilter.txt to
>>> admit file:///x/y/z/. (Note: if you don't anchor it at the start, it
>>> will apparently try to index directories above the base one, using ../
>>> notation. I only read this; I haven't tried it.)
>>>
>>> Then just do the intranet crawl example.
>>>
>>> NOTE: this will NOT (as far as I can see, no matter how much tweaking)
>>> use ANCHOR TEXT or PageRank (OPIC version) for any links in these files.
>>> The ONLY way to do that is to use a webserver, as far as I can tell. I
>>> don't understand the logic, but there you are. Note: if you use a
>>> webserver, be aware you will have to disable the IGNORE.INTERNAL setting
>>> in nutch-site.xml (you'll be messing around a lot in there).
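Winton's find-and-AWK suggestion can be sketched roughly like this. This is a hedged sketch, not a tested Nutch recipe: the /tmp/seedtest tree below is a hypothetical stand-in for the /x/y/z placeholder used in the thread.

```shell
# Build a seed list of file:// URLs from a directory tree.
# /tmp/seedtest/x/y/z is a made-up base directory for illustration.
mkdir -p /tmp/seedtest/x/y/z
touch /tmp/seedtest/x/y/z/file1.html /tmp/seedtest/x/y/z/file2.html

# find emits absolute paths like /tmp/seedtest/x/y/z/file1.html;
# prefixing "file://" yields the three-slash form file:///tmp/...
find /tmp/seedtest/x/y/z -name '*.html' \
  | awk '{ print "file://" $0 }' > /tmp/seedtest/urls.txt
```

The resulting urls.txt would then go into the url directory that the intranet crawl example reads its seeds from.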
>>>
>>> Cheers,
>>> Winton
>>>
>>> At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
>>>
>>>> Is there a simple way to have Nutch index a folder full of other
>>>> folders and HTML files?
>>>>
>>>> I was hoping to avoid having to run Apache to serve the HTML files,
>>>> and then have Nutch crawl the site on Apache.
>>>>
>>>> Thank you,
>>>> -Ryan
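Pulling the thread's advice together, a start-anchored crawl-urlfilter.txt along the lines Winton describes might look like this (the /x/y/z base directory is the thread's placeholder, and this fragment is an assumption about a working setup, not a confirmed one):

```
# Hypothetical crawl-urlfilter.txt fragment: admit only file: URLs
# under the base directory. Anchoring the pattern with ^ is what
# prevents the ../ escapes into parent directories Winton mentions.
+^file:///x/y/z/

# reject everything else
-.
```

With protocol-file added to plugin.includes and the seed list in the url directory, the intranet crawl example would then be invoked with something like `bin/nutch crawl urls -dir crawl -depth 1`, per Winton's suggestion to set depth to 1.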
