Hi Winton,

I found my problem: I was only editing crawl-urlfilter.txt and not
regexp-urlfilter.txt.
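In case anyone else hits this, the rule I added to both filter files is
something like the following (the path is just my example; adjust it to
wherever your files live):

  +^file:///x/y/z/

As far as I can tell, the crawl tool's config points at
crawl-urlfilter.txt while the urlfilter-regex plugin reads
regexp-urlfilter.txt by default, so to be safe the pattern needs to be
in both.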
Thanks for the help. I have two questions:

1. After I crawl my files, they will be indexed with file:///x/y/z/...
links. Is there any chance I can easily change the link prefix to
http://somesite.com/ ?

2. I noticed from the tutorial that I only get one path for Nutch to
serve searches from:
http://peterpuwang.googlepages.com/NutchGuideForDummies.htm

  d. Set Your Searcher Directory
  Next, navigate to your nutch webapp folder, then WEB-INF/classes.
  Edit the nutch-site.xml file and add the following to it (make sure
  you don't have two sets of <configuration></configuration> tags!):

  <configuration>
    <property>
      <name>searcher.dir</name>
      <value>your_crawl_folder_here</value>
    </property>
  </configuration>

Can I have Nutch search multiple crawl folders?
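(On question 2: the searcher.dir description in nutch-default.xml also
mentions a search-servers.txt file for distributed search, so I'm
guessing one search server per crawl folder might work - something like
the below, with searcher.dir pointed at the directory holding
search-servers.txt - but I haven't tried it, and the ports and paths
here are made up:

  search-servers.txt:
    localhost 9991
    localhost 9992

  bin/nutch server 9991 /path/to/crawl1
  bin/nutch server 9992 /path/to/crawl2
)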
Thanks again,
-Ryan

On Sat, Jul 5, 2008 at 7:17 PM, Winton Davies <[EMAIL PROTECTED]> wrote:

> Hi Ryan,
>
> I just used the regular intranet crawl, didn't try to do the inject.
>
> W
>
> At 6:16 PM -0400 7/5/08, Ryan Smith wrote:
>
>> Winton,
>> I added the override property to nutch-site.xml (I saw the one in
>> nutch-default.xml after your email), but still no URLs are being
>> added to the crawldb.
>> Can you verify this by trying to inject file URLs into a test
>> crawldb? Any other ideas?
>>
>> -Ryan
>>
>> On Sat, Jul 5, 2008 at 5:47 PM, Winton Davies <[EMAIL PROTECTED]>
>> wrote:
>>
>>> Hey Ryan,
>>>
>>> There's something else that needs to be set as well - sorry I
>>> forgot about it:
>>>
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>> </property>
>>>
>>> Hope this helps!
>>>
>>> W
>>>
>>>> Hello,
>>>> I tried what Winton said. I generated a file with all the
>>>> file:///x/y/z URLs, but Nutch won't inject any into the crawldb.
>>>> I even set crawl-urlfilter.txt to allow everything: +.
>>>> It seems like ./bin/nutch crawl is reading the file, but it's
>>>> finding 0 URLs to fetch. I tested this with http:// links and they
>>>> get injected.
>>>> Is there a plugin or something I can modify to allow file URLs to
>>>> be injected into the crawldb?
>>>> Thank you.
>>>> -Ryan
>>>>
>>>> On Thu, Jul 3, 2008 at 6:03 PM, Winton Davies <[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>>> Ryan,
>>>>>
>>>>> You can generate a file of FILE urls, e.g.
>>>>>
>>>>> file:///x/y/z/file1.html
>>>>> file:///x/y/z/file2.html
>>>>>
>>>>> Use find and awk accordingly to generate this (see the sketch
>>>>> below). Put it in the url directory and just set depth to 1, and
>>>>> change crawl-urlfilter.txt to admit file:///x/y/z/. (Note: if you
>>>>> don't head-qualify it, it will apparently try to index
>>>>> directories above the base one, using ../ notation. I only read
>>>>> this, haven't tried it.)
>>>>>
>>>>> Then just do the intranet crawl example.
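>>>>> (Sketch for the find/awk step - untested, and assuming your pages
>>>>> all live under /x/y/z and end in .html; the seed-file name is up
>>>>> to you:
>>>>>
>>>>>   find /x/y/z -name '*.html' | awk '{print "file://" $0}' > urls/seeds.txt
>>>>>
>>>>> find prints absolute paths, so prefixing file:// yields the
>>>>> file:///x/y/z/... form above.)
>>>>>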
>>>>> NOTE: this will NOT (as far as I can see, no matter how much
>>>>> tweaking) use ANCHOR TEXT or PageRank (the OPIC version) for any
>>>>> links in these files. The ONLY way to do that is to use a
>>>>> webserver, as far as I can tell. I don't understand the logic,
>>>>> but there you are. Also note, if you use a webserver, be aware
>>>>> you will have to disable the db.ignore.internal.links setting in
>>>>> nutch-site.xml (you'll be messing around a lot in here).
>>>>>
>>>>> Cheers,
>>>>> Winton
>>>>>
>>>>> At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
>>>>>
>>>>>> Is there a simple way to have Nutch index a folder full of other
>>>>>> folders and HTML files?
>>>>>>
>>>>>> I was hoping to avoid having to run Apache to serve the HTML
>>>>>> files, and then have Nutch crawl the site on Apache.
>>>>>>
>>>>>> Thank you,
>>>>>> -Ryan