Hi, I'm a new Nutch user. My company wants me to look into using this technology to index our internal wiki site as well as SharePoint docs (via Tika).
Right now I just want Nutch to index the entire wiki site, but I'm having problems. I've read about other people having trouble with this, but I haven't found a solution that worked for me. I have Nutch 1.0 installed. The wiki is MoinMoin, if that helps. The pages don't have extensions like .html; they are of the form http://wiki:8000/Engineering, so all pages have paths only one level deep.

I'm running Nutch with the following command:

    bin/nutch crawl urls -dir crawl -depth 100 -topN 1000000 > crawl.log

I have a urls folder containing a file called wiki that points to the top-level page of the site. I set crawl-urlfilter.txt to accept everything except the default exclusions:

    -^(file|ftp|mailto):
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
    -[?*!@=]
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/
    +.

I also set the db.ignore.external.links property in nutch-default.xml to true so the crawl doesn't leave the site (db.ignore.internal.links is set to false). I've pasted the relevant bits of my setup at the end of this message.

After the crawl command completes, the search returns some pages, but pages that are maybe 2 or 3 levels from the starting page still don't show up in the search results. Any help would be appreciated.

Thanks,
Kane
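P.S. In case the exact details help, here is roughly what my seed file contains: just the one top-level URL (the host and port match the example URL above; they stand in for our real internal server):

    $ cat urls/wiki
    http://wiki:8000/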
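And these are the two db.ignore properties as I have them set in nutch-default.xml (description elements trimmed for brevity):

    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>

    <property>
      <name>db.ignore.internal.links</name>
      <value>false</value>
    </property>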
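If it would help with diagnosis, I can also post the crawldb stats from after the run. I believe this is the right incantation for 1.0:

    bin/nutch readdb crawl/crawldb -stats

which should show how many URLs ended up fetched vs. unfetched.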