Hi all, I'm new to Nutch and have a few questions that I hope someone can answer. Thanks in advance for any replies.
I want to use Nutch to index a web site I maintain. I followed the tutorial for intranet crawling and seeded the crawl with a list of URLs from my site (17420 links pointing to 8710 pages; each page has two unique links). The command I used was:

  bin/nutch crawl urls -dir crawl -depth 20 -topN 100

The crawl completed, but while testing the search it became clear that a lot of pages had not been indexed. As far as I can tell from the following CrawlDb statistics, only 1527 of the 21378 URLs were fetched:

  CrawlDb statistics start: crawl/crawldb
  Statistics for CrawlDb: crawl/crawldb
  TOTAL urls:   21378
  retry 0:      20878
  retry 1:      487
  retry 2:      10
  retry 3:      3
  min score:    0.014
  avg score:    84.405266
  max score:    37106.03
  status 1 (DB_unfetched):      19848
  status 2 (DB_fetched):        1527
  status 3 (DB_gone):   3
  CrawlDb statistics: done

Now my questions:

1) Will Nutch automatically continue to fetch and index the rest of the URLs even though the initial crawl has finished (through some internal scheduler of some sort)?

2) All of my site's pages are currently available in two languages (each page exists in exactly two language versions, and the lang attribute on the <html> tag identifies the language). When searching, is there a way to return only pages in a specific language? I know the Nutch UI is localised, but it will still return English pages even when my UI language is, say, German. I want it to return only German pages (<html lang="de">) when searching through the German UI. Is that possible?

Many thanks,
Nes
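
P.S. A couple of things I think I've pieced together from the tutorial and would appreciate confirmation on. For question 1, my understanding is that the crawl command does not keep running in the background, and that -topN 100 caps each of the 20 rounds at 100 pages, so the remaining DB_unfetched URLs would have to be picked up by running further rounds by hand. This is just my reading of the step-by-step (whole-web) part of the tutorial, so the exact arguments may be off, and the -topN value below is only an example:

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
  # repeat until DB_unfetched drops, then rebuild the linkdb and the index
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Is that the right approach, or is there a scheduler I'm missing? For question 2, I came across the language-identifier plugin; if I add it to plugin.includes in conf/nutch-site.xml and re-index, would a query like "lang:de foo" restrict results to German pages, or is there a better way to tie this to the UI language?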
