Hi all, I'm new to Nutch and have a few questions that I hope someone can answer. Thanks in advance for any replies.
I want to use Nutch to index a web site I maintain. I followed the tutorial for intranet crawling and seeded the crawl with a list of URLs from my site (17420 links pointing to 8710 pages; each page has two unique links). The command I used was:

  bin/nutch crawl urls -dir crawl -depth 20 -topN 100

The crawl completed, but while testing the search it became clear that a lot of pages had not been indexed. As far as I can tell from the following CrawlDb statistics, only 1527 of the 21378 URLs were fetched:

  CrawlDb statistics start: crawl/crawldb
  Statistics for CrawlDb: crawl/crawldb
  TOTAL urls:   21378
  retry 0:      20878
  retry 1:      487
  retry 2:      10
  retry 3:      3
  min score:    0.014
  avg score:    84.405266
  max score:    37106.03
  status 1 (DB_unfetched):      19848
  status 2 (DB_fetched):        1527
  status 3 (DB_gone):   3
  CrawlDb statistics: done

Now my questions:

1) Will Nutch automatically continue to fetch and index the rest of the URLs even though the initial crawl has finished (through some internal scheduler of some sort)?

2) All of my site's pages are currently available in two languages (each page exists in exactly two language versions, and the lang attribute on the <html> tag identifies the language). When searching, is there a way to return only pages in a specific language? I know the Nutch UI is localised, but it will still return English pages even when my UI language is, say, German. I want it to return only German pages (<html lang="de">) when searching through the German UI. Is that possible?

Many thanks,
Nes
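
P.S. A couple of things I think I've pieced together from the tutorial and would appreciate confirmation on. For question 1, my understanding is that the crawl command does not keep running in the background, and that -topN 100 caps each of the 20 rounds at 100 pages, so the remaining DB_unfetched URLs would have to be picked up by running further rounds by hand. This is just my reading of the step-by-step (whole-web) part of the tutorial, so the exact arguments may be off, and the -topN value below is only an example:

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
  # repeat until DB_unfetched drops, then rebuild the linkdb and the index
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Is that the right approach, or is there a scheduler I'm missing? For question 2, I came across the language-identifier plugin; if I add it to plugin.includes in conf/nutch-site.xml and re-index, would a query like "lang:de foo" restrict results to German pages, or is there a better way to tie this to the UI language?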
