Nes Yarug wrote:
Hi all,

I'm new to Nutch and I have a few questions that I hope to get some answers
on. Thanks in advance for any replies.

I want to use Nutch to index a web site I'm maintaining. I've followed the tutorial for intranet crawling and used a list of links (17420 links to 8710 pages, each page has two unique links) from my site to crawl initially.
Actually, you don't need to provide a full list of links to Nutch. You can let it discover links as it crawls your site, and constrain them using crawl-urlfilter.txt and regex-urlfilter.txt.
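
As a rough illustration of constraining the crawl, a filter file along these lines restricts Nutch to a single domain. This is only a sketch: MY.DOMAIN.NAME is a placeholder for your own host, and the patterns should be adapted to your site. Rules are tried top to bottom, "+" accepts a URL, "-" rejects it, and the first matching rule wins.

```
# conf/crawl-urlfilter.txt (sketch; MY.DOMAIN.NAME is a placeholder)

# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.
```

With a filter like this, the urls directory only needs to hold one or a few start pages rather than the full list.
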
The command I used was:

bin/nutch crawl urls -dir crawl -depth 20 -topN 100

The crawl completed, but when I tested the search I could tell that it had not indexed a lot of pages. From the following statistics, I understand that it
only indexed 1527 of 21378 pages:

CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     21378
retry 0:        20878
retry 1:        487
retry 2:        10
retry 3:        3
min score:      0.014
avg score:      84.405266
max score:      37106.03
status 1 (DB_unfetched):        19848
status 2 (DB_fetched):  1527
status 3 (DB_gone):     3
CrawlDb statistics: done


Now my questions:

1) Will Nutch automatically continue to index the rest of the URLs even
though the initial crawl finished (through some internal scheduler of some
sort)?
Not automatically. You will need to refetch or, better, increase the depth until "all your pages" are fetched.
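
To see why so few pages were fetched, it may help to work through the numbers. The sketch below assumes that -topN caps the number of URLs fetched in each round and that -depth is the number of rounds; the values come from the crawl command and the CrawlDb statistics above, and the variable names are only illustrative.

```shell
#!/bin/sh
# Values taken from the crawl command and the CrawlDb statistics in this thread.
depth=20          # -depth: number of generate/fetch rounds
top_n=100         # -topN: maximum URLs fetched per round
total_urls=21378  # "TOTAL urls" reported for crawl/crawldb

# Each round fetches at most top_n URLs, so the whole crawl is capped at:
max_fetched=$((depth * top_n))

# Rounds needed to fetch every known URL at this top_n (ceiling division):
rounds_needed=$(( (total_urls + top_n - 1) / top_n ))

echo "upper bound on fetched pages: $max_fetched"
echo "rounds needed at topN=$top_n: $rounds_needed"
```

Under these assumptions the crawl could never have fetched more than 2000 pages, which is consistent with the 1527 reported as DB_fetched. Raising -topN is therefore likely to matter at least as much as raising -depth.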

2) All of my site's pages are currently available in two languages (each page exists in exactly two language versions, and the lang attribute on the html tag of each
page contains the language identifier). When searching, is there a way to
only return pages in a specific language? I know the Nutch UI is localised,
but it will still return pages in English if my UI language is German, for
example. I want it to return German pages only (<html lang="de">) when
searching through the German UI. Is that possible?
Try using "lang:" in your query (for example, lang:de). I'm not sure it works, though.
According to the Javadoc, LanguageQueryFilter.java handles "lang:" query clauses, causing them to search the "lang" field indexed by LanguageIdentifier (see also LanguageIndexingFilter.java).
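
For the "lang" field to exist at all, the language-identifier plugin has to be enabled before the pages are indexed. A sketch of what that might look like in conf/nutch-site.xml follows. The plugin list shown is an assumption and differs between Nutch versions, so copy the plugin.includes value from your own conf/nutch-default.xml and append language-identifier to it.

```xml
<!-- conf/nutch-site.xml (sketch; the value below is illustrative only) -->
<property>
  <name>plugin.includes</name>
  <!-- assumed list: take the real one from conf/nutch-default.xml
       and add |language-identifier at the end -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier</value>
</property>
```

Since the field is added at indexing time, the site would need to be re-indexed after this change. A query such as "nutch lang:de" should then be limited to pages identified as German, if the filter behaves as documented.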

HTH,
Renaud


--
renaud richardet                           +1 617 230 9112
renaud <at> oslutions.com         http://www.oslutions.com
