Nes Yarug wrote:
Hi all,
I'm new to Nutch and I have a few questions that I hope to get some
answers to. Thanks in advance for any replies.
I want to use Nutch to index a web site I'm maintaining. I've followed
the tutorial for intranet crawling and used a list of links (17420 links
to 8710 pages, each page has two unique links) from my site to crawl
initially.
Actually, you don't need to provide a full list of links to Nutch. You
can let it discover links as it crawls your site, and constrain them
using crawl-urlfilter.txt and regex-urlfilter.txt.
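As a rough sketch of what such a constraint could look like: the filter files take one regular expression per line, prefixed with "+" to accept or "-" to reject matching URLs. The snippet below only sanity-checks a candidate accept pattern with grep before you paste it into the filter file. The domain example.com and the helper url_allowed are placeholders of mine, and grep -E is only an approximation of the Java regex engine Nutch uses, although the two should agree for a pattern this simple.

```shell
# Hypothetical domain: replace example.com with your own site.
# In crawl-urlfilter.txt the accept rule would carry a leading '+':
#   +^http://([a-z0-9]*\.)*example\.com/
pattern='^http://([a-z0-9]*\.)*example\.com/'

# Return 0 if the URL matches the accept pattern, 1 otherwise.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try the pattern on a few sample URLs (all made up for illustration).
for url in \
  'http://www.example.com/page1.html' \
  'http://example.com/de/page1.html' \
  'http://other.org/page.html'
do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

With a rule like this in place, a single start URL for the site should be enough as seed, since links outside the domain are dropped during discovery.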
The command I used was:
bin/nutch crawl urls -dir crawl -depth 20 -topN 100
The crawl completed, but when I was testing the search I became sure
that it has not indexed a lot of pages. From what I understand of the
following output, it only indexed 1527 of 21378 pages:
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 21378
retry 0: 20878
retry 1: 487
retry 2: 10
retry 3: 3
min score: 0.014
avg score: 84.405266
max score: 37106.03
status 1 (DB_unfetched): 19848
status 2 (DB_fetched): 1527
status 3 (DB_gone): 3
CrawlDb statistics: done
Now my questions:
1) Will Nutch automatically continue to index the rest of the URLs even
though the initial crawl finished (through an internal scheduler of some
sort)?
You will need to refetch or, better, increase the depth until all your
pages are fetched.
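To see why so few pages were fetched, a quick back-of-the-envelope calculation may help. As far as I understand the crawl tool, -topN caps the number of pages fetched in each round and -depth is the number of rounds, so the two together bound a single crawl run. The numbers below come from your command and your CrawlDb statistics; the variable names are just mine.

```shell
# Values taken from the crawl command and the CrawlDb statistics above.
depth=20
top_n=100
total_urls=21378

# Upper bound for one crawl run: one round per depth level,
# with at most top_n pages fetched per round.
max_fetched=$((depth * top_n))
echo "at most $max_fetched pages fetched per crawl run"

# Rounds of top_n pages needed to cover every known URL (ceiling division).
rounds_needed=$(( (total_urls + top_n - 1) / top_n ))
echo "about $rounds_needed rounds of $top_n pages to cover $total_urls URLs"
```

So the 1527 fetched pages are close to the 2000-page ceiling of that run, and raising -topN is likely to help at least as much as raising -depth.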
2) All of my site's pages are currently available in two languages (each
page exists in exactly two languages, and the lang attribute on the html
tag of each page contains the language identifier). When searching, is
there a way to only return pages in a specific language? I know the
Nutch UI is localised, but it will still return pages in English if my
UI language is German, for example. I want it to return German pages
only (<html lang="de">) when searching through the German UI. Is that
possible?
Try using "lang:" in your query. I'm not sure it's working, though...
From the javadoc: "LanguageQueryFilter.java should handles "lang:"
query clauses, causing them to search the "lang" field indexed by
LanguageIdentifier" (see also LanguageIndexingFilter.java).
HTH,
Renaud
--
renaud richardet +1 617 230 9112
renaud <at> oslutions.com http://www.oslutions.com