Nes Yarug wrote:
> Hi all,
>
> I'm new to Nutch and I have a few questions that I hope to get some answers
> on. Thanks in advance for any replies.
>
> I want to use Nutch to index a web site I'm maintaining. I've followed the
> tutorial for intranet crawling and used a list of links (17420 links to
> 8710 pages, each page has two unique links) from my site to crawl
> initially. The command I used was:
>
> bin/nutch crawl urls -dir crawl -depth 20 -topN 100
Here you are using topN, which limits each round of the crawl to the 100
highest-scoring URLs, so most of your URLs are never selected for
fetching. You probably don't need a depth of 20 either. Starting from
your homepage, what is the largest number of clicks it takes to reach any
page on your site? That should be your depth. If you drop topN, I think
you will be able to get all of your pages.
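
As a rough sketch of why the crawl stopped short: with -depth 20 and
-topN 100, no more than 20 x 100 URLs can be fetched in total. The
snippet below works that out and then re-runs the crawl without -topN.
The depth of 5 and the crawl-full output directory are assumptions for
illustration only; use your site's real click depth and whatever fresh
directory suits you.

```shell
#!/bin/sh
# Upper bound on pages fetched by the original command:
# each of the -depth rounds fetches at most -topN URLs.
depth=20
topn=100
max_fetched=$((depth * topn))
echo "at most $max_fetched pages fetched"

# Re-run without -topN so each round is not capped at 100 URLs.
# Assumptions: depth 5 and the crawl-full directory are placeholders.
# Guarded so the sketch does nothing outside a Nutch install directory.
if [ -x bin/nutch ]; then
  bin/nutch crawl urls -dir crawl-full -depth 5
fi
```

That bound of 2000 is in line with the 1527 fetched pages in your stats.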
>
> The crawl completed, but I'm sure that when I was testing the search it has
> not indexed a lot of pages. What I understand from the following command it
> only indexed 1527 of 21378 pages:
>
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls: 21378
> retry 0: 20878
> retry 1: 487
> retry 2: 10
> retry 3: 3
> min score: 0.014
> avg score: 84.405266
> max score: 37106.03
> status 1 (DB_unfetched): 19848
> status 2 (DB_fetched): 1527
> status 3 (DB_gone): 3
> CrawlDb statistics: done
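
For what it's worth, these numbers look like the output of
"bin/nutch readdb crawl/crawldb -stats". A small sketch for turning them
into a fetched percentage follows. The sample text is copied from the
stats above, and the variable names are only illustrative.

```shell
#!/bin/sh
# Sample copied from the stats quoted above; on a live install the same
# text would come from: bin/nutch readdb crawl/crawldb -stats
stats='TOTAL urls: 21378
status 1 (DB_unfetched): 19848
status 2 (DB_fetched): 1527
status 3 (DB_gone): 3'

# Pull the number after the colon on the matching line.
total=$(printf '%s\n' "$stats" | awk -F': *' '/TOTAL urls/ {print $2}')
fetched=$(printf '%s\n' "$stats" | awk -F': *' '/DB_fetched/ {print $2}')
unfetched=$(printf '%s\n' "$stats" | awk -F': *' '/DB_unfetched/ {print $2}')

# Share of known URLs that were actually fetched.
pct=$(awk -v f="$fetched" -v t="$total" 'BEGIN {printf "%.1f", 100 * f / t}')
echo "fetched $fetched of $total URLs ($pct%), $unfetched still unfetched"
```

Re-running the same check after a crawl without -topN should show the
DB_unfetched count dropping.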
>
>
> Now my questions:
>
> 1) Will Nutch automatically continue to index the rest of the URLs even
> though the initial crawl finished (through some internal scheduler of some
> sort)?
No, not with topN set like that. You could raise it from 100 to, say,
5000, but I still think that wouldn't get all the pages. It is better to
leave it off, especially if you are only indexing a single site.
>
> 2) All of my site's pages at the moment are contained in two languages
> (each page has exactly two languages, the lang attribute on the html tag
> of each page contains the language identifier). When searching, is there
> a way to only return pages in a specific language? I know the Nutch UI is
> localised, but it will still return pages in English if my UI language is
> German for example. I want it to return German pages only
> (<html lang="de">) when searching through the German UI. Is that possible?
I believe the lang attribute is stored as a field during indexing (this
depends on your settings, but I believe it is the default). You can then
add a required term for the language to the query in search.jsp, like
this:
query.addRequiredTerm("en", "lang"); // substitute language for en
>
> Many thanks,
> Nes
>
Dennis Kubes
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general