Nes Yarug wrote:
> Hi all,
>
> I'm new to Nutch and I have a few questions that I hope to get some
> answers on. Thanks in advance for any replies.
>
> I want to use Nutch to index a web site I'm maintaining. I've followed
> the tutorial for intranet crawling and used a list of links (17420 links
> to 8710 pages, each page has two unique links) from my site to crawl
> initially.

Actually, you don't need to provide a full list of links to Nutch. You can let it discover links as it crawls your site, and constrain them using crawl-urlfilter.txt and regex-urlfilter.txt.

> The command I used was:
>
> bin/nutch crawl urls -dir crawl -depth 20 -topN 100
>
> The crawl completed, but I'm sure that when I was testing the search
> it has not indexed a lot of pages. What I understand from the following
> command it only indexed 1527 of 21378 pages:
>
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls: 21378
> retry 0: 20878
> retry 1: 487
> retry 2: 10
> retry 3: 3
> min score: 0.014
> avg score: 84.405266
> max score: 37106.03
> status 1 (DB_unfetched): 19848
> status 2 (DB_fetched): 1527
> status 3 (DB_gone): 3
> CrawlDb statistics: done
>
> Now my questions:
>
> 1) Will Nutch automatically continue to index the rest of the URLs even
> though the initial crawl finished (through some internal scheduler of
> some sorts)?

You will need to refetch or, better, increase the depth until all your pages are fetched.

> 2) All of my site's pages at the moment are contained in two languages
> (each page has exactly two languages, the lang attribute on the html tag
> of each page contains the language identifier). When searching, is there
> a way to only return pages in a specific language? I know the Nutch UI is
> localised, but it will still return pages in English if my UI language is
> German, for example. I want it to return German pages only
> (<html lang="de">) when searching through the German UI. Is that
> possible?
Try using "lang:" in your query (for example lang:de). I'm not sure it's working, though. According to the javadoc, LanguageQueryFilter.java should handle "lang:" query clauses, causing them to search the "lang" field indexed by LanguageIdentifier (see also LanguageIndexingFilter.java).
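Back to question 1: a rough back-of-the-envelope sketch in Python of why only part of the site was fetched. It assumes, as the Nutch tutorial describes, that -topN caps the number of pages fetched per round and that -depth is the number of rounds. The function names are made up for illustration and are not part of Nutch.

```python
# Sketch only: assumes -topN is a per-round cap and -depth is the
# number of fetch rounds. Function names are hypothetical.

def max_fetched(depth: int, top_n: int) -> int:
    """Upper bound on pages fetched by one crawl run."""
    return depth * top_n


def rounds_needed(total_urls: int, top_n: int) -> int:
    """Minimum number of rounds to fetch total_urls at top_n per round."""
    return -(-total_urls // top_n)  # ceiling division


depth, top_n = 20, 100             # values from the crawl command above
total_urls, fetched = 21378, 1527  # values from the CrawlDb statistics

print(max_fetched(depth, top_n))             # 2000
print(fetched <= max_fetched(depth, top_n))  # True
print(rounds_needed(total_urls, top_n))      # 214
```

Under that assumption the run could never fetch more than 2000 pages, so raising -depth, raising -topN, or both should let the crawl get through the remaining URLs.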
HTH,
Renaud

--
renaud richardet
+1 617 230 9112
renaud <at> oslutions.com
http://www.oslutions.com

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
