If you haven't done so yet, you need to activate the index-more and query-more plugins in nutch-site.xml.
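For illustration, here is a rough sketch of what that override could look like. The plugin list is an assumption: it is based on the stock plugin.includes value that shipped in nutch-default.xml around Nutch 0.8, with index-more and query-more added. I have also added language-identifier, since as far as I know that is the plugin containing LanguageIndexingFilter and LanguageQueryFilter. Start from whatever your own nutch-default.xml lists rather than copying this verbatim.

```xml
<configuration>
  <!-- conf/nutch-site.xml: overrides the plugin.includes default from nutch-default.xml -->
  <property>
    <name>plugin.includes</name>
    <!-- Assumed baseline value; only index-more, query-more and
         language-identifier are additions to the stock list. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)|query-(basic|site|url|more)|language-identifier|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Indexing filters only run at index time, so pages indexed before the change would most likely need to be re-indexed before a "lang" field shows up.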
You can also check the "explain" link on the search results page: you will see that "lang" is missing if you haven't activated the index-more and query-more plugins.

Cheers

On 1/31/07, Nes Yarug <[EMAIL PROTECTED]> wrote:
> Thank you everyone for your replies.
>
> I have implemented the recrawl script from
> http://wiki.apache.org/nutch/IntranetRecrawl and that has been running for
> over 12 hours now, so I guess that will index many more pages.
>
> That leaves the question about language-specific search. I have tried adding
> the lang: clause to my search query by appending lang:en, but that is not
> returning any results (as if lang:en became part of the actual query).
> The URL then looks like this:
> search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en
>
> Has anyone used a language-specific search before? Do I need to add a new
> (hidden) input field on the search form to specify the language instead of
> appending it to the query? That would be my preference anyway, as I want the
> language-specific search to be transparent to the user.
>
> Again, many thanks for any replies,
> Nes
>
> On 1/30/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
> >
> > Nes Yarug wrote:
> > > Hi all,
> > >
> > > I'm new to Nutch and I have a few questions that I hope to get some
> > > answers on. Thanks in advance for any replies.
> > >
> > > I want to use Nutch to index a web site I'm maintaining. I've followed
> > > the tutorial for intranet crawling and used a list of links (17420 links
> > > to 8710 pages, each page has two unique links) from my site to crawl
> > > initially.
> > Actually, you don't need to provide a full list of links to Nutch. You
> > can let it discover links as it crawls your site, and constrain them
> > using crawl-urlfilter.txt and regex-urlfilter.txt
> > > The command I used was:
> > >
> > > bin/nutch crawl urls -dir crawl -depth 20 -topN 100
> > >
> > > The crawl completed, but when I tested the search it was clear that it
> > > has not indexed a lot of pages. From what I understand of the following
> > > output, it only fetched 1527 of 21378 pages:
> > >
> > > CrawlDb statistics start: crawl/crawldb
> > > Statistics for CrawlDb: crawl/crawldb
> > > TOTAL urls: 21378
> > > retry 0: 20878
> > > retry 1: 487
> > > retry 2: 10
> > > retry 3: 3
> > > min score: 0.014
> > > avg score: 84.405266
> > > max score: 37106.03
> > > status 1 (DB_unfetched): 19848
> > > status 2 (DB_fetched): 1527
> > > status 3 (DB_gone): 3
> > > CrawlDb statistics: done
> > >
> > >
> > > Now my questions:
> > >
> > > 1) Will Nutch automatically continue to index the rest of the URLs even
> > > though the initial crawl finished (through some internal scheduler of
> > > some sort)?
> > You will need to refetch, or better: increase the depth, until "all your
> > pages" are fetched.
> > >
> > > 2) All of my site's pages at the moment are available in two languages
> > > (each page has exactly two languages, the lang attribute on the html tag
> > > of each page contains the language identifier). When searching, is there
> > > a way to only return pages in a specific language? I know the Nutch UI is
> > > localised, but it will still return pages in English if my UI language is
> > > German, for example. I want it to return German pages only
> > > (<html lang="de">) when searching through the German UI. Is that
> > > possible?
> > try using "lang:" in your query, I'm not sure it's working, though...
> > From the javadoc: "LanguageQueryFilter.java should handle "lang:"
> > query clauses, causing them to search the "lang" field indexed by
> > LanguageIdentifier" (see also LanguageIndexingFilter.java).
> >
> > HTH,
> > Renaud
> >
> >
> > --
> > renaud richardet +1 617 230 9112
> > renaud <at> oslutions.com http://www.oslutions.com
> >
> >

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
