If you haven't done so yet, you need to activate the index-more and query-more plugins in nutch-site.xml.
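One quick way to double-check the plugin setting is a small script like the sketch below. It rests on assumptions not stated in this thread: that the plugin list lives in the `plugin.includes` property, and that its value is a regular expression that must match a plugin id in full. The sample configuration value is purely illustrative, and `language-identifier` is included only because the LanguageIndexingFilter and LanguageQueryFilter classes mentioned further down the thread appear to belong to a plugin of that name. The helper name `plugins_enabled` is made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml content, for illustration only.
SAMPLE_SITE_XML = """\
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)|language-identifier</value>
  </property>
</configuration>
"""


def plugins_enabled(site_xml, plugin_ids):
    """Return {plugin_id: bool} for the plugin.includes regex in site_xml.

    Returns None when the property is not set in this file at all
    (in that case the defaults from elsewhere would apply).
    Assumption: the value is a regex that must match the whole plugin id.
    """
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == "plugin.includes":
            pattern = re.compile((prop.findtext("value") or "").strip())
            return {pid: pattern.fullmatch(pid) is not None for pid in plugin_ids}
    return None


if __name__ == "__main__":
    wanted = ["index-more", "query-more", "language-identifier"]
    for plugin_id, active in plugins_enabled(SAMPLE_SITE_XML, wanted).items():
        print(f"{plugin_id}: {'enabled' if active else 'NOT enabled'}")
```

To use it on a real installation, read the contents of your own nutch-site.xml into a string and pass it in place of `SAMPLE_SITE_XML`.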
You can also check the "explain" link on the search results page; you will see that "lang" is missing if you haven't activated the index-more and query-more plugins.

Cheers

On 1/31/07, Nes Yarug <[EMAIL PROTECTED]> wrote:
Thank you everyone for your replies. I have implemented the recrawl script
from http://wiki.apache.org/nutch/IntranetRecrawl and it has been running for
over 12 hours, so I guess it will index many more pages.

That leaves the question about language-specific search. I have tried adding
the lang: clause to my search query by appending lang:en, but that returns no
results (as if lang:en became part of the actual query text). The URL then
looks like this:

search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en

Has anyone used a language-specific search before? Do I need to add a new
(hidden) input field on the search form to specify the language instead of
appending it to the query? That would be my preference anyway, as I want the
language-specific search to be transparent to the user.

Again, many thanks for any replies,
Nes

On 1/30/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>
> Nes Yarug wrote:
> > Hi all,
> >
> > I'm new to Nutch and I have a few questions that I hope to get some
> > answers on. Thanks in advance for any replies.
> >
> > I want to use Nutch to index a web site I'm maintaining. I've followed
> > the tutorial for intranet crawling and used a list of links (17420 links
> > to 8710 pages, each page has two unique links) from my site to crawl
> > initially.
> Actually, you don't need to provide a full list of links to Nutch. You
> can let it discover links as it crawls your site, and constrain them
> using crawl-urlfilter.txt and regex-urlfilter.txt
>
> > The command I used was:
> >
> > bin/nutch crawl urls -dir crawl -depth 20 -topN 100
> >
> > The crawl completed, but from testing the search I'm sure that it has
> > not indexed a lot of pages. As I understand the output of the following
> > command, it only indexed 1527 of 21378 pages:
> >
> > CrawlDb statistics start: crawl/crawldb
> > Statistics for CrawlDb: crawl/crawldb
> > TOTAL urls: 21378
> > retry 0: 20878
> > retry 1: 487
> > retry 2: 10
> > retry 3: 3
> > min score: 0.014
> > avg score: 84.405266
> > max score: 37106.03
> > status 1 (DB_unfetched): 19848
> > status 2 (DB_fetched): 1527
> > status 3 (DB_gone): 3
> > CrawlDb statistics: done
> >
> >
> > Now my questions:
> >
> > 1) Will Nutch automatically continue to index the rest of the URLs even
> > though the initial crawl finished (through some internal scheduler of
> > some sort)?
> You will need to refetch, or better: increase the depth, until "all your
> pages" are fetched.
> >
> > 2) All of my site's pages at the moment are contained in two languages
> > (each page has exactly two languages, the lang attribute on the html tag
> > of each page contains the language identifier). When searching, is there
> > a way to only return pages in a specific language? I know the Nutch UI
> > is localised, but it will still return pages in English if my UI
> > language is German, for example. I want it to return German pages only
> > (<html lang="de">) when searching through the German UI. Is that
> > possible?
> Try using "lang:" in your query; I'm not sure it's working, though...
> From the javadoc: "LanguageQueryFilter.java should handle "lang:"
> query clauses, causing them to search the "lang" field indexed by
> LanguageIdentifier" (see also LanguageIndexingFilter.java).
>
> HTH,
> Renaud
>
>
> --
> renaud richardet +1 617 230 9112
> renaud <at> oslutions.com http://www.oslutions.com
>
>
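As a sanity check on the "1527 of 21378" reading of the CrawlDb statistics quoted above, here is a minimal sketch that parses that kind of output and reports the fetched share. It assumes the statistics are available as plain text in the layout shown in the thread; the function name `parse_crawldb_stats` is made up for this example and is not part of Nutch.

```python
import re

# Statistics copied from the message quoted above.
SAMPLE_STATS = """\
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 21378
retry 0: 20878
retry 1: 487
retry 2: 10
retry 3: 3
min score: 0.014
avg score: 84.405266
max score: 37106.03
status 1 (DB_unfetched): 19848
status 2 (DB_fetched): 1527
status 3 (DB_gone): 3
CrawlDb statistics: done
"""


def parse_crawldb_stats(text):
    """Return (total_urls, {status_name: count}) from CrawlDb stats text."""
    total = None
    counts = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"TOTAL urls:\s*(\d+)$", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.match(r"status \d+ \((\w+)\):\s*(\d+)$", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return total, counts


total, counts = parse_crawldb_stats(SAMPLE_STATS)
fetched = counts["DB_fetched"]
unfetched = counts["DB_unfetched"]

if __name__ == "__main__":
    print(f"fetched {fetched} of {total} URLs ({fetched / total:.1%})")
    print(f"still unfetched: {unfetched}")
```

With the numbers from the thread, the status counts add up to the total, and roughly 7% of the known URLs have been fetched, which matches the observation that most pages are not yet in the index.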
