If you haven't already, you need to activate the index-more and
query-more plugins in nutch-site.xml.

You can also check the "explain" link on the search results page:
"lang" will be missing there if you haven't activated the index-more
and query-more plugins.
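
A minimal sketch of what that could look like in nutch-site.xml, inside the
<configuration> element. The plugin.includes property is the standard place
for this, but the base plugin list shown here is an assumption: start from
whatever your nutch-default.xml already lists and only add the new entries.
As far as I know the "lang" field itself comes from the language-identifier
plugin (that is where LanguageIndexingFilter and LanguageQueryFilter live),
so the sketch includes it as well.

```xml
<!-- Sketch only: the base list is assumed, copy yours from nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- added here: index-more, query-more, language-identifier -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)|language-identifier|summary-basic|scoring-opic</value>
</property>
```

Pages that were indexed before the change will most likely need to be
re-indexed before the new fields show up.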

Cheers

On 1/31/07, Nes Yarug <[EMAIL PROTECTED]> wrote:
> Thank you everyone for your replies.
>
> I have implemented the recrawl script from
> http://wiki.apache.org/nutch/IntranetRecrawl and it has been running for
> over 12 hours, so I guess it will index many more pages.
>
> That leaves the question about language-specific search. I have tried adding
> the lang: clause to my search query by appending lang:en, but that is not
> returning any results (as if lang:en became part of the actual query).
> The URL then looks like this:
> search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en
>
> Has anyone used a language-specific search before? Do I need to add a new
> (hidden) input field on the search form to specify the language instead of
> appending it to the query? That would be my preference anyway, as I want the
> language-specific search to be transparent to the user.
>
> Again, many thanks for any replies,
> Nes
>
> On 1/30/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
> >
> > Nes Yarug wrote:
> > > Hi all,
> > >
> > > I'm new to Nutch and I have a few questions that I hope to get some
> > > answers to. Thanks in advance for any replies.
> > >
> > > I want to use Nutch to index a web site I'm maintaining. I've followed the
> > > tutorial for intranet crawling and used a list of links (17420 links to
> > > 8710 pages, each page has two unique links) from my site to crawl initially.
> > Actually, you don't need to provide a full list of links to Nutch. You
> > can let it discover links as it crawls your site, and constrain them
> > using crawl-urlfilter.txt and regex-urlfilter.txt.
> > > The
> > > command I used was:
> > >
> > > bin/nutch crawl urls -dir crawl -depth 20 -topN 100
> > >
> > > The crawl completed, but when I tested the search I could tell that it
> > > has not indexed a lot of pages. From what I understand of the following
> > > output, it only indexed 1527 of 21378 pages:
> > >
> > > CrawlDb statistics start: crawl/crawldb
> > > Statistics for CrawlDb: crawl/crawldb
> > > TOTAL urls:     21378
> > > retry 0:        20878
> > > retry 1:        487
> > > retry 2:        10
> > > retry 3:        3
> > > min score:      0.014
> > > avg score:      84.405266
> > > max score:      37106.03
> > > status 1 (DB_unfetched):        19848
> > > status 2 (DB_fetched):  1527
> > > status 3 (DB_gone):     3
> > > CrawlDb statistics: done
> > >
> > >
> > > Now my questions:
> > >
> > > 1) Will Nutch automatically continue to index the rest of the URLs even
> > > though the initial crawl finished (through some internal scheduler of
> > > some sort)?
> > You will need to refetch, or better: increase the depth, until "all your
> > pages" are fetched.
> > >
> > > 2) All of my site's pages at the moment are contained in two languages
> > > (each page has exactly two languages, the lang attribute on the html tag
> > > of each page contains the language identifier). When searching, is there
> > > a way to only return pages in a specific language? I know the Nutch UI is
> > > localised, but it will still return pages in English if my UI language is
> > > German, for example. I want it to return German pages only
> > > (<html lang="de">) when searching through the German UI. Is that possible?
> > Try using "lang:" in your query. I'm not sure it's working, though...
> > From the javadoc: "LanguageQueryFilter.java should handle "lang:"
> > query clauses, causing them to search the "lang" field indexed by
> > LanguageIdentifier" (see also LanguageIndexingFilter.java).
> >
> > HTH,
> > Renaud
> >
> >
> > --
> > renaud richardet                           +1 617 230 9112
> > renaud <at> oslutions.com         http://www.oslutions.com
> >
> >
>
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
