Re: New to Nutch, a few questions

Zaheed Haque Wed, 31 Jan 2007 03:25:53 -0800

If you haven't done so yet, you need to activate the index-more and
query-more plugins in nutch-site.xml.
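
A rough illustration of what "activating" means here: the plugin.includes
property in nutch-site.xml is a regular expression over plugin ids, and, as
far as I know, a plugin is loaded only when its whole id matches. The Python
sketch below only tests a candidate value. The value itself is a hypothetical
example, not a shipped default, and including language-identifier is an
assumption, based on that being the plugin that holds the
LanguageIndexingFilter and LanguageQueryFilter classes mentioned in the
quoted reply further down.

```python
import re

# Hypothetical plugin.includes value for nutch-site.xml. The base list is an
# assumption for illustration; adjust it to match your own installation.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html)"
    "|index-(basic|more)|query-(basic|site|url|more)"
    "|language-identifier|summary-basic|scoring-opic"
)


def enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes pattern."""
    return re.fullmatch(includes, plugin_id) is not None


for plugin in ("index-more", "query-more", "language-identifier"):
    print(plugin, enabled(plugin))
```

If a plugin you expect comes back False, the pattern in nutch-site.xml is the
first thing to check. Pages fetched before the change still need to be
re-indexed before a new field can show up.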


You can also check the "explain" link on the search results page; you
will see that "lang" is missing if you haven't activated the index-more
and query-more plugins.

Cheers

On 1/31/07, Nes Yarug <[EMAIL PROTECTED]> wrote:

Thank you everyone for your replies.

I have implemented the recrawl script from
http://wiki.apache.org/nutch/IntranetRecrawl and it has been running for
over 12 hours, so I guess it will index many more pages.
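
For context on why the first crawl stopped so early: the crawl command quoted
below used -depth 20 and -topN 100, and -topN caps the number of pages fetched
in each round. A quick Python check of that arithmetic, using only the
numbers from the quoted CrawlDb statistics (the function name is made up for
this sketch):

```python
import math


def max_fetched(depth, top_n):
    """Upper bound on pages fetched: at most top_n per round, depth rounds."""
    return depth * top_n


total_urls = 21378  # "TOTAL urls" in the quoted CrawlDb statistics
fetched = 1527      # "status 2 (DB_fetched)" in the same output

budget = max_fetched(depth=20, top_n=100)
print(budget)                         # 2000
print(fetched / total_urls)           # fraction of known URLs fetched so far
print(math.ceil(total_urls / 100))    # rounds needed at -topN 100
```

So the first crawl could never have fetched more than 2000 pages, which fits
the 1527 reported, roughly 7% of the known URLs.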

That leaves the question of language-specific search. I have tried adding the
lang: clause to my search query by appending lang:en, but that does not
return any results (as if lang:en became part of the actual query text).
The URL then looks like this:
search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en
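
To rule out the encoding as the culprit, here is a small Python sketch that
rebuilds that URL and decodes it again. The helper name build_search_url is
made up for this example; only the parameter names come from the URL above.

```python
from urllib.parse import parse_qs, urlencode


def build_search_url(terms, lang, hits_per_page=10):
    """Append a lang: clause to the user's terms and build the search.jsp URL."""
    query = f"{terms} lang:{lang}"
    params = {"query": query, "hitsPerPage": hits_per_page, "lang": lang}
    return "search.jsp?" + urlencode(params)


url = build_search_url("help", "en")
print(url)

# Decode the query string again to see what the search page receives.
decoded = parse_qs(url.split("?", 1)[1])
print(decoded["query"][0])
```

The %3A is just the percent-encoded colon, so the server still receives
"help lang:en". Whether that clause is then treated as a field filter depends
on the query plugins that are active, not on the URL.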

Has anyone used a language-specific search before? Do I need to add a new
(hidden) input field to the search form to specify the language instead of
appending it to the query? That would be my preference anyway, as I want the
language-specific search to be transparent to the user.

Again, many thanks for any replies,
Nes

On 1/30/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>
> Nes Yarug wrote:
> > Hi all,
> >
> > I'm new to Nutch and I have a few questions that I hope to get some
> > answers to. Thanks in advance for any replies.
> >
> > I want to use Nutch to index a web site I'm maintaining. I've followed
> > the tutorial for intranet crawling and used a list of links (17420 links
> > to 8710 pages, each page has two unique links) from my site to crawl
> > initially.
> Actually, you don't need to provide a full list of links to Nutch. You
> can let it discover links as it crawls your site, and constrain them
> using crawl-urlfilter.txt and regex-urlfilter.txt.
> > The command I used was:
> >
> > bin/nutch crawl urls -dir crawl -depth 20 -topN 100
> >
> > The crawl completed, but when I tested the search I could see that it
> > has not indexed a lot of pages. From what I understand of the following
> > output, it only indexed 1527 of 21378 pages:
> >
> > CrawlDb statistics start: crawl/crawldb
> > Statistics for CrawlDb: crawl/crawldb
> > TOTAL urls:     21378
> > retry 0:        20878
> > retry 1:        487
> > retry 2:        10
> > retry 3:        3
> > min score:      0.014
> > avg score:      84.405266
> > max score:      37106.03
> > status 1 (DB_unfetched):        19848
> > status 2 (DB_fetched):  1527
> > status 3 (DB_gone):     3
> > CrawlDb statistics: done
> >
> >
> > Now my questions:
> >
> > 1) Will Nutch automatically continue to index the rest of the URLs even
> > though the initial crawl finished (through some internal scheduler of
> > some sort)?
> You will need to refetch, or better: increase the depth, until "all your
> pages" are fetched.
> >
> > 2) All of my site's pages at the moment are contained in two languages
> > (each page has exactly two languages, the lang attribute on the html tag
> > of each page contains the language identifier). When searching, is there
> > a way to only return pages in a specific language? I know the Nutch UI is
> > localised, but it will still return pages in English if my UI language is
> > German, for example. I want it to return German pages only
> > (<html lang="de">) when searching through the German UI. Is that
> > possible?
> Try using "lang:" in your query; I'm not sure it works, though.
> From the javadoc, LanguageQueryFilter.java should handle "lang:" query
> clauses, causing them to search the "lang" field indexed by
> LanguageIdentifier (see also LanguageIndexingFilter.java).
>
> HTH,
> Renaud
>
>
> --
> renaud richardet                           +1 617 230 9112
> renaud <at> oslutions.com         http://www.oslutions.com
>
>
