Re: [Nutch-general] New to Nutch, a few questions

Nes Yarug Thu, 01 Feb 2007 03:49:39 -0800

Okay, thanks for that. I have updated my configuration and I will now
re-index the site. I'll let you know how it goes.


Many thanks,
Nes

On 1/31/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:


As Zaheed pointed out, "You need to activate index-more and query-more
plugin in nutch-site.xml"

So, copy the "plugin.includes" entry from nutch-default.xml, add
index-more and query-lang, and insert it in your nutch-site.xml. You
should have something like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)|query-(basic|site|url|lang)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
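
Since the description says the value is a regular expression over plugin directory names, a quick way to sanity-check it is to test names against it. The following is only a rough sketch in Python, not how Nutch itself does the matching: it assumes the whole directory name has to match, and the candidate names are illustrative guesses, so compare them with what is actually in your plugins/ directory.

```python
import re

# The plugin.includes value exactly as posted above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html|js)|"
    "index-(basic|more)|query-(basic|site|url|lang)|"
    "summary-basic|scoring-opic"
)


def is_included(plugin_dir: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole directory name matches the pattern.

    Assumption: a full match is required, not a substring match.
    """
    return re.fullmatch(pattern, plugin_dir) is not None


# Hypothetical directory names, for illustration only.
candidates = [
    "index-basic",
    "index-more",
    "query-lang",
    "query-more",
    "language-identifier",
]
included = {name: is_included(name) for name in candidates}

if __name__ == "__main__":
    for name, ok in included.items():
        print(f"{name:22} {'included' if ok else 'excluded'}")
```

Under these assumptions the value above lets index-more and query-lang through but not query-more, so it is worth checking the names against your own plugin directories before re-indexing.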

HTH,
Renaud


Nes Yarug wrote:
> Oops, my previous post should read "I have NOT explicitly activated
> those plugins"
>
> On 1/31/07, Nes Yarug <[EMAIL PROTECTED]> wrote:
>>
>> I have explicitly activated those plugins. Could you tell me how to do
>> that with an example as I looked through conf/nutch-default.xml and
>> couldn't find any references to it. I'm using 0.8.1 by the way. They
>> are enabled in the build, I guess, as default.properties lists them:
>>
>> #
>> # Indexing Filter Plugins
>> #
>> plugins.index=\
>>    org.apache.nutch.indexer.basic*:\
>>    org.apache.nutch.indexer.more*
>>
>> #
>> # Query Filter Plugins
>> #
>> plugins.query=\
>>    org.apache.nutch.searcher.basic*:\
>>    org.apache.nutch.searcher.more*:\
>>    org.apache.nutch.searcher.site*:\
>>    org.apache.nutch.searcher.url*
>>
>> Many thanks,
>> Nes
>>
>> On 1/31/07, Zaheed Haque <[EMAIL PROTECTED]> wrote:
>> >
>> > If you haven't yet, you need to activate the index-more and
>> > query-more plugins in nutch-site.xml.
>> >
>> > You can also check the "explain" link on the search results page:
>> > you will see that "lang" is missing if you haven't activated the
>> > index-more and query-more plugins.
>> >
>> > Cheers
>> >
>> > On 1/31/07, Nes Yarug <[EMAIL PROTECTED]> wrote:
>> > > Thank you everyone for your replies.
>> > >
>> > > I have implemented the recrawl script from
>> > > http://wiki.apache.org/nutch/IntranetRecrawl and it has been
>> > > running for over 12 hours, so I guess it will index many more pages.
>> > >
>> > > That leaves the question about language-specific search. I have
>> > > tried adding the lang: clause to my search query by appending
>> > > lang:en, but that is not returning any results (as if lang:en
>> > > became part of the actual query). The URL then looks like this:
>> > > search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en
>> > >
>> > > Has anyone used a language-specific search before? Do I need to
>> > > add a new (hidden) input field on the search form to specify the
>> > > language instead of appending it to the query? That would be my
>> > > preference anyway, as I want the language-specific search to be
>> > > transparent to the user.
>> > >
>> > > Again, many thanks for any replies,
>> > > Nes
>> > >
>> > > On 1/30/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>> > > >
>> > > > Nes Yarug wrote:
>> > > > > Hi all,
>> > > > >
>> > > > > I'm new to Nutch and I have a few questions that I hope to get
>> > > > > some answers to. Thanks in advance for any replies.
>> > > > >
>> > > > > I want to use Nutch to index a web site I'm maintaining. I've
>> > > > > followed the tutorial for intranet crawling and used a list of
>> > > > > links (17420 links to 8710 pages, each page has two unique
>> > > > > links) from my site to crawl initially.
>> > > > Actually, you don't need to provide a full list of links to
>> > > > Nutch. You can let it discover links as it crawls your site, and
>> > > > constrain them using crawl-urlfilter.txt and regex-urlfilter.txt.
>> > > > > The command I used was:
>> > > > >
>> > > > > bin/nutch crawl urls -dir crawl -depth 20 -topN 100
>> > > > >
>> > > > > The crawl completed, but when I tested the search I could see
>> > > > > that it has not indexed a lot of pages. From the following
>> > > > > output I understand that it only indexed 1527 of 21378 pages:
>> > > > >
>> > > > > CrawlDb statistics start: crawl/crawldb
>> > > > > Statistics for CrawlDb: crawl/crawldb
>> > > > > TOTAL urls:     21378
>> > > > > retry 0:        20878
>> > > > > retry 1:        487
>> > > > > retry 2:        10
>> > > > > retry 3:        3
>> > > > > min score:      0.014
>> > > > > avg score:       84.405266
>> > > > > max score:      37106.03
>> > > > > status 1 (DB_unfetched):        19848
>> > > > > status 2 (DB_fetched):  1527
>> > > > > status 3 (DB_gone):     3
>> > > > > CrawlDb statistics: done
>> > > > >
>> > > > >
>> > > > > Now my questions:
>> > > > >
>> > > > > 1) Will Nutch automatically continue to index the rest of the
>> > > > > URLs even though the initial crawl finished (through an
>> > > > > internal scheduler of some sort)?
>> > > > You will need to refetch, or better: increase the depth, until
>> > > > "all your pages" are fetched.
>> > > > >
>> > > > > 2) All of my site's pages at the moment are contained in two
>> > > > > languages (each page has exactly two languages, the lang
>> > > > > attribute on the html tag of each page contains the language
>> > > > > identifier). When searching, is there a way to only return
>> > > > > pages in a specific language? I know the Nutch UI is localised,
>> > > > > but it will still return pages in English if my UI language is
>> > > > > German, for example. I want it to return German pages only
>> > > > > (<html lang="de">) when searching through the German UI. Is
>> > > > > that possible?
>> > > > Try using "lang:" in your query; I'm not sure it's working,
>> > > > though... From the javadoc: "LanguageQueryFilter.java should
>> > > > handle "lang:" query clauses, causing them to search the "lang"
>> > > > field indexed by LanguageIdentifier" (see also
>> > > > LanguageIndexingFilter.java).
>> > > >
>> > > > HTH,
>> > > > Renaud
>> > > >
>> > > >
>> > > > --
>> > > > renaud richardet                           +1 617 230 9112
>> > > > renaud <at> oslutions.com         http://www.oslutions.com
>> > > >
>> > > >
>> > >
>> > >
>> >
>>
>>
>
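
On the crawl numbers quoted above: -topN limits how many URLs are selected for fetching in each round, and -depth is the number of rounds, so "-depth 20 -topN 100" can fetch at most about 2000 pages however many URLs the crawldb knows about. Here is a small Python sketch of that arithmetic using the figures from the thread. The function and variable names are made up for illustration, and the round count is only a lower bound because it ignores retries and newly discovered links.

```python
def max_fetched(depth: int, top_n: int) -> int:
    """Upper bound on pages fetched: at most top_n per round, depth rounds."""
    return depth * top_n


# Figures quoted in the thread.
depth, top_n = 20, 100      # from: bin/nutch crawl urls -dir crawl -depth 20 -topN 100
total_urls = 21378          # "TOTAL urls" in the CrawlDb statistics
fetched = 1527              # "status 2 (DB_fetched)"

budget = max_fetched(depth, top_n)

# Share of known URLs that were actually fetched.
coverage = fetched / total_urls

# Lower bound on rounds needed at this topN (ceiling division).
rounds_needed = -(-total_urls // top_n)

if __name__ == "__main__":
    print(f"fetch budget:  {budget} pages")
    print(f"coverage:      {coverage:.1%} of known URLs")
    print(f"rounds needed: at least {rounds_needed} at topN={top_n}")
```

So the 1527 fetched pages are within the budget of the original command, and raising -topN (or dropping it) matters at least as much as raising -depth if the goal is to cover the whole site.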

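
On keeping the language filter out of the user's sight: one option is to leave the search box alone and append the lang: clause when the request URL is built. A minimal Python sketch of that idea follows. The parameter names (query, hitsPerPage, lang) are copied from the URL quoted in the thread, the function name is hypothetical, and whether the clause actually filters results still depends on the language query filter being active in your Nutch configuration.

```python
from urllib.parse import urlencode


def build_search_url(user_query: str, lang: str, hits_per_page: int = 10) -> str:
    """Append a lang: clause to the user's query and build the search URL.

    Hypothetical helper: it only reproduces the URL shape quoted in the
    thread and makes no claim about how search.jsp interprets it.
    """
    query = f"{user_query} lang:{lang}"
    params = {"query": query, "hitsPerPage": hits_per_page, "lang": lang}
    # urlencode escapes the space as "+" and the colon as "%3A".
    return "search.jsp?" + urlencode(params)


url = build_search_url("help", "en")

if __name__ == "__main__":
    print(url)
```

For the query "help" and language "en" this produces the same URL as the one quoted above, so the encoding of the colon as %3A is expected and is not what stops the results from coming back.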

