Re: [Nutch-dev] Feature request - pluggable Analyzer

David Wallace Wed, 13 Apr 2005 14:37:09 -0700

OK Jack, but the details of my analyser aren't particularly exciting.  

I need to index a site that has a mixture of documents in English and
Te Reo Maori (indigenous language of New Zealand).  Vowels in Te Reo
Maori are sometimes written with short overlines (also known as
macrons), to indicate a long vowel.  However, some authors omit the
macrons and others use umlauts (two dots) instead of macrons.  People
searching ought to be able to find the documents regardless of whether
the document uses macrons, umlauts or neither; and regardless of whether
they type the search term with or without macrons or umlauts.  However,
when the document excerpts are displayed, I need the macrons to be
included if they're present in the documents.

Therefore, I need the segment to include the original text, with
macrons; but for the index keys (that is, the tokens returned by the
analyser) to have the macrons stripped.  The same stripping is required
when the excerpt is generated, so that the appropriate passages from the
document are included in the excerpt.  It's also necessary to strip
macrons and umlauts out of whatever search term the user enters in the
search interface.
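
As a rough illustration of the stripping step described above, the sketch below removes macrons and umlauts from a string while leaving other accents alone. It assumes a JDK that provides java.text.Normalizer (Java 6 or later); the class name MacronStripper is hypothetical and is not part of Nutch or of the analyser discussed here.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only; not part of Nutch.
final class MacronStripper {
    private MacronStripper() {}

    // Decompose to NFD so a long vowel becomes base letter + combining mark,
    // drop only the combining macron (U+0304) and diaeresis (U+0308),
    // then recompose so any other accents are left as they were.
    static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("[\u0304\u0308]", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }
}
```

The same method would be applied to index tokens, to the text scanned when building excerpts, and to the user's search term, so that all three agree on the stripped form.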

The custom analyser that I have written works by keeping its own
NutchDocumentAnalyzer, passing documents to it, intercepting the token
stream that the NutchDocumentAnalyzer returns, and stripping the macrons
and umlauts out of the tokens.  It's not rocket science and it works
well.  The point is, I need both IndexSegment.java and Summarizer.java
to be modified, so that I can tell them to use my own analyser, instead
of NutchDocumentAnalyzer.  I posted the required change in
IndexSegment.java in my earlier message; I'm quite happy to post the
required change in Summarizer.java, but it's very similar.
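
The wrapping approach described above can be sketched roughly as follows. To keep the example self-contained it does not use the Lucene or Nutch classes: SimpleAnalyzer is a hypothetical stand-in for an analyser (text in, tokens out), and StrippingAnalyzer is a hypothetical name for the wrapper. The real analyser works on the token stream returned by NutchDocumentAnalyzer, which this sketch only imitates with a list of strings.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an analyser; not the Lucene Analyzer class.
interface SimpleAnalyzer {
    List<String> tokens(String text);
}

// Keeps its own delegate analyser, passes text to it, and strips
// macrons and umlauts from every token the delegate returns.
final class StrippingAnalyzer implements SimpleAnalyzer {
    private final SimpleAnalyzer delegate;

    StrippingAnalyzer(SimpleAnalyzer delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : delegate.tokens(text)) {
            out.add(stripMarks(token));
        }
        return out;
    }

    // Removes only the combining macron (U+0304) and diaeresis (U+0308).
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("[\u0304\u0308]", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }
}
```

Because the wrapper only touches the tokens, the original text stored in the segment keeps its macrons, which is what the excerpt display needs.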

The job of an analyser is to turn a swag of text into a list of tokens,
prior to storing them in the index.  I think that any messing around
that you need to do with the tokens should be done in the analyser. 
Indeed, the NutchDocumentAnalyzer class does a whole lot of messing
round with the tokens, most of which I don't understand.  

I like the idea of an AnalyzerFactory too.  It's more elegant than my
solution and I would switch to it if it were available.  It would best
sit in the net.nutch.analysis package, and be similar code-wise
to UrlNormalizerFactory.  The AnalyzerFactory should accept some kind of
flag to indicate whether you want the analyser for indexing or for
making excerpts; as it's possible you'd want to use two different
analysers for these two operations.
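
One possible shape for such a factory is sketched below. Everything in it is an assumption made for illustration: TextAnalyzer stands in for the real analyser type, the configuration is a plain Map rather than NutchConf, the Purpose flag and the "summarizer.document.analyzer" property name are invented, and the code is not modelled on the actual UrlNormalizerFactory source. It caches one analyser per purpose, as Jack suggested.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the analyser type; not the Lucene class.
interface TextAnalyzer {
    String[] analyze(String text);
}

// Trivial default so the sketch runs on its own.
final class WhitespaceTextAnalyzer implements TextAnalyzer {
    @Override
    public String[] analyze(String text) {
        return text.trim().split("\\s+");
    }
}

final class AnalyzerFactorySketch {
    // The flag saying what the analyser will be used for.
    enum Purpose {
        INDEXING("indexer.document.analyzer"),
        SUMMARIZING("summarizer.document.analyzer"); // assumed property name

        final String property;

        Purpose(String property) {
            this.property = property;
        }
    }

    private final Map<String, String> conf;
    private final Map<Purpose, TextAnalyzer> cache = new ConcurrentHashMap<>();

    AnalyzerFactorySketch(Map<String, String> conf) {
        this.conf = conf;
    }

    // Returns the cached analyser for this purpose, creating it on first use
    // from the configured class name (or the default if none is configured).
    TextAnalyzer get(Purpose purpose) {
        return cache.computeIfAbsent(purpose, p -> {
            String className =
                conf.getOrDefault(p.property, WhitespaceTextAnalyzer.class.getName());
            try {
                return (TextAnalyzer) Class.forName(className)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot create analyser " + className, e);
            }
        });
    }
}
```

With this shape, IndexSegment would ask for Purpose.INDEXING and Summarizer for Purpose.SUMMARIZING, and a site could configure the two independently.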

Regards,
David.

Jack Tang wrote:

> David

> Please tell us more about your own Analyzer :)
> First, I think we should establish what NutchDocumentAnalyzer should
> focus on and what it should not (can anyone explain?).
> BTW: I'd like an AnalyzerFactory to maintain/cache all analyzers.

> /Jack 

======= At 2005-04-12, 12:34:37 you wrote: =======

>Hi all,
>I have found a need to do document analysis other than that which is
>provided by the NutchDocumentAnalyzer class.  I have written my own
>Analyzer class, and I need to plug it into the Nutch framework.  What
>I've done is the following, and I'd like to suggest that it be made part
>of the main Nutch development stream.  I don't know what the "correct"
>procedure is for submitting such changes, so please everyone forgive me
>if this list isn't a good place.
>
>In IndexSegment.java, replace the line that creates the IndexWriter
>object with:
>
>String analyzerClass = NutchConf.get("indexer.document.analyzer",
>"net.nutch.analysis.NutchDocumentAnalyzer");
>IndexWriter writer = new IndexWriter(
>            localOutput,
>            (Analyzer) Class.forName( analyzerClass ).newInstance(),
>            true );
>
>Then add an appropriate entry in nutch-site.xml / nutch-default.xml. 
>The default entry would be something like 
>
><property>
>  <name>indexer.document.analyzer</name>
>  <value>net.nutch.analysis.NutchDocumentAnalyzer</value>
>  <description>Class used by IndexSegment to analyze documents</description>
></property>
>
>Hope this can be considered.
>
>Regards,
>David.
