I figured out a workaround in the custom analyzer by doing the following:

// --------------- code block ---------------------
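public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        // Insert a space after every period so the tokenizer sees a word break there.
        TextReader newReader = new StringReader(reader.ReadToEnd().Replace(".", ". "));
        TokenStream result = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, newReader);
        // (filters and return statement as in the analyzer quoted below)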

// -------------end code block -----------------
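
To show what the Replace call does to the text before the tokenizer sees it, here is a small stand-alone sketch. It uses plain .NET only (no Lucene.Net), and the class name ReplaceDemo and the sample strings are just made up for illustration. Note one side effect: the replacement is applied to every period, so decimals and dotted names are broken up as well, which may or may not be what you want.

// --------------- code block ---------------------
using System;
using System.IO;

// Hypothetical demo class, not part of the analyzer itself.
class ReplaceDemo
{
    // Same preprocessing as in the workaround: read everything, add a space after each period.
    static string Preprocess(TextReader reader)
    {
        return reader.ReadToEnd().Replace(".", ". ");
    }

    static void Main()
    {
        Console.WriteLine(Preprocess(new StringReader("autoexec.bat")));  // autoexec. bat
        Console.WriteLine(Preprocess(new StringReader("version 3.14")));  // version 3. 14
        Console.WriteLine(Preprocess(new StringReader("a.b.c")));         // a. b. c
    }
}
// -------------end code block -----------------

Since ReadToEnd() pulls the whole field value into memory and the inserted spaces shift character positions, I would expect token offsets to no longer line up with the original text, which could matter if you use highlighting.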


It seems to work this way.  Thanks again.




On 06/16/2011 11:31 AM, Trevor Watson wrote:
I'm trying to get Lucene.Net to create terms the way that we want it to happen. I'm currently running Lucene.Net 2.9.2.2.

Basically, we want the behavior of the StandardAnalyzer, except that terms should also be divided at a period. The StandardAnalyzer seems to split the two words into separate terms only if the period is followed by whitespace.

So if we index autoexec.bat, it should produce [autoexec] and [bat], not [autoexec.bat].

I was trying to create my own Analyzer that would do it, but could not figure out how.


So far I have a very basic analyzer that uses the StandardTokenizer and 2 filters.

// --------- code block ----------------------

class ExtendedStandardAnalyzer : Analyzer
{
public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        TokenStream result = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        // TokenStream result = new LetterTokenizer(reader); // doesn't work because we want numbers

        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);

        return result;
    }
}
// --------- end code block ------------------


Thanks in advance.
