Re: [Lucene.Net] StandardAnalyzer and lowercase

Trevor Watson Mon, 07 Mar 2011 09:13:26 -0800

Thanks for the response!

We changed our code so we always use a StandardAnalyzer, which as far as I know should always use a LowerCaseFilter when an IndexWriter writes to an index. However, using Luke shows that this isn't the case.


Does the StandardAnalyzer use a LowerCaseFilter?
Should it be stored in the index without capitalization?

Is there a way to force it to make all data lower case without using C#'s ToLower()? Would it be best to write an analyzer that extends StandardAnalyzer to use a ToLower?
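
One way to check this without going through the index at all might be to run the analyzer over a sample string and print the tokens it emits. The following is only a sketch, assuming the attribute-based TokenStream API of Lucene.Net 2.9 (IncrementToken and TermAttribute); the class name AnalyzerCheck is made up for illustration.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

public static class AnalyzerCheck
{
    public static void Main()
    {
        // Same analyzer class and version as in getAnalyzer(), default stop words.
        Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

        // Field name and text are taken from the thread; StandardAnalyzer
        // treats all field names the same way.
        TokenStream stream = analyzer.TokenStream("SelectAll", new StringReader("SelectAll"));
        TermAttribute term = (TermAttribute) stream.AddAttribute(typeof(TermAttribute));

        // Print each token as the analyzer produces it for indexing.
        while (stream.IncrementToken())
        {
            Console.WriteLine(term.Term());
        }
        stream.Close();
    }
}
```

If the printed token is lower case, the analyzer itself is doing its job, and the mixed-case terms seen in Luke are more likely coming from some other code path that writes to the index.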


Thanks in advance.

Trevor

On 03/04/2011 7:12 PM, Digy wrote:

Hi Trevor,

Lucene.Net is intended to be deterministic code :) So "NOT ALWAYS" or
"USUALLY" should mean a bug either in Lucene.Net or in your code. I would
recommend revising your code and using Luke (http://www.getopt.org/luke/) to
inspect your index in order to see what you have in it.

DIGY

PS: Don't try to make searches on an index created with a different
analyzer.
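
To illustrate the point about analyzers, here is a rough sketch, assuming the Lucene.Net 2.9 QueryParser and TermQuery classes, that compares a query run through the analyzer with one built by hand. The class name QueryCheck is hypothetical; the field name and text are borrowed from the original message.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public static class QueryCheck
{
    public static void Main()
    {
        Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

        // Parsed query: the query text is passed through the analyzer first.
        QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "SelectAll", analyzer);
        Query parsed = parser.Parse("SelectAll");

        // Hand-built query: the term text is used exactly as given, with no analysis.
        Query raw = new TermQuery(new Term("SelectAll", "SelectAll"));

        // Compare the term text each query will actually look up.
        Console.WriteLine(parsed.ToString());
        Console.WriteLine(raw.ToString());
    }
}
```

If the two printed queries differ in case, a hand-built TermQuery will not match terms that were lowercased at index time, which can look like the analyzer misbehaving.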


-----Original Message-----
From: Trevor Watson [mailto:trevor.wat...@gmail.com]
Sent: Friday, March 04, 2011 9:04 PM
To: lucene-net-user@lucene.apache.org
Subject: [Lucene.Net] StandardAnalyzer and lowercase

I currently have a project that indexes multiple file formats. There is a
2nd index that I use to keep track of files (because the queries in the
database are too slow, we query an index and use an ID field to get the
stuff out of the database)

However, I've started to run into some issues with the StandardAnalyzer.  We
were using different analyzers at one point, so we moved all creation of
analyzers into this function:

public static Analyzer getAnalyzer()
{
   // Empty stop-word table, so no words are removed as stop words
   Hashtable htStopWords = new Hashtable();
   Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29, htStopWords);

   return analyzer;
}

So all functions should now be using a StandardAnalyzer.

As far as I know, a StandardAnalyzer uses a LowerCaseFilter to
change all strings to lower case, and in some cases that is what happens.
To get all documents in an index, we use a field called SearchAll and store
the word "SearchAll" into the index, then search for that.

Creation of the document to write is done in this function

public Document getFileInfoDoc()
{
    Document doc = new Document();
    doc.Add(new Field("FieldId", this.FieldID, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("SelectAll", "SelectAll", Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("FilePath", this.FilePath, Field.Store.YES, Field.Index.ANALYZED));

   return doc;
}

In one case we call this code

Document doc = getFileInfoDoc();
Analyzer analyzer = getAnalyzer();
indexWriter.UpdateDocument(new Term("FileId", this.FileId.ToString()), doc, analyzer);

This code writes to the indexWriter, but DOES NOT ALWAYS apply the
LowerCaseFilter to the string stored in SelectAll.

To rebuild the index, we DeleteAllDocs from the index and loop through each
file to be stored. For each file we call getFileInfoDoc from above and then
the following 2 lines of code:

Analyzer analyzer = getAnalyzer();
iwCurrent.UpdateDocument(new Term("FileId", iFileID.ToString()), doc, analyzer);

This USUALLY stores the SearchAll field as lower case, but sometimes it
still fails and writes it as upper case.



Is there anything that I am missing in terms of making the LowerCaseFilter
be applied?  I don't particularly want to change the text to lower case in
my code as a 2nd index we use may be having the same issues, but contains
the contents of the file and changing that to lower case may have a major
impact on performance.


Thanks in advance,

Trevor Watson
