Re: [Lucene.Net] StandardAnalyzer and lowercase

Erik Hatcher Mon, 07 Mar 2011 09:24:25 -0800

In Java Lucene, StandardAnalyzer lowercases (I can only speak to Java Lucene 
though).


Stored values are _never_ affected by analysis though.  What goes in is what 
gets stored.  Analysis is a completely different step, and its output lives in 
a different location in the index.
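
To make the stored/indexed split concrete, here is a minimal sketch in plain
Java. It is a toy model only and does not call Lucene: the class
StoredVsIndexed and its analyze() helper are hypothetical names, and the
"analyzer" is assumed to simply split on non-alphanumeric characters and
lowercase each token, which is much cruder than what StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model (NOT Lucene's API): a field keeps its stored value verbatim and,
// separately, the lowercased terms that analysis would put in the index.
class StoredVsIndexed {
    final String stored;             // what retrieving the document would return
    final List<String> indexedTerms; // what a search would be matched against

    StoredVsIndexed(String value) {
        this.stored = value;                // stored as-is, never analyzed
        this.indexedTerms = analyze(value); // analyzed copy used for matching
    }

    // Hypothetical stand-in for an analyzer: split on anything that is not a
    // letter or digit, then lowercase each token.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        StoredVsIndexed field = new StoredVsIndexed("SelectAll");
        System.out.println("stored:  " + field.stored);       // stored:  SelectAll
        System.out.println("indexed: " + field.indexedTerms); // indexed: [selectall]
    }
}
```

In this model a stored value that still shows capitals does not by itself mean
lowercasing was skipped; the terms are the part to inspect.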

        Erik


On Mar 7, 2011, at 12:10 , Trevor Watson wrote:

> Thanks for the response!
> 
> We changed our code so we always use a StandardAnalyzer, which as far as I 
> know should always use a LowerCaseFilter when an IndexWriter writes to an 
> index.  However, using Luke shows that this isn't the case.
> 
> Does the StandardAnalyzer use a LowerCaseFilter?
> Should it be stored in the index without capitalization?
> Is there a way to force it to make all data lower case without using C#'s 
> ToLower()?  Would it be best to write an analyzer that extends 
> StandardAnalyzer to use a ToLower?
> 
> Thanks in advance.
> 
> Trevor
> 
> On 03/04/2011 7:12 PM, Digy wrote:
>> Hi Trevor,
>> 
>> Lucene.Net is intended to be deterministic code :) So "NOT ALWAYS" or
>> "USUALLY" should mean a bug either in Lucene.Net or in your code. I would
>> recommend revising your code and using Luke (http://www.getopt.org/luke/) to
>> inspect your index in order to see what you have in it.
>> 
>> DIGY
>> 
>> PS: Don't try to make searches on an index created with a different
>> analyzer.
>> 
>> 
>> -----Original Message-----
>> From: Trevor Watson [mailto:trevor.wat...@gmail.com]
>> Sent: Friday, March 04, 2011 9:04 PM
>> To: lucene-net-user@lucene.apache.org
>> Subject: [Lucene.Net] StandardAnalyzer and lowercase
>> 
>> I currently have a project that indexes multiple file formats. There is a
>> 2nd index that I use to keep track of files (because the queries in the
>> database are too slow, we query an index and use an ID field to get the
>> stuff out of the database)
>> 
>> However, I've started to run into some issues with the StandardAnalyzer.  We
>> were using different analyzers at one point, so we moved all creation of
>> analyzers to this function
>> 
>> public static Analyzer getAnalyzer()
>> {
>>   Hashtable htStopWords = new Hashtable();
>>   Analyzer analyzer = new
>> StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29, htStopWords);
>> 
>>   return analyzer;
>> }
>> 
>> So all functions should now be using a StandardAnalyzer.
>> 
>> To my knowledge, a StandardAnalyzer uses a LowerCaseFilter to change all
>> strings to lower case, and in some cases that is what happens.
>> To get all documents in an index, we use a field called SearchAll and store
>> the word "SearchAll" into the index, then search for that.
>> 
>> Creation of the document to write is done in this function
>> 
>> public Document getFileInfoDoc()
>> {
>>    Document doc = new Document();
>>    doc.Add(new Field("FieldId", this.FieldID, Field.Store.YES,
>> Field.Index.NOT_ANALYZED));
>>    doc.Add(new Field("SelectAll", "SelectAll", Field.Store.NO,
>> Field.Index.ANALYZED));
>>    doc.Add(new Field("FilePath", this.FilePath, Field.Store.YES,
>> Field.Index.ANALYZED));
>> 
>>   return doc;
>> }
>> 
>> In one case we call this code
>> 
>> Document doc = getFileInfoDoc();
>> Analyzer analyzer = getAnalyzer();
>> indexWriter.UpdateDocument(new Term("FileId", this.FileId.ToString()), doc,
>> analyzer);
>> 
>> This code writes to the indexWriter, but DOES NOT ALWAYS apply the
>> LowerCaseFilter to the string stored in SelectAll.
>> 
>> To rebuild the index, we DeleteAllDocs from the index and loop through each
>> file to be stored, we then call the getFileInfoDoc from above and then call
>> the following 2 lines of code
>> 
>> Analyzer analyzer = getAnalyzer();
>> iwCurrent.UpdateDocument(new Term("FileId", iFileID.ToString()), doc,
>> analyzer);
>> 
>> this USUALLY stores the SearchAll field as lower case, but sometimes it
>> still fails and writes it as upper case.
>> 
>> 
>> 
>> Is there anything that I am missing in terms of making the LowerCaseFilter
>> be applied?  I don't particularly want to change the text to lower case in
>> my code as a 2nd index we use may be having the same issues, but contains
>> the contents of the file and changing that to lower case may have a major
>> impact on performance.
>> 
>> 
>> Thanks in advance,
>> 
>> Trevor Watson
>> 
>
