Re: "Umlaute" getting lost

Simon Willnauer Mon, 25 Apr 2011 03:13:42 -0700

On Sun, Apr 24, 2011 at 8:30 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> On Apr 21, 2011, at 5:02 PM, Clemens Wyss wrote:
>
>> I keep my search terms in a dedicated RAMDirectory (the termIndex).
>> In there I palce all the term of my real index. When putting the terms into 
>> the
>> termIndex I can still see [using the debugger] the Umlaute (äöü). 
>> Unfortunately when searching the
>> termIndex the documents no more contain these Umlaute.
>>
>> Populating the termIndex:
>> termIndex = new RAMDirectory();
>> IndexWriterConfig config = new IndexWriterConfig( Version.LUCENE_31, new 
>> TermAnalyzer( locale ) );
>> termIndexWriter = new IndexWriter( termIndex, config );
>> TermEnum tEnum = realIndexReader.terms();
>> while ( tEnum.next() )
>> {
>>       Term t = tEnum.term();
>>       String termText = t.text();
>>       Document termDocument = new Document();
>>       Field field = new Field( FIELDNAME_TERM, termText, Field.Store.YES, 
>> Field.Index.ANALYZED );
>>       termDocument.add( field );
>>       // and add term into the index
>>       termIndexWriter.addDocument( termDocument );
>> }
>> termIndexWriter.commit();
>> termIndexWriter.optimize();
>> termIndexWriter.close();
>>
>> termIndexReader = IndexReader.open( termIndex, true );
>> ---------- searching terms
>> Query q = fuzzy ? new FuzzyQuery( new Term( FIELDNAME_TERM, 
>> termFilter.toLowerCase() ) ) :
>>                                       new WildcardQuery( new Term( 
>> FIELDNAME_TERM, "*" + termFilter.toLowerCase() + "*" ) );
>> TopDocs topDocs = new IndexSearcher( getTermIndexReader() ).search( q, 100 );
>> for ( ScoreDoc hit : topDocs.scoreDocs )
>> {
>>       Document doc = getTermIndexReader().document( hit.doc );
>>       String indexTerm = doc.get( FIELDNAME_TERM );
>>       if ( !returnValue.contains( indexTerm  ) )
>>       {
>>               returnValue.add( indexTerm );
>>       }
>> }
>> ----------
>> The TermAbnalyzer is the same analyzer as the main index analyzer with the 
>> exception that a LowerCaseFilter is applied.
>
> What is the Analyzer for the Main Index?  What is the tokenizer and token 
> filters used?


in other words, can you provide what TermAnalyzer is composed of?


simon
>
> Out of curiosity, what is the problem you are trying to solve?
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: "Umlaute" getting lost

Reply via email to