Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Otis Gospodnetic Tue, 25 Mar 2008 15:02:26 -0700

Jay,

Have a look at Lucene contrib; it's all there, including tests.  The n-gram 
filter takes a token such as "foobar" and chops it up into n-grams (e.g. 
foobar -> fo oo ob ba ar is the set of bi-grams).  You can specify a single 
n-gram size, or a min and max n-gram size.
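
To make the chopping concrete, here is a minimal sketch in plain Java. It is 
not the Lucene filter itself: the class name NGramSketch and the method 
ngrams are made up for illustration, and the output order (all grams of one 
size before the next size) is an assumption of this sketch only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: character n-grams of one token.
final class NGramSketch {

    // Returns every substring of length minGram..maxGram, sliding one
    // character at a time. A token shorter than minGram yields no grams.
    static List<String> ngrams(String token, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<String>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= token.length(); start++) {
                grams.add(token.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Bi-grams only: min and max size are both 2.
        System.out.println(ngrams("foobar", 2, 2)); // [fo, oo, ob, ba, ar]
    }
}
```

Setting min and max to different values (say 2 and 3) simply adds the 
tri-grams foo oob oba bar after the bi-grams.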


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Jay <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 1:32:24 PM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Hi Uwe,

I am curious: what is NGramStemFilter? Is it a combination of Porter 
stemming and word n-gram identification?

Thanks!

Jay

Uwe Goetzke wrote:
> Hi Ivan,
> No, we do not use StandardAnalyzer or StandardTokenizer.
> 
> Most data is processed by:
>     fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
>     // ISOLatin1AccentFilter modified so that ö -> oe
>     result = new ISOLatin2AccentFilter(result);
>     result = new org.apache.lucene.analysis.LowerCaseFilter(result);
>     // just a bigram tokenizer
>     result = new org.apache.lucene.analysis.NGramStemFilter(result,2);
> 
> We use our own query parser. The bigrams are searched with a tolerant phrase 
> query that scores, within a document, the largest clusters of bigrams 
> covering the phrase token. 
> 
> Best Regards
> 
> Uwe
> 
> -----Original Message-----
> From: Ivan Vasilev [mailto:[EMAIL PROTECTED] 
> Sent: Friday, 21 March 2008 16:25
> To: java-user@lucene.apache.org
> Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
> 
> Hi Uwe,
> 
> Could you tell us which Analyzer you were using when you measured such a 
> big indexing speedup?
> If you use StandardAnalyzer (which uses StandardTokenizer), that may be the 
> reason. See the second-to-last report in the thread "Indexing 
> Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter, Jake 
> Mannix, this is because StandardTokenizer now uses StandardTokenizerImpl, 
> which is generated by JFlex instead of JavaCC.
> I am asking because I noticed a great speedup in adding documents to the 
> index in our system. We time this step in debug mode. NOW 
> THEY ARE ADDED 5 TIMES FASTER!!!
> But at the same time the total indexing process in our case has improved 
> by only about 8%. As our system is very big and complex, I am wondering 
> whether the whole indexing process really got so remarkably faster and our 
> system causes the slowdown, or whether Lucene does some optimizations on 
> the index, merges or something else, and this is why the total indexing 
> process is not correspondingly faster.
> 
> Best Regards,
> Ivan
> 
> 
> 
> Uwe Goetzke wrote:
>> This week I switched the Lucene library version on one customer system.
>> The indexing time went down from 46m32s to 16m20s for the complete task,
>> including optimisation. Great job!
>> We index product catalogs from several suppliers; in this case around
>> 56,000 product groups and 360,000 products including descriptions were
>> indexed.
>>
>> Regards
>>
>> Uwe
>>
>>
>>
>> -----------------------------------------------------------------------
>> Healy Hudson GmbH - D-55252 Mainz Kastel
>> Managing Director Christian Konhäuser - Amtsgericht Wiesbaden HRB 12076
>>
>> This email is confidential. If you are not the intended recipient, you must 
>> not disclose or use this information contained in it. If you have received 
>> this email in error please tell us immediately by return email and delete 
>> the document.
>>
>>
> 
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




