AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Uwe Goetzke Tue, 25 Mar 2008 09:41:40 -0700

Jake,

With the bigram-based index we gave up the struggle to find a well-working 
language-based index.
We had implemented Soundex (and other "sounds-like" algorithms) and 
hyphenation, but failed to deliver search results that we could explain to 
users ("why is this ranked higher?" and so on). One reason may be that product 
descriptions contain a lot of abbreviations.


The index size grew by about 30%.
The search performance seems a bit slower, but I have no concrete figures. The 
evaluation for one document is a bit more complex than a phrase query. 
One reason, of course, is that there are more terms to evaluate. Nevertheless, 
it is still quite good.
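
As an illustration of the "more terms" point, here is a minimal sketch of how
one word expands into several bigram terms. It is plain Java and does not use
the Lucene API or the filters from the system described in this thread: the
class BigramTerms, its method names and the simple umlaut folding are
assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BigramTerms {

    // Hypothetical stand-in for an accent filter: folds German umlauts
    // into two-letter forms (for example o-umlaut becomes "oe").
    static String foldUmlauts(String s) {
        return s.replace("\u00e4", "ae")   // a-umlaut
                .replace("\u00f6", "oe")   // o-umlaut
                .replace("\u00fc", "ue")   // u-umlaut
                .replace("\u00df", "ss");  // sharp s
    }

    // Splits one token into overlapping character bigrams.
    // A one-character token is kept as a single term.
    static List<String> bigrams(String token) {
        List<String> out = new ArrayList<>();
        if (token.length() < 2) {
            if (!token.isEmpty()) {
                out.add(token);
            }
            return out;
        }
        for (int i = 0; i + 2 <= token.length(); i++) {
            out.add(token.substring(i, i + 2));
        }
        return out;
    }

    // Whitespace split, lowercase, umlaut folding, then bigrams.
    // Lowercasing comes first here only so that foldUmlauts needs
    // to handle lowercase letters alone.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String normalized = foldUmlauts(word.toLowerCase(Locale.ROOT));
            terms.addAll(bigrams(normalized));
        }
        return terms;
    }

    public static void main(String[] args) {
        // One 7-letter word becomes 6 index terms.
        System.out.println(analyze("Weekday")); // prints [we, ee, ek, kd, da, ay]
    }
}
```

A word of n characters yields n-1 bigrams in this sketch, so both the index
and each query carry more terms than with whole-word tokens.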

The search relevance improved tremendously. Missing characters, switched 
letters and partial word fragments are no longer real problems (depending, of 
course, on the length of the search word).
The search term "weekday" also finds "day of the week", and "disabigaute" 
finds "disambiguate".
The algorithms I developed might not fit other domains, but for multi-language 
product catalogs they work quite well for us. So far...
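
The two examples above can be reproduced with a very rough sketch of
bigram overlap. This is not the scoring used in the system described here
(that is a custom tolerant phrase query); it only measures which fraction of
the query's distinct bigrams also occur in the candidate text. The class
BigramOverlap and its methods are hypothetical names for this example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class BigramOverlap {

    // Collects the distinct character bigrams of every
    // whitespace-separated word in the text (lowercased).
    static Set<String> bigramSet(String text) {
        Set<String> grams = new HashSet<>();
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            for (int i = 0; i + 2 <= word.length(); i++) {
                grams.add(word.substring(i, i + 2));
            }
        }
        return grams;
    }

    // Fraction of the query's distinct bigrams that also occur in the
    // candidate. Returns 0.0 when the query has no bigrams at all.
    static double coverage(String query, String candidate) {
        Set<String> q = bigramSet(query);
        if (q.isEmpty()) {
            return 0.0;
        }
        Set<String> c = bigramSet(candidate);
        int shared = 0;
        for (String gram : q) {
            if (c.contains(gram)) {
                shared++;
            }
        }
        return (double) shared / q.size();
    }

    public static void main(String[] args) {
        // 5 of the 6 bigrams of "weekday" occur in "day of the week".
        System.out.println(coverage("weekday", "day of the week"));
        // 6 of the 10 bigrams of the misspelling occur in the correct word.
        System.out.println(coverage("disabigaute", "disambiguate"));
    }
}
```

A whole-word index would find neither pair, because no complete term matches.
With bigrams, most of the query's terms still hit, which is why typos and
reordered word fragments remain findable.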


Regards Uwe

-----Original Message-----
From: Jake Mannix [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 25 March 2008 17:13
To: java-user@lucene.apache.org
Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1

Uwe,
  This is a little off the thread topic, but I was wondering how your
search relevance and search performance have fared with this
bigram-based index.  Is it significantly better than before you used
the NGramAnalyzer?
   -jake



On 3/24/08, Uwe Goetzke <[EMAIL PROTECTED]> wrote:
> Hi Ivan,
> No, we do not use StandardAnalyzer or StandardTokenizer.
>
> Most data is processed by
>       fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
>       // ISOLatin1AccentFilter modified so that ö -> oe
>       result = new ISOLatin2AccentFilter(result);
>       result = new org.apache.lucene.analysis.LowerCaseFilter(result);
>       // just a bigram tokenizer
>       result = new org.apache.lucene.analysis.NGramStemFilter(result,2);
>
> We use our own query parser. The bigrams are searched with a tolerant phrase
> query that scores, within a document, the largest clusters of bigrams
> covering the phrase token.
>
> Best Regards
>
> Uwe
>
> -----Original Message-----
> From: Ivan Vasilev [mailto:[EMAIL PROTECTED]
> Sent: Friday, 21 March 2008 16:25
> To: java-user@lucene.apache.org
> Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
>
> Hi Uwe,
>
> Could you tell us which Analyzer you used when you noticed such a big
> indexing speedup?
> If you use StandardAnalyzer (which uses StandardTokenizer), that may be the
> reason. See the second-to-last report in the thread "Indexing
> Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter, Jake
> Mannix, this is because StandardTokenizer now uses StandardTokenizerImpl,
> which is generated by JFlex instead of JavaCC.
> I am asking because I noticed a great speedup in adding documents to the
> index in our system. We time this step in debug mode. NOW
> THEY ARE ADDED 5 TIMES FASTER!!!
> But at the same time the total indexing process in our case has improved
> by only about 8%. As our system is very big and complex, I am
> wondering whether the whole indexing process really is remarkably faster
> and our system causes the slowdown, or whether Lucene performs
> some optimizations on the index, merges or something else, and this is
> the reason the total indexing process is not correspondingly faster.
>
> Best Regards,
> Ivan
>
>
>
> Uwe Goetzke wrote:
> > This week I switched the Lucene library version on one customer system.
> > The indexing time went down from 46m32s to 16m20s for the complete task,
> > including optimisation. Great job!
> > We index product catalogs from several suppliers; in this case around
> > 56,000 product groups and 360,000 products, including descriptions, were
> > indexed.
> >
> > Regards
> >
> > Uwe
> >
> >
> >
> > -----------------------------------------------------------------------
> > Healy Hudson GmbH - D-55252 Mainz Kastel
> > Geschäftsführer Christian Konhäuser - Amtsgericht Wiesbaden HRB 12076
> >
> > This email is confidential. If you are not the intended recipient, you
> > must not disclose or use the information contained in it. If you have
> > received this email in error please tell us immediately by return email
> > and delete the document.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>


