Jay,

Have a look at Lucene contrib; it's all there, including tests. The n-gram filter takes a token such as "foobar" and chops it up into n-grams (e.g. foobar -> fo oo ob ba ar is the set of bi-grams). You can specify a single n-gram size, or a min and max n-gram size.
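To make the chopping concrete, here is a minimal plain-Java sketch of character n-gram generation with a min and max size. It is not the Lucene filter itself: the class name NGramSketch and the method ngrams are hypothetical, and the output order (all grams of the smallest size first) is just a choice made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

    // Hypothetical helper, not part of Lucene: returns every character
    // n-gram of the token whose length is between minGram and maxGram.
    public static List<String> ngrams(String token, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        // Outer loop over gram sizes, inner loop slides a window over the token.
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= token.length(); start++) {
                grams.add(token.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Bi-grams only: prints [fo, oo, ob, ba, ar]
        System.out.println(ngrams("foobar", 2, 2));
        // Bi-grams followed by tri-grams
        System.out.println(ngrams("foobar", 2, 3));
    }
}
```

A token shorter than minGram yields an empty list in this sketch; whether a real filter drops such tokens or passes them through is something to check against the filter's own tests.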
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Jay <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 1:32:24 PM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Hi Uwe,

I am curious what NGramStemFilter is. Is it a combination of Porter stemming and word n-gram identification?

Thanks!

Jay

Uwe Goetzke wrote:
> Hi Ivan,
>
> No, we do not use StandardAnalyzer or StandardTokenizer.
>
> Most data is processed by
>
>   fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
>   result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
>   result = new org.apache.lucene.analysis.LowerCaseFilter(result);
>   result = new org.apache.lucene.analysis.NGramStemFilter(result,2); // just a bigram tokenizer
>
> We use our own query parser. The bigrams are searched with a tolerant phrase
> query that scores, within a document, the largest bigram clusters covering
> the phrase token.
>
> Best Regards
>
> Uwe
>
> -----Original Message-----
> From: Ivan Vasilev [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 21, 2008 16:25
> To: java-user@lucene.apache.org
> Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
>
> Hi Uwe,
>
> Could you tell us which Analyzer you used when you measured such a big
> indexing speedup?
> If you use StandardAnalyzer (which uses StandardTokenizer), that may be the
> reason. See the second-to-last report in the thread "Indexing
> Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter, Jake
> Mannix, this is because StandardTokenizer now uses StandardTokenizerImpl,
> which is generated by JFlex instead of JavaCC.
> I am asking because I noticed a great speedup in adding documents to the
> index in our system. We time this step in debug mode. NOW
> THEY ARE ADDED 5 TIMES FASTER!!!
> At the same time, though, the total indexing process in our case improved
> by only about 8%. As our system is very big and complex, I am wondering
> whether the whole indexing process really got that much faster and our own
> system is causing the slowdown, or whether Lucene performs some
> optimizations on the index, merges, or something else that keeps the total
> indexing time from improving as much.
>
> Best Regards,
> Ivan
>
> Uwe Goetzke wrote:
>> This week I switched the Lucene library version on one customer system.
>> The indexing time dropped from 46m32s to 16m20s for the complete task,
>> including optimisation. Great job!
>> We index product catalogs from several suppliers; in this case around
>> 56,000 product groups and 360,000 products, including descriptions, were
>> indexed.
>>
>> Regards
>>
>> Uwe
>>
>> -----------------------------------------------------------------------
>> Healy Hudson GmbH - D-55252 Mainz Kastel
>> Managing Director Christian Konhäuser - Amtsgericht Wiesbaden HRB 12076
>>
>> This email is confidential. If you are not the intended recipient, you must
>> not disclose or use the information contained in it. If you have received
>> this email in error, please tell us immediately by return email and delete
>> the document.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>> __________ NOD32 2913 (20080301) Information __________
>>
>> This message was checked by NOD32 antivirus system.