Shingles won't do that either, so I suspect you'll have to write a custom tokenizer.
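To make the target concrete: the n-grams Clemens lists are simply the leading substrings of the whole field value, spaces included. Here is a minimal plain-Java sketch of that idea. It is not Lucene code; the class name PhraseEdgeNGrams and the parameters minGram/maxGram are made up for illustration, and a real solution would still have to be wrapped in a Lucene Tokenizer or TokenFilter.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; no Lucene classes are used here.
// PhraseEdgeNGrams, minGram and maxGram are hypothetical names.
final class PhraseEdgeNGrams {

    private PhraseEdgeNGrams() {}

    /**
     * Returns the leading ("front edge") n-grams of the whole phrase.
     * The phrase is treated as a single token, so spaces are kept.
     */
    static List<String> edgeNGrams(String phrase, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        // Never ask for more characters than the phrase has.
        int longest = Math.min(maxGram, phrase.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(phrase.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Brackets make the trailing space in "Merlot " visible.
        for (String gram : edgeNGrams("Merlot del Ticino", 3, 8)) {
            System.out.println("[" + gram + "]");
        }
    }
}
```

With minGram 3 and maxGram 8 this gives the six grams from "Mer" up to "Merlot d", including "Merlot " with its trailing space, which is the sequence asked for in the quoted message.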
Best
Erick

On Wed, May 4, 2011 at 2:07 AM, Clemens Wyss <clemens...@mysign.ch> wrote:
> I know this is just an example.
> But even the WhitespaceAnalyzer takes the words apart, which I don't want. I
> would like the phrases as they are (maximum 3 words, e.g. "Merlot del
> Ticino", ...) to be n-gram-ed. So I want to have the n-grams:
> Mer
> Merl
> Merlo
> Merlot
> Merlot
> Merlot d
> ...
>
> Regards
> Clemens
>> -----Original Message-----
>> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
>> Sent: Tuesday, May 3, 2011 23:12
>> To: java-user@lucene.apache.org
>> Subject: Re: AW: AW: AW: "fuzzy prefix" search
>>
>> Clemens - that's just an example. Stick another tokenizer in there, like
>> WhitespaceTokenizer, for example.
>>
>> Otis
>> ----
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>> ----- Original Message ----
>> > From: Clemens Wyss <clemens...@mysign.ch>
>> > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
>> > Sent: Tue, May 3, 2011 4:31:14 PM
>> > Subject: AW: AW: AW: "fuzzy prefix" search
>> >
>> > But doesn't the KeywordTokenizer extract single words out of the
>> > stream? I would like to create n-grams on the stream (field content) as it
>> > is...
>> >
>> > > -----Original Message-----
>> > > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
>> > > Sent: Tuesday, May 3, 2011 21:31
>> > > To: java-user@lucene.apache.org
>> > > Subject: Re: AW: AW: "fuzzy prefix" search
>> > >
>> > > Clemens,
>> > >
>> > > Something a la:
>> > >
>> > > public TokenStream tokenStream(String fieldName, Reader r) {
>> > >   return new EdgeNGramTokenFilter(new KeywordTokenizer(r),
>> > >       EdgeNGramTokenFilter.Side.FRONT, 1, 4);
>> > > }
>> > >
>> > > Check out page 265 of Lucene in Action 2.
>> > >
>> > > Otis
>> > >
>> > > ----- Original Message ----
>> > > > From: Clemens Wyss <clemens...@mysign.ch>
>> > > > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
>> > > > Sent: Tue, May 3, 2011 12:57:39 PM
>> > > > Subject: AW: AW: "fuzzy prefix" search
>> > > >
>> > > > What does a simple Analyzer look like that just "n-grams" the docs/fields?
>> > > >
>> > > > class SimpleNGramAnalyzer extends Analyzer {
>> > > >   @Override
>> > > >   public TokenStream tokenStream(String fieldName, Reader reader) {
>> > > >     EdgeNGramTokenFilter... ???
>> > > >   }
>> > > > }
>> > > >
>> > > > > -----Original Message-----
>> > > > > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
>> > > > > Sent: Tuesday, May 3, 2011 13:36
>> > > > > To: java-user@lucene.apache.org
>> > > > > Subject: Re: AW: "fuzzy prefix" search
>> > > > >
>> > > > > Hi,
>> > > > >
>> > > > > I didn't read this thread closely, but just in case:
>> > > > > * Is this something you can handle with synonyms?
>> > > > > * If this is for English and you are trying to handle typos, there is a
>> > > > >   list of common English misspellings out there that you could perhaps
>> > > > >   use for this.
>> > > > > * Have you considered n-gramming your tokens? Not sure if this would
>> > > > >   help, didn't read messages/examples closely enough, but you may want
>> > > > >   to look at this if you haven't done so yet.
>> > > > >
>> > > > > Otis
>> > > > >
>> > > > > ----- Original Message ----
>> > > > > > From: Clemens Wyss <clemens...@mysign.ch>
>> > > > > > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
>> > > > > > Sent: Tue, May 3, 2011 5:25:30 AM
>> > > > > > Subject: AW: "fuzzy prefix" search
>> > > > > >
>> > > > > > >PrefixQuery
>> > > > > > I'd like the combination of prefix and fuzzy ;-) because people could
>> > > > > > also type "menlo" or "märl", and in any of these cases I'd like to get
>> > > > > > a hit on Merlot (for suggesting Merlot).
>> > > > > >
>> > > > > > > -----Original Message-----
>> > > > > > > From: Ian Lea [mailto:ian....@gmail.com]
>> > > > > > > Sent: Tuesday, May 3, 2011 11:22
>> > > > > > > To: java-user@lucene.apache.org
>> > > > > > > Subject: Re: "fuzzy prefix" search
>> > > > > > >
>> > > > > > > I'd assumed that FuzzyQuery wouldn't ignore case but I could be wrong.
>> > > > > > > What would be the edit distance between "mer" and "merlot"? Would
>> > > > > > > it be less than 1.5, which I reckon would be the value of
>> > > > > > > length(term)*0.5 as detailed in the javadocs? Seems unlikely, but
>> > > > > > > I don't really know anything about the Levenshtein (edit distance)
>> > > > > > > algorithm as used by FuzzyQuery.
>> > > > > > > Wouldn't a PrefixQuery be more appropriate here?
>> > > > > > >
>> > > > > > > --
>> > > > > > > Ian.
>> > > > > > >
>> > > > > > > On Tue, May 3, 2011 at 10:10 AM, Clemens Wyss <clemens...@mysign.ch> wrote:
>> > > > > > > > Unfortunately lowercasing doesn't help.
>> > > > > > > > Also, doesn't the FuzzyQuery ignore casing?
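On Ian's question about the edit distance between "mer" and "merlot": it can be checked with a few lines of plain Java. The sketch below is a textbook Levenshtein implementation, not the code FuzzyQuery uses internally, and the length(term)*0.5 bound is simply Ian's reading of the javadocs taken at face value. The class name FuzzyPrefixCheck is made up.

```java
// Textbook Levenshtein distance; an illustration only, not FuzzyQuery's
// internal implementation. FuzzyPrefixCheck is a hypothetical name.
final class FuzzyPrefixCheck {

    private FuzzyPrefixCheck() {}

    /** Two-row dynamic-programming edit distance between a and b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two arrays instead of allocating a new row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String query = "mer";
        String indexed = "merlot";
        int distance = levenshtein(query, indexed);
        // Assumption: the bound is length(term) * 0.5, as Ian reads the javadocs.
        double allowed = query.length() * 0.5;
        System.out.println("distance=" + distance + ", allowed=" + allowed);
    }
}
```

The distance from "mer" to "merlot" is 3 (three insertions), which is above a bound of 1.5, so if Ian's reading is right a plain FuzzyQuery on "mer" cannot match "merlot" whatever the casing. "merlott" is only 1 edit away from "merlot", which fits the observation that only terms of roughly full length hit.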
>> > > > > > > >
>> > > > > > > >> -----Original Message-----
>> > > > > > > >> From: Ian Lea [mailto:ian....@gmail.com]
>> > > > > > > >> Sent: Tuesday, May 3, 2011 11:06
>> > > > > > > >> To: java-user@lucene.apache.org
>> > > > > > > >> Subject: Re: "fuzzy prefix" search
>> > > > > > > >>
>> > > > > > > >> Mer != mer. The latter will be what is indexed because
>> > > > > > > >> StandardAnalyzer calls LowerCaseFilter.
>> > > > > > > >>
>> > > > > > > >> --
>> > > > > > > >> Ian.
>> > > > > > > >>
>> > > > > > > >> On Tue, May 3, 2011 at 9:56 AM, Clemens Wyss <clemens...@mysign.ch> wrote:
>> > > > > > > >> > Sorry for coming back to my issue. Can anybody explain why my
>> > > > > > > >> > "simple" unit test below fails? Any hint/help appreciated.
>> > > > > > > >> >
>> > > > > > > >> > Directory directory = new RAMDirectory();
>> > > > > > > >> > IndexWriter indexWriter = new IndexWriter( directory,
>> > > > > > > >> >     new StandardAnalyzer( Version.LUCENE_31 ),
>> > > > > > > >> >     IndexWriter.MaxFieldLength.UNLIMITED );
>> > > > > > > >> > Document document = new Document();
>> > > > > > > >> > document.add( new Field( "test", "Merlot",
>> > > > > > > >> >     Field.Store.YES, Field.Index.ANALYZED ) );
>> > > > > > > >> > indexWriter.addDocument( document );
>> > > > > > > >> > IndexReader indexReader = indexWriter.getReader();
>> > > > > > > >> > IndexSearcher searcher = new IndexSearcher( indexReader );
>> > > > > > > >> > Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.5f, 0, 10 );
>> > > > > > > >> > // or Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.5f );
>> > > > > > > >> > TopDocs result = searcher.search( q, 10 );
>> > > > > > > >> > Assert.assertEquals( 1, result.totalHits );
>> > > > > > > >> >
>> > > > > > > >> > - Clemens
>> > > > > > > >> >
>> > > > > > > >> >> -----Original Message-----
>> > > > > > > >> >> From: Clemens Wyss [mailto:clemens...@mysign.ch]
>> > > > > > > >> >> Sent: Monday, May 2, 2011 23:01
>> > > > > > > >> >> To: java-user@lucene.apache.org
>> > > > > > > >> >> Subject: AW: "fuzzy prefix" search
>> > > > > > > >> >>
>> > > > > > > >> >> Is it the combination of FuzzyQuery and Term which makes the
>> > > > > > > >> >> search go for "word boundaries"?
>> > > > > > > >> >>
>> > > > > > > >> >> > -----Original Message-----
>> > > > > > > >> >> > From: Clemens Wyss [mailto:clemens...@mysign.ch]
>> > > > > > > >> >> > Sent: Monday, May 2, 2011 14:13
>> > > > > > > >> >> > To: java-user@lucene.apache.org
>> > > > > > > >> >> > Subject: AW: "fuzzy prefix" search
>> > > > > > > >> >> >
>> > > > > > > >> >> > I tried this too, but unfortunately I only get hits when
>> > > > > > > >> >> > the search term is at least as long as the word to be looked up.
>> > > > > > > >> >> >
>> > > > > > > >> >> > E.g.:
>> > > > > > > >> >> > ...
>> > > > > > > >> >> > Directory directory = new RAMDirectory();
>> > > > > > > >> >> > IndexWriter indexWriter = new IndexWriter( directory,
>> > > > > > > >> >> >     IndexManager.getIndexingAnalyzer( LOCALE_DE ),
>> > > > > > > >> >> >     IndexWriter.MaxFieldLength.UNLIMITED );
>> > > > > > > >> >> >
>> > > > > > > >> >> > Document document = new Document();
>> > > > > > > >> >> > document.add( new Field( "test", "Merlot",
>> > > > > > > >> >> >     Field.Store.YES, Field.Index.ANALYZED ) );
>> > > > > > > >> >> > indexWriter.addDocument( document );
>> > > > > > > >> >> >
>> > > > > > > >> >> > IndexReader indexReader = indexWriter.getReader();
>> > > > > > > >> >> > IndexSearcher searcher = new IndexSearcher( indexReader );
>> > > > > > > >> >> >
>> > > > > > > >> >> > Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.6f, 1 );
>> > > > > > > >> >> > TopDocs result = searcher.search( q, 10 );
>> > > > > > > >> >> > Assert.assertEquals( 1, result.totalHits );
>> > > > > > > >> >> > ...
>> > > > > > > >> >> >
>> > > > > > > >> >> > > -----Original Message-----
>> > > > > > > >> >> > > From: Uwe Schindler [mailto:u...@thetaphi.de]
>> > > > > > > >> >> > > Sent: Monday, May 2, 2011 13:50
>> > > > > > > >> >> > > To: java-user@lucene.apache.org
>> > > > > > > >> >> > > Subject: RE: "fuzzy prefix" search
>> > > > > > > >> >> > >
>> > > > > > > >> >> > > Hi,
>> > > > > > > >> >> > >
>> > > > > > > >> >> > > You can pass an integer to FuzzyQuery which defines the
>> > > > > > > >> >> > > number of characters that are seen as prefix. So all
>> > > > > > > >> >> > > terms must match this prefix and the rest of each term is
>> > > > > > > >> >> > > matched using fuzzy.
>> > > > > > > >> >> > >
>> > > > > > > >> >> > > Uwe
>> > > > > > > >> >> > >
>> > > > > > > >> >> > > -----
>> > > > > > > >> >> > > Uwe Schindler
>> > > > > > > >> >> > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > > > > > >> >> > > http://www.thetaphi.de
>> > > > > > > >> >> > > eMail: u...@thetaphi.de
>> > > > > > > >> >> > >
>> > > > > > > >> >> > > > -----Original Message-----
>> > > > > > > >> >> > > > From: Clemens Wyss [mailto:clemens...@mysign.ch]
>> > > > > > > >> >> > > > Sent: Monday, May 02, 2011 1:47 PM
>> > > > > > > >> >> > > > To: java-user@lucene.apache.org
>> > > > > > > >> >> > > > Subject: "fuzzy prefix" search
>> > > > > > > >> >> > > >
>> > > > > > > >> >> > > > I'd like to search fuzzily but not on a full term.
>> > > > > > > >> >> > > > E.g.
>> > > > > > > >> >> > > > I have a text "Merlot del Ticino"
>> > > > > > > >> >> > > > I'd like
>> > > > > > > >> >> > > > "mer", "merr", "melo", ... to match.
>> > > > > > > >> >> > > >
>> > > > > > > >> >> > > > If I use FuzzyQuery, only "merlot", "merlott" hit. What
>> > > > > > > >> >> > > > Query-combination should I use?
>> > > > > > > >> >> > > >
>> > > > > > > >> >> > > > Thx
>> > > > > > > >> >> > > > Clemens
>> > > > > > > >> >> > > >
>> > > > > > > >> >> > > > ---------------------------------------------------------------------
>> > > > > > > >> >> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > > > > > > >> >> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org