Hey everyone,

Tokenization seems inherently fuzzy and imprecise, yet Lucene does not appear to provide an easy mechanism to account for this fuzziness.
Let's take an example, where the document I'm indexing is:

    "v1.1.0 mr. jones da...@gmail.com"

I may want to tokenize this as follows:

    ["v1.1.0", "mr", "jones", "da...@gmail.com"]

...or I may want to tokenize it as follows:

    ["v1", "1.0", "mr", "jones", "david", "gmail.com"]

...or in some other way entirely. The best approach would seem to be indexing with multiple strategies at once, e.g.:

    ["v1.1.0", "v1", "1.0", "mr", "jones", "da...@gmail.com", "david", "gmail.com"]

However, this destroys phrase queries. And while Lucene lets you index multiple tokens at the same position, I haven't found a way to handle cases where you want to index a *set* of tokens at one position -- nor does that even make sense. For instance, I can't index ["david", "gmail.com"] in the same position as "da...@gmail.com".

So:

- Any thoughts, in general, about how you all approach this fuzziness? Do you just choose one tokenization strategy and hope for the best?
- Might there be a way to use multiple strategies and *not* break phrase queries that I'm overlooking?

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Tokenization-and-Fuzziness-How-to-Allow-Multiple-Strategies-tp2444956p2444956.html
Sent from the Solr - Dev mailing list archive at Nabble.com.
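P.S. To make the position problem concrete, here's a minimal sketch in plain Python (deliberately not Lucene's actual API) of how position increments work: each token advances the position by its increment, an increment of 0 stacks an alternative on the current position, and a phrase query matches only if its terms occur at consecutive positions. The token lists below mirror the example strategies from the message above.

```python
def positions(stream):
    """Assign absolute positions from (term, position_increment) pairs,
    in the style of Lucene's positionIncrement: inc=0 stacks a token
    on the same position as the previous one."""
    pos, out = -1, []
    for term, inc in stream:
        pos += inc
        out.append((term, pos))
    return out

def phrase_match(indexed, phrase):
    """True if the phrase's terms occur at consecutive positions."""
    pos_map = {}
    for term, pos in indexed:
        pos_map.setdefault(term, set()).add(pos)
    return any(all(p + i in pos_map.get(t, set())
                   for i, t in enumerate(phrase))
               for p in pos_map.get(phrase[0], set()))

# Single strategy: positions 0, 1, 2, 3 -- phrase queries behave.
one_strategy = [("v1.1.0", 1), ("mr", 1), ("jones", 1),
                ("da...@gmail.com", 1)]

# Multiple strategies interleaved: "v1" stacks on "v1.1.0" (inc=0),
# but the extra token "1.0" must consume a position of its own,
# pushing "mr" to position 2.
multi_strategy = [("v1.1.0", 1), ("v1", 0), ("1.0", 1),
                  ("mr", 1), ("jones", 1),
                  ("da...@gmail.com", 1), ("david", 0), ("gmail.com", 1)]

print(phrase_match(positions(one_strategy), ["v1.1.0", "mr"]))    # True
print(phrase_match(positions(multi_strategy), ["v1.1.0", "mr"]))  # False: "mr" shifted
print(phrase_match(positions(multi_strategy), ["mr", "jones"]))   # True: this pair survived
```

This is exactly the breakage described above: only the *first* token of a multi-token alternative can share a position, so any alternative longer than one token shifts everything after it relative to the other analysis.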