Hey everyone,

Tokenization seems inherently fuzzy and imprecise, yet Lucene does not appear to provide an easy mechanism to account for this fuzziness.
Let's take an example, where the document I'm indexing is:

    "v1.1.0 mr. jones da...@gmail.com"

I may want to tokenize this as follows:

    ["v1.1.0", "mr", "jones", "da...@gmail.com"]

...or I may want to tokenize it as follows:

    ["v1", "1.0", "mr", "jones", "david", "gmail.com"]

...or in some other way entirely. The best approach would seem to be indexing with multiple strategies at once, e.g.:

    ["v1.1.0", "v1", "1.0", "mr", "jones", "da...@gmail.com", "david", "gmail.com"]

However, this destroys phrase queries. And while Lucene lets you index multiple tokens at the same position, I haven't found a way to handle cases where you want to index a *set* of tokens at one position -- nor does that even make sense. For instance, I can't index ["david", "gmail.com"] in the same position as "da...@gmail.com".

So:

- Any thoughts, in general, about how you all approach this fuzziness? Do you just choose one tokenization strategy and hope for the best?
- Might there be a way to use multiple strategies and *not* break phrase queries that I'm overlooking?

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Tokenization-and-Fuzziness-How-to-Allow-Multiple-Strategies-tp2444956p2444956.html
Sent from the Solr - Dev mailing list archive at Nabble.com.
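P.S. To make the position problem concrete, here's a minimal sketch in plain Python (deliberately not Lucene's actual API) of how position increments work: each token advances the position by its increment, an increment of 0 stacks an alternative on the current position, and a phrase query matches only if its terms occur at consecutive positions. The token lists below mirror the example strategies from the message above.

```python
def positions(stream):
    """Assign absolute positions from (term, position_increment) pairs,
    in the style of Lucene's positionIncrement: inc=0 stacks a token
    on the same position as the previous one."""
    pos, out = -1, []
    for term, inc in stream:
        pos += inc
        out.append((term, pos))
    return out

def phrase_match(indexed, phrase):
    """True if the phrase's terms occur at consecutive positions."""
    pos_map = {}
    for term, pos in indexed:
        pos_map.setdefault(term, set()).add(pos)
    return any(all(p + i in pos_map.get(t, set())
                   for i, t in enumerate(phrase))
               for p in pos_map.get(phrase[0], set()))

# Single strategy: positions 0, 1, 2, 3 -- phrase queries behave.
one_strategy = [("v1.1.0", 1), ("mr", 1), ("jones", 1),
                ("da...@gmail.com", 1)]

# Multiple strategies interleaved: "v1" stacks on "v1.1.0" (inc=0),
# but the extra token "1.0" must consume a position of its own,
# pushing "mr" to position 2.
multi_strategy = [("v1.1.0", 1), ("v1", 0), ("1.0", 1),
                  ("mr", 1), ("jones", 1),
                  ("da...@gmail.com", 1), ("david", 0), ("gmail.com", 1)]

print(phrase_match(positions(one_strategy), ["v1.1.0", "mr"]))    # True
print(phrase_match(positions(multi_strategy), ["v1.1.0", "mr"]))  # False: "mr" shifted
print(phrase_match(positions(multi_strategy), ["mr", "jones"]))   # True: this pair survived
```

This is exactly the breakage described above: only the *first* token of a multi-token alternative can share a position, so any alternative longer than one token shifts everything after it relative to the other analysis.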