Hello,

Mike and I were discussing some very unrelated stuff and the question of how to handle the empty term came up...
I started thinking about this email:
http://www.lucidimagination.com/search/document/a8d3a8647e581a5b/patternreplacefilterfactory_creating_empty_string_as_a_term#f43d167b91c2ba07

So, looking through the analyzers, I think we should make a decision about what to do with empty terms. In my opinion there is a performance trap here, which might work like this:

1. A user, particularly (say) a Solr user, is using a combination of tokenizers/filters and ends up with the "empty term" as basically a mega-stopword, like what happened to that user.
2. Because of this, their queries have terrible performance, especially if they are 'auto-generating phrase queries' (the Solr default).
3. But nobody can really rely on the analyzers handling empty terms correctly, because we are so inconsistent about it.

Just taking a quick glance through the analyzers, I noticed each one seems to have willy-nilly code/TODOs regarding this empty term. For example, the n-gram-ish tokenizers and filters such as CJKTokenizer, CommonGramsFilter, NGramTokenizer, etc. explicitly avoid creating these. But there are inconsistencies:

- TrimFilter explicitly creates/maintains empty terms.
- NGramFilter doesn't seem to have this check, but NGramTokenizer does.
- PatternFilter documents that it might create empty terms, but PatternTokenizer avoids them.
- I am sure some of the stemmers create empty terms in some situations (e.g. maybe one removes the -alization suffix but has no length check, so the term "alization" becomes the empty term).

Anyway, I think it's possible other users are in this same situation, with slow performance, and not even realizing it yet... Obviously they can fix this if they go and add LengthFilter, but should we be doing something different?
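To make the stemmer case concrete, here is a rough stand-alone sketch in plain Java. It does not use any Lucene classes: `EmptyTermSketch`, `stripSuffix`, `stemAll` and `dropEmptyTerms` are hypothetical names made up for illustration, and `dropEmptyTerms` only approximates what adding a LengthFilter with a minimum length of 1 would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names are Lucene APIs.
class EmptyTermSketch {

    // Naive suffix stripper with no minimum-length check:
    // a term that consists only of the suffix becomes "".
    static String stripSuffix(String term, String suffix) {
        return term.endsWith(suffix)
                ? term.substring(0, term.length() - suffix.length())
                : term;
    }

    // Applies the naive stripper to every term, keeping empty results.
    static List<String> stemAll(List<String> terms, String suffix) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            out.add(stripSuffix(t, suffix));
        }
        return out;
    }

    // Stand-in for a length filter with min length 1: drops zero-length terms.
    static List<String> dropEmptyTerms(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String t : terms) {
            if (!t.isEmpty()) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> stemmed =
                stemAll(Arrays.asList("normalization", "alization"), "alization");
        System.out.println(stemmed);                 // [norm, ]
        System.out.println(dropEmptyTerms(stemmed)); // [norm]
    }
}
```

Without the final filtering step the second token is indexed as the empty term, which is the mega-stopword situation described above.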