The UAX29URLEmailAnalyzer in Lucene 4.4 is not working as I expected. Is this a bug in the analyzer, or is it working as designed?
If I use the UAX29URLEmailAnalyzer, it tokenizes the following strings like this:

  input=bwl-esl2.gbr.hp.com   output=[bwl-esl2.gbr.hp.com]
  input=esl2.gbr              output=[esl2.gb][r]
  input=bwl-esl2              output=[bwl][esl2]
  input=bwl.esl2.gbr.hp.com   output=[bwl.esl2.gbr.hp.com]

The first 2 seem wrong to me.

For the first one, it seems as though the analyzer thinks there is an @ instead of the - in bwl-esl2.gbr.hp.com (i.e. b...@esl2.gbr.hp.com), in which case keeping it as a single token would make sense.

The second one is even more difficult to understand. A word is not split at a period when the period has letters on both sides or digits on both sides. But in this case there is a digit on the left of the period and a letter on the right. And the splitting off of the letter r as its own token is even more puzzling.

By contrast, the StandardAnalyzer works as I expect:

  input=bwl-esl2.gbr.hp.com   output=[bwl][esl2][gbr.hp.com]
  input=bwl-esl2              output=[bwl][esl2]
  input=bwl.esl2.gbr.hp.com   output=[bwl.esl2][gbr.hp.com]
  input=esl2.gbr              output=[esl2][gbr]

Any insights would be appreciated.

--
Regards
Milind
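
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the token lists above can be printed. It assumes Lucene 4.4 (lucene-core and lucene-analyzers-common) on the classpath. The class name TokenDump and the field name "f" are placeholders of mine, not anything from Lucene or from my real code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class: prints the tokens each analyzer emits per input.
public class TokenDump {

    // Runs the analyzer over the text and collects the emitted terms in order.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<String>();
        // "f" is an arbitrary field name; these analyzers do not depend on it.
        TokenStream ts = analyzer.tokenStream("f", new StringReader(text));
        try {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                       // required before incrementToken()
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        } finally {
            ts.close();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String[] inputs = {
            "bwl-esl2.gbr.hp.com", "esl2.gbr", "bwl-esl2", "bwl.esl2.gbr.hp.com"
        };
        Analyzer[] analyzers = {
            new UAX29URLEmailAnalyzer(Version.LUCENE_44),
            new StandardAnalyzer(Version.LUCENE_44)
        };
        for (Analyzer a : analyzers) {
            System.out.println(a.getClass().getSimpleName());
            for (String in : inputs) {
                System.out.println("  input=" + in + " output=" + tokens(a, in));
            }
        }
    }
}
```

Both analyzers are constructed with their default stop word sets here, which should not matter for these inputs since none of them is an English stop word.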