Incorrect tokenizing in the UAX29URLEmailAnalyzer analyzer?

Milind Wed, 23 Jul 2014 10:50:30 -0700

The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
expected.  Is this a bug in the analyzer or is this working as designed?


With the UAX29URLEmailAnalyzer, the following strings are tokenized as:
    input=bwl-esl2.gbr.hp.com
    output=[bwl-esl2.gbr.hp.com]

    input=esl2.gbr
    output=[esl2.gb][r]

    input=bwl-esl2
    output=[bwl][esl2]

    input=bwl.esl2.gbr.hp.com
    output=[bwl.esl2.gbr.hp.com]

The first 2 seem wrong to me.  It seems as though it thinks there is an @
instead of the - in bwl-esl2.gbr.hp.com (i.e. b...@esl2.gbr.hp.com), in
which case the tokenizing would make sense.  The second one is even more
difficult to understand.  As far as I can tell, a word is not split at a
period when there are letters on both sides or digits on both sides.  But
in this case there is a digit on the left and a letter on the right of the
period, so I would expect a split there.  Instead the split lands before
the final letter r, which is even more puzzling.

By contrast, the standard analyzer works as I expect:
    input=bwl-esl2.gbr.hp.com
    output=[bwl][esl2][gbr.hp.com]

    input=bwl-esl2
    output=[bwl][esl2]

    input=bwl.esl2.gbr.hp.com
    output=[bwl.esl2][gbr.hp.com]

    input=esl2.gbr
    output=[esl2][gbr]
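
To make the expected behaviour concrete, here is a rough sketch in plain
Java of the rule as I understand it: a period stays inside a token only
when it sits between two letters or between two digits, and any other
non-alphanumeric character (such as the hyphen) always splits.  This is a
simplified model written only for illustration, not Lucene's actual
implementation; the class name SimpleWordBreak and its helper methods are
made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
class SimpleWordBreak {

    // Splits text with the simplified rule: letters and digits extend the
    // current token, a '.' extends it only between two letters or between
    // two digits, and every other character ends the current token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (c == '.' && joinsNeighbors(text, i)) {
                current.append(c);
            } else {
                flush(current, tokens);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // True when the character at index i has letters on both sides
    // or digits on both sides.
    static boolean joinsNeighbors(String text, int i) {
        if (i == 0 || i == text.length() - 1) {
            return false;
        }
        char left = text.charAt(i - 1);
        char right = text.charAt(i + 1);
        return (Character.isLetter(left) && Character.isLetter(right))
                || (Character.isDigit(left) && Character.isDigit(right));
    }

    // Moves the pending token, if any, into the result list.
    static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String[] inputs = {
            "bwl-esl2.gbr.hp.com", "bwl-esl2", "bwl.esl2.gbr.hp.com", "esl2.gbr"
        };
        for (String input : inputs) {
            StringBuilder out = new StringBuilder();
            for (String token : tokenize(input)) {
                out.append('[').append(token).append(']');
            }
            System.out.println("    input=" + input);
            System.out.println("    output=" + out);
        }
    }
}
```

For the four inputs above, this sketch yields the same tokens as the
standard analyzer output listed here, which is why the
UAX29URLEmailAnalyzer result for esl2.gbr surprises me.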

Any insights would be appreciated.

-- 
Regards
Milind
