Incomprehensible (to me) tokenizing behavior

Terry Steichen Thu, 26 Dec 2002 13:41:44 -0800

I tested StandardAnalyzer (which uses StandardTokenizer) by feeding it a set of 
strings, which produced the following results:


"aa/bb/cc/dd" was tokenized into 4 terms: aa, bb, cc, dd
"aa/bb/cc/d1" was tokenized into 3 terms: aa, bb, cc/d1 
"aa/bb/c1/dd" was tokenized into 2 terms: aa, bb/c1/dd
"aa/b1/cc/dd" was tokenized into 2 terms: aa/b1/cc, dd
"a1/bb/cc/dd" was tokenized into 3 terms: a1/bb, cc, dd

It seems that if a segment of the input string contains a digit, the slash ('/') 
immediately preceding and/or following that segment is treated as part of the term.  
Otherwise the slash is apparently treated as a token separator.
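
To make that guess concrete, here is a small self-contained Java sketch of the rule as I understand it.  It only models the output I observed and is not Lucene's tokenizer code; the names SlashModel, tokenize and hasDigit are made up for illustration.  It splits on '/' and then re-joins two neighboring segments whenever either of them contains a digit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the observed behavior; NOT the StandardTokenizer implementation.
class SlashModel {

    // True if the segment contains at least one digit.
    static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Split on '/', keeping the slash between two neighboring segments
    // whenever either neighbor contains a digit.
    static List<String> tokenize(String input) {
        String[] parts = input.split("/", -1);
        List<String> terms = new ArrayList<String>();
        StringBuilder current = new StringBuilder(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            if (hasDigit(parts[i - 1]) || hasDigit(parts[i])) {
                // Slash acts as an ordinary character: extend the current term.
                current.append('/').append(parts[i]);
            } else {
                // Slash acts as a separator: close the current term, start a new one.
                terms.add(current.toString());
                current = new StringBuilder(parts[i]);
            }
        }
        terms.add(current.toString());
        return terms;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "aa/bb/cc/dd", "aa/bb/cc/d1", "aa/bb/c1/dd", "aa/b1/cc/dd", "a1/bb/cc/dd"
        };
        for (String in : inputs) {
            System.out.println("\"" + in + "\" -> " + tokenize(in));
        }
    }
}
```

This model agrees with the five results listed above, but I have not checked it against any other inputs, so it may well diverge from what StandardTokenizer actually does.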

I'm lost.  Assuming this is not a bug, could somebody explain the rhyme and reason 
behind this tokenizing logic?  

Regards,

Terry

PS: Using 1.3-dev 1
