I tested StandardAnalyzer (which uses StandardTokenizer) by inputing the a set of
strings which produced the following results:
"aa/bb/cc/dd" was tokenized into 4 terms: aa, bb, cc, dd
"aa/bb/cc/d1" was tokenized into 3 terms: aa, bb, cc/d1
"aa/bb/c1/dd" was tokenized into 2 terms: aa, bb/c1/dd
"aa/b1/cc/dd" was tokenized into 2 terms: aa/b1/cc, dd
"a1/bb/cc/dd" was tokenized into 3 terms: a1/bb, cc, dd
It seems that if the input string includes a numerical value, any first preceeding
and/or next following slash ('/') is treated as a character. Otherwise the slash is
apparently treated as a token separator.
I'm lost. Assuming this is not a bug, could somebody explain the rhyme and reason to
this tokenizing logic?
Regards,
Terry
PS: Using 1.3-dev 1