Terry Steichen wrote:
I tested StandardAnalyzer (which uses StandardTokenizer) by feeding it a set of strings, which produced the following results:

"aa/bb/cc/dd" was tokenized into 4 terms: aa, bb, cc, dd
"aa/bb/cc/d1" was tokenized into 3 terms: aa, bb, cc/d1 "aa/bb/c1/dd" was tokenized into 2 terms: aa, bb/c1/dd
"aa/b1/cc/dd" was tokenized into 2 terms: aa/b1/cc, dd
"a1/bb/cc/dd" was tokenized into 3 terms: a1/bb, cc, dd

It seems that if the input string includes a numerical value, the slash ('/') immediately preceding and/or following it is treated as an ordinary character within the token. Otherwise the slash is apparently treated as a token separator.

I'm lost. Assuming this is not a bug, could somebody explain the rhyme and reason to this tokenizing logic?
This is a heuristic that tries to index alphanumeric model and serial numbers as a single token, but not to index long hyphenated or slashed phrases as a single token. It requires digits in at least every other slash- or dash-delimited segment.

Perhaps it is misguided. Can you suggest a better heuristic?

The challenge is to index things like "B-17", "F/A-18", "PS/2", "802.11a", "0-85152-629-2", etc. as single tokens, but not to index things like "once-famous-but-now-forgotten" or "red/orange/yellow/green/blue/indigo/violet" as single tokens.
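
The rule as described can be modelled with a small standalone check, sketched below. This is purely an illustration, not the actual StandardTokenizer grammar, which works by longest-match inside JFlex rather than a whole-string test; the class and method names are made up. The rule modelled: a slash- or dash-delimited string is kept as one token only if no two adjacent segments are both digit-free, i.e. digits appear in at least every other segment.

    public class SerialNumberHeuristic {

        // True if digits appear in at least every other slash/dash segment.
        static boolean keepAsSingleToken(String s) {
            String[] segments = s.split("[/\\-]");
            for (int i = 0; i + 1 < segments.length; i++) {
                // Two digit-free segments in a row break the pattern.
                if (!hasDigit(segments[i]) && !hasDigit(segments[i + 1])) {
                    return false;
                }
            }
            return true;
        }

        static boolean hasDigit(String s) {
            for (int i = 0; i < s.length(); i++) {
                if (Character.isDigit(s.charAt(i))) {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            String[] samples = {
                "B-17", "PS/2", "0-85152-629-2",             // kept whole
                "F/A-18",                                     // split: "F" and "A" are adjacent digit-free segments
                "once-famous-but-now-forgotten",              // split
                "red/orange/yellow/green/blue/indigo/violet"  // split
            };
            for (String s : samples) {
                System.out.println(s + " -> "
                    + (keepAsSingleToken(s) ? "single token" : "split"));
            }
        }
    }

Note that under this simplified model "F/A-18" is not kept whole, since "F" and "A" are adjacent segments with no digits, which illustrates one way the heuristic can fall short of the stated goal.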

Doug

