Doug,

Aha.  I feel better knowing that I haven't lost my mind, now that I know
what you were trying to do.

As to a suggestion, I would only venture to say that, in its present form,
this results in confusing behavior (as noted in my original message).
Whether that drawback is outweighed by the benefits you mention, I could not
say in a general sense (although from my personal use, it is not).

Regards,

Terry

PS: Is this kind of thing (and more importantly, any other similar design
issues) documented any place?

PPS: What is the simplest way to change this behavior so that a string is
tokenized the same way whether or not it contains numeric characters?


----- Original Message -----
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, December 30, 2002 1:42 PM
Subject: Re: Incomprehensible (to me) tokenizing behavior


> Terry Steichen wrote:
> > I tested StandardAnalyzer (which uses StandardTokenizer) by inputting a
> > set of strings, which produced the following results:
> >
> > "aa/bb/cc/dd" was tokenized into 4 terms: aa, bb, cc, dd
> > "aa/bb/cc/d1" was tokenized into 3 terms: aa, bb, cc/d1
> > "aa/bb/c1/dd" was tokenized into 2 terms: aa, bb/c1/dd
> > "aa/b1/cc/dd" was tokenized into 2 terms: aa/b1/cc, dd
> > "a1/bb/cc/dd" was tokenized into 3 terms: a1/bb, cc, dd
> >
> > It seems that if the input string includes a numerical value, the slash
> > ('/') immediately preceding and/or following it is treated as a character.
> > Otherwise the slash is apparently treated as a token separator.
> >
> > I'm lost.  Assuming this is not a bug, could somebody explain the rhyme
> > and reason to this tokenizing logic?
>
> This is a heuristic that tries to index alphanumeric model and serial
> numbers as a single token, but not to index long hyphenated or slashed
> phrases as a single token.  It requires digits in at least every other
> slash- or dash-delimited segment.
>
> Perhaps it is misguided.  Can you suggest a better heuristic?
>
> The challenge is to index things like "B-17", "F/A-18", "PS/2",
> "802.11a", "0-85152-629-2", etc. as single tokens, but not to index
> things like "once-famous-but-now-forgotten" or
> "red/orange/yellow/green/blue/indigo/violet" as single tokens.
>
> Doug
>
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
>
>
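
For anyone following the thread, the rule Doug describes ("digits in at
least every other slash or dash-delimited segment") can be sketched in a few
lines of Python.  This is only an illustration of the heuristic as stated
above, not Lucene's actual grammar: the names tokenize, alternating_digits
and JOINERS are made up for this sketch, the set of joining characters is an
assumption (the dot is included only because of the "802.11a" example), and
no lowercasing or other filtering is applied.

```python
import re

# Assumption: punctuation that may glue two segments into one token.
JOINERS = "-/."
# A segment is a maximal run of ASCII letters and digits.
SEGMENT = re.compile(r"[A-Za-z0-9]+")


def has_digit(segment):
    """True if the segment contains at least one digit."""
    return any(ch.isdigit() for ch in segment)


def alternating_digits(segments):
    """True if a run of two or more segments has a digit in every
    even-positioned segment or in every odd-positioned segment."""
    if len(segments) < 2:
        return False
    return (all(has_digit(s) for s in segments[0::2])
            or all(has_digit(s) for s in segments[1::2]))


def tokenize(text):
    """Greedy longest-match tokenizer for the 'every other segment has a
    digit' heuristic; anything that does not qualify is split per segment."""
    matches = list(SEGMENT.finditer(text))
    tokens = []
    i = 0
    while i < len(matches):
        best = i  # last segment of the longest acceptable run starting at i
        j = i
        while j + 1 < len(matches):
            gap = text[matches[j].end():matches[j + 1].start()]
            # Only a single joining character may sit between two segments.
            if len(gap) != 1 or gap not in JOINERS:
                break
            j += 1
            run = [m.group() for m in matches[i:j + 1]]
            if alternating_digits(run):
                best = j
        tokens.append(text[matches[i].start():matches[best].end()])
        i = best + 1
    return tokens


if __name__ == "__main__":
    for s in ["aa/bb/cc/dd", "aa/bb/cc/d1", "aa/bb/c1/dd",
              "aa/b1/cc/dd", "a1/bb/cc/dd"]:
        print(s, "->", tokenize(s))
```

Run on the five test strings quoted above, this sketch yields the same
splits that were reported for StandardAnalyzer, which suggests the observed
behavior follows from the heuristic rather than from a bug.  In the sketch
(not in Lucene itself), setting JOINERS to an empty string makes every
string split the same way regardless of digits.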


