On Sep 4, 2008, at 2:43 AM, Shai Erera wrote:
Hi
The COMPANY rule in StandardTokenizer is defined like this:
// Company names like AT&T and [EMAIL PROTECTED]
COMPANY = {ALPHA} ("&"|"@") {ALPHA}
While this works perfectly for AT&T and [EMAIL PROTECTED], it doesn't work
well for strings like widget&javascript&html. Now, the latter is
obviously wrongly typed, and should have been separated by spaces,
but that's what a user typed in a document, and now we need to treat
it right (why don't they understand the rules of IR and
tokenization?). Normally I wouldn't care and say this is one of the
extreme cases, but unfortunately the tokenizer outputs two tokens:
widget&javascript and html. Now that bothers me - the user can
search for "html" and find the document, but not "javascript" or
"widget", which is a bit harder to explain to users, even the
intelligent ones.
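
To make the behavior above concrete, here is a minimal Java sketch that imitates the rule with a regular expression. It is not the JFlex grammar itself: {ALPHA} is assumed to be a plain run of ASCII letters, the only other rule modeled is a fallback run of letters, and the class name CompanyRuleSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompanyRuleSketch {
    // Rough stand-in for: COMPANY = {ALPHA} ("&"|"@") {ALPHA}
    // with a plain run of letters as the fallback token.
    // Assumption: {ALPHA} is approximated as [A-Za-z]+.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z]+[&@][A-Za-z]+|[A-Za-z]+");

    // Scans left to right and collects every match; characters that
    // match neither alternative (such as a leftover "&") are skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("AT&T"));
        System.out.println(tokenize("widget&javascript&html"));
    }
}
```

Under these assumptions the second input comes out as widget&javascript and html, which is the split described above: the first two parts are consumed as one COMPANY token and only the last part is left as a searchable word.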
That got me thinking about whether this rule is properly defined, and
what's the purpose of it. Obviously it's an attempt to not break
legal company names on "&" and "@", but I'm not sure it covers all
company name formats. For example, AT&T can be written as "AT &
T" (with spaces) and I've also seen cases where it's written as ATT.
While you could say "it's a best effort case", users don't buy that.
Either you do something properly (doesn't have to be 100% accurate
though), or you don't do it at all (I hope that doesn't sound too
harsh). That way it's easy to explain to your users that you simply
break on "&" or "@" (unless it's an email). They may not like it,
but you'll at least be consistent.
I do think that is a bit harsh. You can hardly expect the computer to
be perfect when humans aren't either. There are plenty of cases where
two people won't agree on what is proper either. This stuff is always
a balancing act.
I do, however, think this goes beyond COMPANY, and covers ACRONYM (to
a lesser extent) and HOST as well (See also LUCENE-1373), and that we
shouldn't be in the game of implying semantic meaning from
StandardTokenizer/Filter altogether. That is, my bigger concern is
that the tokenizer labels things as COMPANY or ACRONYM or HOST at all,
or better put, that users assume those types have any meaning outside
of the fact that they are simple labels that are a bit easier to
understand than TOKEN_TYPE_2 or something like that.
This rule slows down StandardTokenizer's tokenization, and in the
end does not produce consistent results. If we think it's
important to detect these tokens, then let's at least make it
consistent by either:
- changing the rule to {ALPHA} (("&"|"@") {ALPHA})+, thereby
recognizing "AT&T", and "widget&javascript&html" as COMPANY. That at
least will allow developers to put a CompanyTokenFilter (for
example) after the tokenizer to break on "&" and "@" whenever there
are more than 2 parts. We could also modify StandardFilter (which
already handles ACRONYM) to handle COMPANY that way.
- changing the rule to {ALPHA} ("&"|"@") {ALPHA} ({P} | "!" | "?")
so that we recognize company names only if the pattern is followed
by a space, dot, dash, underscore, exclamation mark or question
mark. That'll still recognize AT&T, but won't recognize
widget&javascript&html as COMPANY (which is good).
If I had to choose, this sounds reasonable.
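
As a rough illustration of the two options, the Java sketch below approximates each rule with a regular expression. Several things here are assumptions rather than anything in StandardTokenizer: {ALPHA} is taken to be a run of ASCII letters, the delimiter set for the second option follows the prose description (space, dot, dash, underscore, "!" and "?") instead of the real {P} macro, the delimiter is checked with a lookahead so it is not consumed, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

class CompanyOptionsSketch {
    // Option 1: {ALPHA} (("&"|"@") {ALPHA})+
    // Assumption: {ALPHA} is approximated as [A-Za-z]+.
    private static final Pattern OPTION_1 =
        Pattern.compile("[A-Za-z]+([&@][A-Za-z]+)+");

    // Option 2: the two-part pattern, accepted only when one of the
    // listed delimiters follows. End of input is not treated as a
    // delimiter here, matching the rule as written.
    private static final Pattern OPTION_2 =
        Pattern.compile("[A-Za-z]+[&@][A-Za-z]+(?=[\\s._\\-!?])");

    // Hypothetical filter step for option 1: keep two-part names
    // whole, split chains with more than two parts on "&" and "@".
    static List<String> splitCompany(String token) {
        if (!OPTION_1.matcher(token).matches()) {
            return Collections.singletonList(token);
        }
        String[] parts = token.split("[&@]");
        return parts.length > 2
            ? Arrays.asList(parts)
            : Collections.singletonList(token);
    }

    // Option 2 check: is the text at the start of the input
    // recognized as COMPANY? Only the start is examined.
    static boolean startsWithCompany(String text) {
        return OPTION_2.matcher(text).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(splitCompany("AT&T"));
        System.out.println(splitCompany("widget&javascript&html"));
        System.out.println(startsWithCompany("AT&T is a company"));
        System.out.println(startsWithCompany("widget&javascript&html"));
    }
}
```

With the first option the decision is deferred to a later filter, which can split the three-part string into separately searchable words. With the second, the decision is made at match time, and the three-part string is never recognized as COMPANY at its start.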