Re: Is the COMPANY rule in StandardTokenizer valid?

Grant Ingersoll Thu, 04 Sep 2008 05:47:46 -0700


On Sep 4, 2008, at 2:43 AM, Shai Erera wrote:

Hi

The COMPANY rule in StandardTokenizer is defined like this:

// Company names like AT&T and [EMAIL PROTECTED]
COMPANY    =  {ALPHA} ("&"|"@") {ALPHA}
While this works perfectly for AT&T and [EMAIL PROTECTED], it doesn't work well for strings like widget&javascript&html. Now, the latter is obviously wrongly typed, and should have been separated by spaces, but that's what a user typed in a document, and now we need to treat it right (why don't they understand the rules of IR and tokenization?). Normally I wouldn't care and say this is one of the extreme cases, but unfortunately the tokenizer outputs two tokens: widget&javascript and html. Now that bothers me - the user can search for "html" and find the document, but not "javascript" or "widget", which is a bit harder to explain to users, even the intelligent ones.
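
The behavior described above can be reproduced with a rough sketch. This is not the real JFlex grammar: it assumes ASCII-only letters and uses an ordered java.util.regex alternation as a stand-in for the COMPANY rule plus a plain alphanumeric rule. The class name CompanyRuleSketch is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-in for two StandardTokenizer rules; NOT the real grammar.
class CompanyRuleSketch {
    // First alternative mimics COMPANY = {ALPHA} ("&"|"@") {ALPHA};
    // the second is a plain alphanumeric run. ASCII only (assumption).
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z]+[&@][A-Za-z]+|[A-Za-z0-9]+");

    // Scans left to right and collects every match; characters that match
    // neither alternative (such as a stray "&") are skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("AT&T"));                   // [AT&T]
        System.out.println(tokenize("widget&javascript&html")); // [widget&javascript, html]
    }
}
```

The second call shows the asymmetry: the rule consumes exactly one "&" pair, so "html" ends up as its own token while "widget" and "javascript" stay glued together.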
That got me thinking about whether this rule is properly defined, and what its purpose is. Obviously it's an attempt to not break legal company names on "&" and "@", but I'm not sure it covers all company name formats. For example, AT&T can be written as "AT & T" (with spaces) and I've also seen cases where it's written as ATT.
While you could say "it's a best effort case", users don't buy that. Either you do something properly (doesn't have to be 100% accurate though), or you don't do it at all (I hope that doesn't sound too harsh). That way it's easy to explain to your users that you simply break on "&" or "@" (unless it's an email). They may not like it, but you'll at least be consistent.

I do think that is a bit harsh. You can hardly expect the computer to be perfect when humans aren't either. There are plenty of cases where two people won't agree on what is proper either. This stuff is always a balancing act.

I do, however, think this goes beyond COMPANY, and covers ACRONYM (to a lesser extent) and HOST as well (see also LUCENE-1373), and that we shouldn't be in the game of implying semantic meaning from StandardTokenizer/Filter altogether. That is, my bigger concern is that the tokenizer labels things as COMPANY or ACRONYM or HOST at all, or better put, that users assume those types have any meaning outside of the fact that they are simple labels that are a bit easier to understand than TOKEN_TYPE_2 or something like that.

This rule slows StandardTokenizer's tokenization time, and ultimately does not produce consistent results. If we think it's important to detect these tokens, then let's at least make it consistent by either:
- changing the rule to {ALPHA} (("&"|"@") {ALPHA})+, thereby recognizing "AT&T" and "widget&javascript&html" as COMPANY. That at least will allow developers to put a CompanyTokenFilter (for example) after the tokenizer to break on "&" and "@" whenever there are more than 2 parts. We could also modify StandardFilter (which already handles ACRONYM) to handle COMPANY that way.
- changing the rule to {ALPHA} ("&"|"@") {ALPHA} ({P} | "!" | "?") so that we recognize company names only if the pattern is followed by a space, dot, dash, underscore, exclamation mark or question mark. That'll still recognize AT&T, but won't recognize widget&javascript&html as COMPANY (which is good).


If I had to choose, this sounds reasonable.
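
For illustration only, the first of the two options quoted above (the multi-part rule followed by a filter that splits on "&" and "@" when there are more than 2 parts) might look roughly like this. It is a regex-based sketch, assuming ASCII-only letters, and it is not a Lucene TokenFilter: the class MultiPartCompanySketch and its split step are hypothetical stand-ins for the proposed rule and for the CompanyTokenFilter idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the multi-part COMPANY rule plus a splitting step.
class MultiPartCompanySketch {
    // Mimics {ALPHA} (("&"|"@") {ALPHA})+ ; ASCII only (assumption).
    private static final Pattern COMPANY =
        Pattern.compile("[A-Za-z]+(?:[&@][A-Za-z]+)+");
    // COMPANY is tried first, then a plain alphanumeric run.
    private static final Pattern TOKEN =
        Pattern.compile(COMPANY.pattern() + "|[A-Za-z0-9]+");

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            if (COMPANY.matcher(token).matches()) {
                String[] parts = token.split("[&@]");
                // Stand-in for the filter: break up only when more than 2 parts.
                if (parts.length > 2) {
                    out.addAll(Arrays.asList(parts));
                    continue;
                }
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("call AT&T now"));          // [call, AT&T, now]
        System.out.println(tokenize("widget&javascript&html")); // [widget, javascript, html]
    }
}
```

With this arrangement two-part names such as AT&T survive as a single token, while longer chains are broken into searchable words. Whether "more than 2 parts" is the right threshold is a judgment call that the thread leaves open.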

