So this will recognize anything in double quotes as a single token, and
'_' and '-' will no longer break up words. There may be some repercussions
for the NUM token, since '-' and '_' are also NUM separators (see P below),
but nothing I'd worry about. You may want to use Unicode escapes for '-'
and '_' as well, to match the other LETTER entries. I wouldn't worry about
it myself.
- Mark
TOKEN : { // token patterns
// basic word: a sequence of digits & letters
<ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
| <QUOTED: "\"" (~["\""])+ "\"">
// internal apostrophes: O'Reilly, you're, O'Reilly's
// use a post-filter to remove possessives
| <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
| <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
// company names like AT&T and Excite@Home
| <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
// email addresses
| <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
(("."|"-") <ALPHANUM>)+ >
// hostname
| <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
| <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
| <HAS_DIGIT> <P> <ALPHANUM>
| <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
| <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
| <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
| <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
)
>
| <#P: ("_"|"-"|"/"|"."|",") >
| <#HAS_DIGIT: // at least one digit
(<LETTER>|<DIGIT>)*
<DIGIT>
(<LETTER>|<DIGIT>)*
>
| < #ALPHA: (<LETTER>)+>
| < #LETTER: // unicode letters
[
"\u0041"-"\u005a",
"\u0061"-"\u007a",
"\u00c0"-"\u00d6",
"\u00d8"-"\u00f6",
"\u00f8"-"\u00ff",
"\u0100"-"\u1fff",
"-", "_"
]
>
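
To see roughly what the QUOTED rule and the extended LETTER class do, here is a small plain-Java sketch. It is only a regex approximation, not the tokenizer JavaCC generates from the grammar: the class name TokenSketch and the tokenize method are made up for illustration, and the word pattern covers ASCII only, whereas LETTER above spans many more Unicode ranges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex approximation of two rules from the grammar: QUOTED, and ALPHANUM
// with '-' and '_' counted as letters. Hypothetical helper for illustration.
class TokenSketch {
    // First alternative mirrors QUOTED: a double quote, one or more
    // non-quote characters, a closing double quote.
    // Second alternative mirrors ALPHANUM once '-' and '_' are letters
    // (ASCII only here, which is an assumption of this sketch).
    private static final Pattern TOKEN =
        Pattern.compile("\"[^\"]+\"|[A-Za-z0-9_-]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [find, "exact phrase", in, my_file, well-known]
        System.out.println(tokenize("find \"exact phrase\" in my_file well-known"));
    }
}
```

Note that something like abc-123 also comes out as one word here. As far as I can tell the real grammar behaves the same way: ALPHANUM and NUM would match the same length, and JavaCC picks the rule listed first on a tie, so such input would be typed ALPHANUM rather than NUM. That is the kind of NUM repercussion mentioned above.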