"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:

> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> ISTM that perhaps a more generally useful definition would be
>>> lword               Only ASCII letters
>>> nlword              Entirely letters per iswalpha(), but not lword
>>> word                Entirely alphanumeric per iswalnum(), but not nlword
>>>             (hence, includes at least one digit)
>> ...
>> I am not sure if there are any Western European languages where words can
>> be formed with only non-ASCII chars.
> There are, at least in Swedish: "ö" (island) and "å" (river). They're both a
> bit special because they're just one letter each.

For what it's worth I did the same search last night and found three French
words including "çà" -- which admittedly is likely to be a noise word. Other
dictionaries such as Italian and Irish also have one-letter words like this.
The only other language with multi-letter words is actually Faroese, with "íð" and "óð".

> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?

I'm not very familiar with the use case here. Is there a good reason to want
to abbreviate these names? I think I would expect "ascii", "word", and "token"
for the three categories Tom describes.
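
To make the three categories concrete, here is a rough sketch of how tokens
would fall into them. It is only an illustration, not the parser's code: the
function name classify() is made up, and Python's str.isalpha()/str.isalnum()
stand in for iswalpha()/iswalnum(), which they approximate but do not match
exactly (the Python methods are Unicode-based and ignore the locale).

```python
import string

# Assumption: "ASCII letters" means exactly a-z and A-Z.
ASCII_LETTERS = set(string.ascii_letters)


def classify(token: str) -> str:
    """Return the proposed class name for a token (hypothetical helper)."""
    if token and all(ch in ASCII_LETTERS for ch in token):
        return "lword"   # only ASCII letters
    if token.isalpha():
        return "nlword"  # entirely letters, at least one of them non-ASCII
    if token.isalnum():
        return "word"    # entirely alphanumeric, hence at least one digit
    return "other"       # punctuation, whitespace, empty string, etc.


if __name__ == "__main__":
    for tok in ["island", "ö", "íð", "abc123", "foo-bar"]:
        print(tok, classify(tok))
```

Under these assumptions the Swedish and Faroese examples above all land in
"nlword", since none of them contains an ASCII letter.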

> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

I also wonder about languages like Arabic and Hindi, which do have words, but
I'm not sure they use whitespace as simply as Latin-script languages do.

  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
