> I was thinking of the simplistic scenario, where someone might be
> looking for niño in some file, regardless of what locale they might
> happen to be in. Now I can imagine the nightmare it must be for
> non-English speakers looking for letter combinations irrespective of
> accents.
>
> But, it seems more like a problem with the shorthand than grep, per
> se.

i agree with this.  or it's a historical problem with the character
set.  clearly if you were designing a universal character set with no
compatibility constraints, the alphabet would have nñ together so
[a-z] would match both.

> I could see an argument for [:alpha:] potentially matching n and
> ñ depending on the locale, but [a-z] not matching ñ in any locale. But
> even that, my tendency would be that [:alpha:] match ñ in every
> locale.
>
> But then, does [:alpha:] match ἄγαθος? How ironic that it doesn't match α.

i don't think one can go this route.  you can't have a magic
environment variable that changes everything.  testing is a nightmare
in such a world.  you have to go through every combination of
(data cs, locale) to see if things are working.

a better solution is to use the properties of unicode.  ñ is noted in
the table as

	00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1

field 6 has the base codepoint 006e as its first subfield.  it would
not be hard to build a table quickly mapping a codepoint to its base
codepoint; call this mapping σ.  but it would probably be most useful
to also have a mapping from base codepoints to all composed forms;
call this one ξ.

suppose, for lack of creativity, we use » to mean "match any character
whose base codepoint matches the next item", so »a matches ä, as does
»[a-z].  then » of a letter c can be grepped by taking ξ(σ(c)), which
results in a character class.

plan 9 already has some of this in the c library with tolowerrune,
etc.  i did some work with this some time ago and wrote some rc
scripts to generate the to*rune tables from the unicode standard data.
it would be easy to adapt them to generate ξ and σ.  (the tables would
be pretty big.)

> What an ugly problem.

it can be made ugly quickly.  but i'm not convinced that all
approaches to this problem are bad.

- erik
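
The σ and ξ tables described above can be sketched in a few lines. This is only an illustration, not the rc scripts or the plan 9 library code mentioned in the message: it assumes Python's standard unicodedata module as the source of the decomposition field, and the names sigma, XI, char_class and pattern are made up for the example. Compatibility decompositions (those tagged like <super>) are treated as having no base, which is a choice of this sketch.

```python
import re
import sys
import unicodedata


def sigma(c):
    """Map a character to its base codepoint: the first subfield of the
    canonical decomposition, followed until no decomposition remains."""
    while True:
        fields = unicodedata.decomposition(c).split()
        # Stop on "no decomposition" or a compatibility tag such as <super>.
        if not fields or fields[0].startswith("<"):
            return c
        c = chr(int(fields[0], 16))


def build_xi():
    """Map each base codepoint to the set of composed forms built on it."""
    xi = {}
    for cp in range(sys.maxunicode + 1):
        c = chr(cp)
        b = sigma(c)
        if b != c:
            xi.setdefault(b, set()).add(c)
    return xi


XI = build_xi()


def char_class(c):
    """Regex character class for the base of c plus all its composed forms."""
    b = sigma(c)
    members = {b} | XI.get(b, set())
    return "[" + "".join(re.escape(m) for m in sorted(members)) + "]"


def pattern(s):
    """Apply the hypothetical » operator to every character of s."""
    return "".join(char_class(c) for c in s)


# "ni\u00f1o" is niño with a precomposed n-tilde.
print(bool(re.search(pattern("nino"), "el ni\u00f1o")))  # True
```

The sketch only sees precomposed characters; text stored as a base letter followed by a combining mark (006e 0303) would need to be normalized first, or the pattern would need to allow trailing combining marks.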
