Re: [9fans] simplicity

erik quanstrom Wed, 10 Oct 2007 05:22:51 -0700

> I was thinking of the simplistic scenario, where someone might be
> looking for niño in some file, regardless of what locale they might
> happen to be in.  Now I can imagine the nightmare it must be for
> non-English speakers looking for letter combinations irrespective of
> accents.
> 
> But, it seems more like a problem with the shorthand than grep, per
> se.


i agree with this.  or it's a historical problem with the character set.
clearly if you were designing a universal character set with no compatibility
constraints, the alphabet would have nñ together so [a-z] would
match both.

> I could see an argument for [:alpha:] potentially matching n and
> ñ depending on the locale, but [a-z] not matching ñ in any locale. But
> even that, my tendency would be that [:alpha:] match ñ in every
> locale.
> 
> But then, does [:alpha:] match ἄγαθος?  How ironic that it doesn't match α.

i don't think one can go this route.  you can't have a magic environment
variable that changes everything.  testing is a nightmare in such a world.
you have to go through every combination of (data cs, locale) to see if
things are working.

a better solution is to use the properties of unicode.  ñ is noted in the
table as

00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1

field 6 has the base codepoint 006e as its first subfield.  it would not be hard
to quickly build a table σ mapping a codepoint to its base codepoint.
but it would probably be most useful to also have a mapping ξ from
base codepoints to all composed forms.
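
a minimal sketch of building both tables, in python rather than rc or c.
it assumes input lines in the UnicodeData.txt field layout shown above.
the sample rows, the names build_tables, sigma and xi, and the choice to
ignore compatibility mappings (field 6 starting with a <tag>) are all
assumptions made for illustration, not existing plan 9 code.

```python
from collections import defaultdict

# a few rows in UnicodeData.txt layout (upper case, as in the published file;
# hex parsing below is case-insensitive, so lowercased rows work too).
SAMPLE = """\
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
006E;LATIN SMALL LETTER N;Ll;0;L;;;;;N;;;004E;;004E
00E4;LATIN SMALL LETTER A WITH DIAERESIS;Ll;0;L;0061 0308;;;;N;LATIN SMALL LETTER A DIAERESIS;;00C4;;00C4
00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;LATIN SMALL LETTER N TILDE;;00D1;;00D1
0144;LATIN SMALL LETTER N WITH ACUTE;Ll;0;L;006E 0301;;;;N;LATIN SMALL LETTER N ACUTE;;0143;;0143
"""


def build_tables(lines):
    """Return (sigma, xi).

    sigma maps a codepoint to its base codepoint (itself if it has none).
    xi maps a base codepoint to the set of codepoints sharing that base,
    including the base itself.
    """
    sigma = {}
    xi = defaultdict(set)
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 6:
            continue  # blank or malformed row
        cp = int(fields[0], 16)
        decomp = fields[5]  # "field 6" when counting from 1
        # assumption: rows tagged <compat>, <font>, ... are treated as
        # having no base codepoint.
        if decomp and not decomp.startswith("<"):
            base = int(decomp.split()[0], 16)
        else:
            base = cp
        sigma[cp] = base
        xi[base].add(cp)
    return sigma, xi
```

called as build_tables(SAMPLE.splitlines()), this should map 00f1 to 006e
in sigma and collect 006e, 00f1 and 0144 under 006e in xi.  it follows only
one level of decomposition, so a codepoint whose first subfield is itself a
composed form is filed under that composed form.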

suppose, for lack of creativity, we use » to mean all codepoints whose base
codepoint matches the next item, so »a matches ä, as does »[a-z].
then » applied to a letter c can be grepped by taking ξ(σ(c)), which results
in a character class.
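
a rough sketch of that expansion, again in python.  it rewrites »c and
»[x-y] into ordinary bracket expressions before handing the pattern to a
regular expression engine.  it takes the decomposition data from python's
unicodedata module instead of generated tables; the names base_of, XI,
char_class and expand are hypothetical, only canonical decompositions are
used, and only simple x-y ranges are handled.

```python
import re
import sys
import unicodedata
from collections import defaultdict


def base_of(ch):
    """sigma: first codepoint of the canonical decomposition, else ch."""
    decomp = unicodedata.decomposition(ch)
    if decomp and not decomp.startswith("<"):  # skip <compat> etc.
        return chr(int(decomp.split()[0], 16))
    return ch


# xi: base character -> characters that decompose to it.
# only characters that actually have a base are stored.
XI = defaultdict(set)
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    b = base_of(ch)
    if b != ch:
        XI[b].add(ch)


def char_class(c):
    """xi(sigma(c)): the base of c plus everything sharing that base."""
    b = base_of(c)
    return {b} | XI.get(b, set())


def expand(pattern):
    """Rewrite »c and »[x-y] in pattern as plain bracket expressions."""
    def repl(m):
        if m.group(1):  # the »[x-y] form
            chars = set()
            for cp in range(ord(m.group(1)), ord(m.group(2)) + 1):
                chars |= char_class(chr(cp))
        else:  # the »c form
            chars = char_class(m.group(3))
        return "[" + "".join(re.escape(ch) for ch in sorted(chars)) + "]"

    return re.sub(r"»(?:\[(.)-(.)\]|(.))", repl, pattern)
```

with this, re.fullmatch(expand("ni»no"), s) should succeed for both "nino"
and "ni\u00f1o".  text that spells ñ as n followed by a combining tilde is
not covered; that would need normalizing the input first.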

plan 9 already has some of this in the c library with tolowerrune, etc.
i did some work with this some time ago and wrote some rc scripts to
generate the to*rune tables from the unicode standard data.  it would
be easy to adapt them to generate ξ and σ.  (the tables would be pretty big.)

> 
> What an ugly problem.

it can be made ugly quickly.  but i'm not convinced that all approaches
to this problem are bad.

- erik
