On Jan 23, 2013, at 2:18 AM, jonat...@mugginsoft.com wrote:
> Hmm. Maybe not. I want to keep the generated variable name legible.

You need to nail down the languages you want to deploy to, and then find out 
what their rules for identifiers are. Then you can decide either to generate 
identical lowest-common-denominator names for all of them (which is 
[a-zA-Z_][a-zA-Z_0-9]* in the case of C, i.e. the name may not start with a 
digit), or to adjust which characters to permit based on the target 
programming language.
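That lowest-common-denominator check can be sketched with a regular expression. This is just an illustration (in Python, and the helper name is my own), not production code:

```python
import re

# A C identifier: a letter or underscore, followed by letters, digits,
# or underscores -- so it may not *start* with a digit.
C_IDENTIFIER = re.compile(r'^[a-zA-Z_][a-zA-Z_0-9]*$')

def is_valid_c_identifier(name):
    return bool(C_IDENTIFIER.match(name))
```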

Apart from the character set, you may also have to be aware of length limits, 
etc. Early C compilers, for instance, only considered the first 8 characters 
of an identifier, so "ExceptionalHouse" and "ExceptionalCow" both ended up as 
the same identifier, "Exceptio". I'm hard pressed to think of a language with 
such a limit today, but I don't know what languages you're targeting. Maybe 
one has such a limit.
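The truncation collision above is easy to demonstrate (again just an illustrative sketch, with a made-up helper name):

```python
def truncate_identifier(name, limit=8):
    # Old compilers silently considered only the first `limit` characters.
    return name[:limit]

# Both names collapse to "Exceptio", so they collide.
```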

If you have a case where you can't express a character in a particular 
character set, you have several options:

1) Transcribe it to an equivalent sequence in the target character set. E.g. 
u-umlaut (ü) is usually transcribed as "ue". However, you will then have to 
deal with collisions: what if one user enters the word "Frauen", but another 
makes up a new word "Fraün"? The latter would transcribe to the former, and 
you might get unexpected side effects. You might have to maintain a 
look-up table, and when you find a collision like that, make the name unique 
again, e.g. by naming one "frauen" and the other "frauen2". IIRC there are 
official transcription schemes for many languages, e.g. Romaji for Japanese 
characters.
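The transcribe-then-disambiguate approach might look like this. A minimal sketch, assuming a tiny hypothetical transliteration table (a real one would cover far more characters, or follow an official scheme per language); all the names here are my own:

```python
# Hypothetical, deliberately tiny transliteration table.
TRANSLITERATIONS = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss'}

def transliterate(word):
    return ''.join(TRANSLITERATIONS.get(ch, ch) for ch in word.lower())

def unique_identifier(word, taken):
    """Transliterate, then resolve collisions with a numeric suffix.

    `taken` is the look-up table mapping generated names back to the
    source words that produced them.
    """
    base = transliterate(word)
    if base not in taken:
        taken[base] = word
        return base
    # Collision: this name was already handed out for some source word.
    n = 2
    while f"{base}{n}" in taken:
        n += 1
    name = f"{base}{n}"
    taken[name] = word
    return name
```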

2) Fail and tell the user what the valid characters are, and only let them 
enter valid characters.

3) Transcribe in some other way, e.g. by base64-encoding, or by using a hex 
representation of the given byte sequence, or whatever. This way you could 
keep plain ASCII sequences intact and only rewrite everything else. But even 
then you can get collisions: e.g. if you replace spaces with underscores, 
what happens when another input already contains an underscore in that 
position?
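One way to sidestep that underscore collision is to hex-escape the escape character itself, so that an underscore in the output can only ever come from an escape. A sketch of the idea (the naming scheme `_xNNNN_` is my own invention):

```python
def escape_non_ascii(name):
    # Keep plain ASCII letters and digits; hex-escape everything else --
    # including the underscore itself, so escapes can never collide with
    # literal characters in the input. The mapping is reversible.
    out = []
    for ch in name:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append('_x%04X_' % ord(ch))
    return ''.join(out)
```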

> Is  + (id)letterCharacterSet the best choice here?

According to the docs 
(https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSCharacterSet_Class/Reference/Reference.html), 
"An NSCharacterSet object represents a set of Unicode-compliant characters." 
The +letterCharacterSet documentation says it "Returns a character set 
containing the characters in the categories Letters and Marks." A quick 
Google turns up the Unicode category list at 
http://www.fileformat.info/info/unicode/category/index.htm — the categories 
mentioning "Letter" include Greek, accented, Hiragana and Cyrillic characters, 
among others (most of which are invalid in C identifier names). Oddly, "Marks" 
seem to include some kinds of punctuation. I couldn't find a section that is 
obviously only "letters and marks", or two separate "letters" and "marks" 
sections.

Anyway, I think building your own custom character set from a string 
containing only the characters you *know* are valid in identifiers in your 
target programming language(s) is probably the route of least surprise.
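The same explicit-allow-list idea, sketched in language-neutral form rather than with NSCharacterSet (the function and its `fallback` parameter are my own illustration):

```python
import string

# Explicit allow-list: only the characters we *know* are valid in the
# target language's identifiers (here: C's letters, digits, underscore).
ALLOWED = set(string.ascii_letters + string.digits + '_')

def sanitize(name, fallback='var'):
    kept = ''.join(ch for ch in name if ch in ALLOWED)
    if not kept:
        return fallback          # nothing usable survived
    if kept[0].isdigit():
        kept = '_' + kept        # identifiers may not start with a digit
    return kept
```

Note that simply dropping disallowed characters reintroduces the collision problem from option 1, so in practice you'd combine this with a look-up table as described above.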

Cheers,
-- Uli Kusterer
"The Witnesses of TeachText are everywhere..."
http://hammer-language.com


_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com
