On Jan 23, 2013, at 2:18 AM, jonat...@mugginsoft.com wrote:
> Hmm. Maybe not. I want to keep the generated variable name legible.

You need to nail down the languages you want to deploy to, and then find out 
what their rules for identifiers are. Then you can decide either to generate 
identical lowest-common-denominator names for all of them (which is 
[a-zA-Z_][a-zA-Z_0-9]* in the case of C, i.e. the name may not start with a 
digit), or to adjust which characters to permit based on the target 
programming language.
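That lowest-common-denominator check can be sketched with a regular expression. This is just an illustration (in Python, and the helper name is my own), not production code:

```python
import re

# A C identifier: a letter or underscore, followed by letters, digits,
# or underscores -- so it may not *start* with a digit.
C_IDENTIFIER = re.compile(r'^[a-zA-Z_][a-zA-Z_0-9]*$')

def is_valid_c_identifier(name):
    return bool(C_IDENTIFIER.match(name))
```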

Apart from the character set, you may also have to be aware of length limits, 
etc. Early C compilers, for instance, only considered the first 8 characters 
of an identifier, so "ExceptionalHouse" and "ExceptionalCow" both ended up as 
the same identifier, "Exceptio". I'm hard pressed to think of a language with 
such a limit today, but I don't know what languages you're targeting. Maybe 
one has such a limit.
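The truncation collision above is easy to demonstrate (again just an illustrative sketch, with a made-up helper name):

```python
def truncate_identifier(name, limit=8):
    # Old compilers silently considered only the first `limit` characters.
    return name[:limit]

# Both names collapse to "Exceptio", so they collide.
```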

If you have a case where you can't express a character in a particular 
character set, you have several options:

1) Transcribe it to an equivalent sequence in the target character set. E.g. 
u-umlaut (ü) is usually transcribed as "ue". However, you will then have to 
deal with collisions: what if one user enters the word "Frauen", but another 
makes up a new word "Fraün"? The latter would transcribe to the former, and 
you might get unexpected side effects. You might have to maintain a 
look-up table, and when you find a collision like that, make the name unique 
again, e.g. by naming one "frauen" and the other "frauen2". IIRC there are 
official transcription schemes for many languages, e.g. Romaji for Japanese 
characters.
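The transcribe-then-disambiguate approach might look like this. A minimal sketch, assuming a tiny hypothetical transliteration table (a real one would cover far more characters, or follow an official scheme per language); all the names here are my own:

```python
# Hypothetical, deliberately tiny transliteration table.
TRANSLITERATIONS = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss'}

def transliterate(word):
    return ''.join(TRANSLITERATIONS.get(ch, ch) for ch in word.lower())

def unique_identifier(word, taken):
    """Transliterate, then resolve collisions with a numeric suffix.

    `taken` is the look-up table mapping generated names back to the
    source words that produced them.
    """
    base = transliterate(word)
    if base not in taken:
        taken[base] = word
        return base
    # Collision: this name was already handed out for some source word.
    n = 2
    while f"{base}{n}" in taken:
        n += 1
    name = f"{base}{n}"
    taken[name] = word
    return name
```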

2) Fail and tell the user what the valid characters are, and only let them 
enter valid characters.

3) Transcribe in some other way, e.g. by base64-encoding, or by using a hex 
representation of the given byte sequence, or whatever. This way you could 
keep plain ASCII sequences intact and only rewrite everything else. But even 
then you can get collisions: e.g. if you replace spaces with underscores, 
what happens when another input already contains an underscore in that 
position?
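One way to sidestep that underscore collision is to hex-escape the escape character itself, so that an underscore in the output can only ever come from an escape. A sketch of the idea (the naming scheme `_xNNNN_` is my own invention):

```python
def escape_non_ascii(name):
    # Keep plain ASCII letters and digits; hex-escape everything else --
    # including the underscore itself, so escapes can never collide with
    # literal characters in the input. The mapping is reversible.
    out = []
    for ch in name:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append('_x%04X_' % ord(ch))
    return ''.join(out)
```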

> Is  + (id)letterCharacterSet the best choice here?

According to the docs 
(https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSCharacterSet_Class/Reference/Reference.html), 
"An NSCharacterSet object represents a set of Unicode-compliant characters." 
The +letterCharacterSet documentation says it "Returns a character set 
containing the characters in the categories Letters and Marks." A quick 
Google turns up the Unicode category list at 
http://www.fileformat.info/info/unicode/category/index.htm — the categories 
mentioning "Letter" include Greek, accented, Hiragana and Cyrillic characters, 
among others (most of which are invalid in C identifier names). Oddly, "Marks" 
seem to include some kinds of punctuation. I couldn't find a section that is 
obviously only "letters and marks", or two separate "letters" and "marks" 
sections.

Anyway, I think building your own custom character set from a string 
containing only the characters you *know* are valid in identifiers in your 
target programming language(s) is probably the route of least surprise.
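The same explicit-allow-list idea, sketched in language-neutral form rather than with NSCharacterSet (the function and its `fallback` parameter are my own illustration):

```python
import string

# Explicit allow-list: only the characters we *know* are valid in the
# target language's identifiers (here: C's letters, digits, underscore).
ALLOWED = set(string.ascii_letters + string.digits + '_')

def sanitize(name, fallback='var'):
    kept = ''.join(ch for ch in name if ch in ALLOWED)
    if not kept:
        return fallback          # nothing usable survived
    if kept[0].isdigit():
        kept = '_' + kept        # identifiers may not start with a digit
    return kept
```

Note that simply dropping disallowed characters reintroduces the collision problem from option 1, so in practice you'd combine this with a look-up table as described above.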

Cheers,
-- Uli Kusterer
"The Witnesses of TeachText are everywhere..."
http://hammer-language.com


_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com
