On 23 Jan 2013, at 13:07, Uli Kusterer <[email protected]> wrote:

> On Jan 23, 2013, at 2:18 AM, [email protected] wrote:
>> Hmm. Maybe not. I want to keep the generated variable name legible.
> 
> You need to nail down the languages you want to deploy to, and then find out 
> what their criteria for identifiers are. Then you can decide to either 
> generate identical lowest-common-denominator-names for all of them (which is 
> [a-zA-Z_]([a-zA-Z_0-9]*) in the case of C, i.e. it may not start with a 
> number either), or adjust what characters to permit based on the target 
> programming language.
The target languages are known: http://www.mugginsoft.com/kosmictask/help/languages.
The app uses a plugin architecture, so more may appear.
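
A minimal sketch of the per-language check discussed above, written in Python for brevity rather than in Cocoa. The table IDENTIFIER_PATTERNS and the function is_valid_identifier are hypothetical names; only the C pattern comes from the quoted text, and each plugin would presumably supply its own.

```python
import re

# Hypothetical per-language table. Only the C pattern is taken from the
# thread; patterns for other languages would come from each plugin.
IDENTIFIER_PATTERNS = {
    "c": re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*"),
}

def is_valid_identifier(name, language):
    """Return True if the whole name matches the pattern for language."""
    pattern = IDENTIFIER_PATTERNS[language]
    return pattern.fullmatch(name) is not None
```

With this table, "my_var1" passes for C, while "1abc" and "Fraün" are rejected.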


> 
> Apart from character set, you may also have to be aware of length limits etc. 
> Early C compilers, for instance, only used the first 8 characters of an 
> identifier. So "ExceptionalHouse" and "ExceptionalCow" both ended up as the 
> same identifier, "Exceptio". I'm hard pressed to think of a language with 
> such a limit today, but I don't know what languages you're targeting. Maybe one 
> has such a limit.
The plugin defines the language properties so length constraints can be 
included. 
Some experimentation will determine the limits.
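
As a rough illustration of the length-limit problem, the sketch below (Python, for brevity) groups names by the prefix a compiler would keep. The function name find_truncation_collisions and its significant_length parameter are hypothetical; the limit of 8 is the early-C figure quoted above.

```python
from collections import defaultdict

def find_truncation_collisions(names, significant_length):
    """Group names that become identical once cut to significant_length."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:significant_length]].append(name)
    # Keep only prefixes shared by more than one name.
    return {prefix: members
            for prefix, members in groups.items()
            if len(members) > 1}
```

Called with "ExceptionalHouse" and "ExceptionalCow" and a limit of 8, it reports both under the prefix "Exceptio".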

> 
> If you have a case where you can't express a character in a particular 
> character set, you have several options:
> 
> 1) Transcribe it to an equivalent character set. E.g. U-Umlaut (ü) is usually 
> written as "ue". However, you will then have to deal with collisions. E.g. 
> what if one user enters the word "Frauen", but another makes up a new word 
> "Fraün". The latter would transcribe to the former, and you might get 
> unexpected side effects. You might have to generate a look-up-table, and if 
> you find a collision like that, make the name unique again, e.g. by naming 
> one "frauen" and the other "frauen2". IIRC there are official transcriptions 
> for many languages, e.g. Romaji for Japanese characters.
> 
> 2) Fail and tell the user what the valid characters are, and only let them 
> enter valid characters.
> 
> 3) Transcribe in some other way, e.g. by base64-encoding, or using a 
> hex-representation of the given byte sequence, or whatever. This way you 
> could keep ASCII sentences valid, but modify everything else. But even then 
> you could have collisions. E.g. if you replace spaces with underscores, what 
> if there's a second version with the underscore?
I was intending to decompose U-Umlaut (ü) to u + umlaut and then discard the 
umlaut if possible. Or perhaps an API exists to transliterate the likes of 
U-Umlaut (ü) to ue.
I already have collision detection code that appends integers for uniqueness.
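
A minimal sketch of the decompose-and-discard idea plus integer suffixes for uniqueness, in Python for brevity. The function names strip_combining_marks and make_unique are hypothetical, and this is not the app's existing collision code. On the Cocoa side, CFStringTransform with kCFStringTransformStripCombiningMarks is documented to strip combining marks, though it has not been tried here.

```python
import unicodedata

def strip_combining_marks(text):
    """Decompose to NFD, then drop every character in a Mark category."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))

def make_unique(name, used_names):
    """Return name, or name plus the first free integer suffix from 2 up.

    The chosen name is added to used_names (a set) as a side effect.
    """
    candidate = name
    suffix = 2
    while candidate in used_names:
        candidate = f"{name}{suffix}"
        suffix += 1
    used_names.add(candidate)
    return candidate
```

Note that this maps "Fraün" to "Fraun", not to "Frauen"; producing "ue" needs a language-specific transliteration table, which this sketch does not attempt.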

> 
>> Is + (id)letterCharacterSet the best choice here?
> 
> According to the docs 
> (https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSCharacterSet_Class/Reference/Reference.html):
>  "An NSCharacterSet object represents a set of Unicode-compliant characters." 
> The +letterCharacterSet documentation says "Returns a character set 
> containing the characters in the categories Letters and Marks." So a Google 
> later, here http://www.fileformat.info/info/unicode/category/index.htm
Thanks for the link. I didn't know that the categories were specified by 
Unicode. I had assumed they were arbitrarily defined by Apple.

> the categories mentioning "Letters" include Greek characters, accented 
> characters, hiragana and Cyrillic characters among others (most of which are 
> invalid as C identifier names). Oddly, "marks" seem to include some kind of 
> punctuation. I couldn't find a section that is obviously only "letters and 
> marks" or two separate "letters" and "marks" sections.
I see that. Anyhow, I can have a look at the likes of +nonBaseCharacterSet and 
see how they correlate exactly with the Unicode categories.
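
One way to explore the Unicode categories directly is sketched below in Python. The function in_letters_or_marks is a hypothetical name. It only approximates what the +letterCharacterSet documentation describes (general categories starting with L or M) and has not been checked against Apple's actual set.

```python
import unicodedata

def in_letters_or_marks(ch):
    """True if ch has a Unicode general category of Letter (L*) or Mark (M*)."""
    return unicodedata.category(ch)[0] in ("L", "M")
```

Under this test ü, the combining diaeresis U+0308, hiragana and Cyrillic letters all qualify, while the underscore and ASCII digits do not, so a set built this way is both too wide and too narrow for C identifiers.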
 
> 
> Anyway, I think building your own custom character set from a string 
> including the characters you *know* are valid identifiers in your target 
> programming language(s) is probably the route of least surprise.
> 
Agreed.
I want to get a sensibly wide base and restrict it on a per-language basis.
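
A rough sketch of that base-plus-restriction idea, in Python for brevity. BASE_CHARACTERS, LANGUAGE_PROFILES and sanitize are all hypothetical names, the profile contents are assumptions, and replacing disallowed characters with an underscore is just one possible policy.

```python
import string

# Assumed wide base: ASCII letters, digits and underscore.
BASE_CHARACTERS = frozenset(string.ascii_letters + string.digits + "_")

# Hypothetical per-language profiles; a plugin could supply its own.
LANGUAGE_PROFILES = {
    "c": {
        "allowed": BASE_CHARACTERS,
        "not_first": frozenset(string.digits),  # may not start with a digit
    },
}

def sanitize(name, language):
    """Replace disallowed characters with '_' and guard the first character."""
    profile = LANGUAGE_PROFILES[language]
    cleaned = "".join(ch if ch in profile["allowed"] else "_" for ch in name)
    if not cleaned or cleaned[0] in profile["not_first"]:
        cleaned = "_" + cleaned
    return cleaned
```

Because "my var" and "my_var" both come out as "my_var", the result would still need to go through the collision detection.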

Thanks for such a detailed reply.

Jonathan
_______________________________________________

Cocoa-dev mailing list ([email protected])
