On 23 Jan 2013, at 13:07, Uli Kusterer <[email protected]> wrote:
> On Jan 23, 2013, at 2:18 AM, [email protected] wrote:
>> Hmm. Maybe not. I want to keep the generated variable name legible.
>
> You need to nail down the languages you want to deploy to, and then find out
> what their criteria for identifiers are. Then you can decide to either
> generate identical lowest-common-denominator-names for all of them (which is
> [a-zA-Z_]([a-zA-Z_0-9]*) in the case of C, i.e. it may not start with a
> number either), or adjust what characters to permit based on the target
> programming language.

The target languages are known: http://www.mugginsoft.com/kosmictask/help/languages
The app uses a plugin architecture, so more may appear.

>
> Apart from character set, you may also have to be aware of length limits etc.
> Early C compilers, for instance, only used the first 8 characters of an
> identifier. So "ExceptionalHouse" and "ExceptionalCow" both ended up as the
> same identifier, "Exceptio". I'm hard pressed to think of a language with
> such a limit today, but I don't know what languages you're targeting. Maybe one
> has such a limit.

The plugin defines the language properties, so length constraints can be included. Some experimentation will determine the limits.

>
> If you have a case where you can't express a character in a particular
> character set, you have several options:
>
> 1) Transcribe it to an equivalent character set. E.g. U-Umlaut (ü) is usually
> written as "ue". However, you will then have to deal with collisions. E.g.
> what if one user enters the word "Frauen", but another makes up a new word
> "Fraün"? The latter would transcribe to the former, and you might get
> unexpected side effects. You might have to generate a look-up-table, and if
> you find a collision like that, make the name unique again, e.g. by naming
> one "frauen" and the other "frauen2". IIRC there are official transcriptions
> for many languages, e.g. Romaji for Japanese characters.
>
> 2) Fail and tell the user what the valid characters are, and only let them
> enter valid characters.
>
> 3) Transcribe in some other way, e.g. by base64-encoding, or using a
> hex-representation of the given byte sequence, or whatever. This way you
> could keep ASCII sentences valid, but modify everything else. But even then
> you could have collisions. E.g. if you replace spaces with underscores, what
> if there's a second version with the underscore?

I was intending to decompose U-Umlaut (ü) to u + umlaut and then discard the umlaut if possible. Or perhaps an API exists to transcribe the likes of U-Umlaut (ü) to ue. I already have collision detection code that appends integers for uniqueness.

>
>> Is + (id)letterCharacterSet the best choice here?
>
> According to the docs
> (https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSCharacterSet_Class/Reference/Reference.html):
> "An NSCharacterSet object represents a set of Unicode-compliant characters."
> The +letterCharacterSet documentation says "Returns a character set
> containing the characters in the categories Letters and Marks." So a Google
> later, here http://www.fileformat.info/info/unicode/category/index.htm

Thanks for the link. I didn't know that the categories were specified by Unicode. I had assumed they were arbitrarily defined by Apple.

> the categories mentioning "Letters" include Greek characters, accented
> characters, hiragana and Cyrillic characters among others (most of which are
> invalid as C identifier names). Oddly, "marks" seem to include some kind of
> punctuation. I couldn't find a section that is obviously only "letters and
> marks" or two separate "letters" and "marks" sections.

I see that. Anyhow, I can have a look at the likes of +nonBaseCharacterSet and see how they correlate exactly with the Unicode categories.
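
For what it's worth, a minimal sketch of the decompose-and-discard idea, combined with the integer-suffix collision handling, might look like the following. It is Python rather than Cocoa purely for brevity, and the names to_identifier and make_unique are made up for illustration, not taken from the app. It assumes the lowest-common-denominator C pattern quoted above and relies on the Unicode general categories (anything starting with "M" is a Mark). Note that it discards the umlaut rather than transcribing ü to "ue", and that characters with no ASCII base letter simply become underscores.

```python
import re
import unicodedata

# Lowest-common-denominator pattern quoted above for C identifiers.
C_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def to_identifier(name):
    """Decompose, drop combining marks, and keep only C-safe characters."""
    # NFD splits e.g. "ü" into "u" followed by U+0308 COMBINING DIAERESIS.
    decomposed = unicodedata.normalize("NFD", name)
    # General categories starting with "M" are the Marks (Mn, Mc, Me).
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.category(ch).startswith("M"))
    # Replace anything outside [A-Za-z0-9_] with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", stripped)
    # An identifier may not be empty or start with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def make_unique(candidate, taken):
    """Append an integer suffix until the name is not already in `taken`."""
    unique = candidate
    suffix = 2
    while unique in taken:
        unique = f"{candidate}{suffix}"
        suffix += 1
    taken.add(unique)
    return unique


taken = set()
print(make_unique(to_identifier("Fraün"), taken))  # Fraun
print(make_unique(to_identifier("Fraun"), taken))  # Fraun2
```

On the Cocoa side, -[NSString decomposedStringWithCanonicalMapping] should play the role of the NFD step; how the character-set filtering maps onto NSCharacterSet is something I would still need to check.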
>
> Anyway, I think building your own custom character set from a string
> including the characters you *know* are valid identifiers in your target
> programming language(s) is probably the route of least surprise.
>

Agreed. I want to get a sensibly wide base and restrict it on a per-language basis.

Thanks for such a detailed reply.

Jonathan
_______________________________________________
Cocoa-dev mailing list ([email protected])
