Re: Proposal for declinations in gettext

Danilo Segan Fri, 13 Jun 2003 23:57:01 -0700

Miloslav Trmac wrote:

Hello,

On Fri, Jun 13, 2003 at 10:14:25PM +0200, Danilo Segan wrote:
msgid "king"
msgstr<0> "kralj"
msgstr<3> "kralja"
msgstr<5> "kraljem"

msgid "move %s"
msgstr "premesti %<3>s"
Here <i>, with i = 0 .. (PO-Number-of-noun-forms)-1, is the index of the required form, which depends on the sentence construction. It is determined by the verb, or perhaps by words like "with", "whom", ... Some of the msgstr<i> entries can be omitted if the form is known not to be used in composition (most are highly unlikely ever to be used in translations, like the "vocative" form in "hey %s").
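To make the lookup concrete, here is a minimal C sketch of how one such entry might be held in memory and how form <i> could be picked. Nothing here is gettext API: struct noun_entry, noun_form and NOUN_FORMS are made-up names, 7 is simply the number of cases in Serbian, and the fallback to form 0 (then to the msgid) for an omitted msgstr<i> is an assumption, since the proposal does not say what should happen in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value of the proposed PO-Number-of-noun-forms header field. */
#define NOUN_FORMS 7

/* Hypothetical in-memory form of one msgid with its msgstr<i> entries. */
struct noun_entry {
    const char *msgid;
    const char *forms[NOUN_FORMS];  /* NULL where msgstr<i> was omitted */
};

/*
 * Return form i of the entry. If that form was omitted (or i is out of
 * range), fall back to form 0, and if that is missing too, to the
 * untranslated msgid. The fallback order is an assumption of this sketch.
 */
const char *noun_form(const struct noun_entry *e, size_t i)
{
    assert(e != NULL);
    if (i < NOUN_FORMS && e->forms[i] != NULL)
        return e->forms[i];
    if (e->forms[0] != NULL)
        return e->forms[0];
    return e->msgid;
}

/* The "king" entry from the example: only forms 0, 3 and 5 are given. */
const struct noun_entry king = {
    "king",
    { [0] = "kralj", [3] = "kralja", [5] = "kraljem" }
};
```

With this table, noun_form(&king, 3) yields "kralja", while a request for an omitted form such as noun_form(&king, 1) quietly yields "kralj".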
I suppose the obvious question is: "How do I know which declinations
a word is used in?" (0, 3, 5 in the example above). In order to solve
this, you'd have to somehow mark "move %{move_card}s" and "{move_card}king"
so that they could be matched; msg$SOMETHING would then check for
missing/unneeded declinations. I don't think this is much easier
than using similar tags to explicitly mark the contexts that words are used in
(see below).

Well, the numbers you choose for "declinations" can be arbitrary: the software would not force anything on you. Of course, it's also possible to use some keywords ("move_cards"), but I think that's harder on the implementation (not that all else is easy :-)).

The good side of this approach (the syntactic elements are arbitrary, don't comment on those) is that programs that use gettext for l10n would need no change:

Wrong. A typical gettext usage of the above is in principle something like printf (gettext ("move %s"), gettext ("king")); that is, there is currently no way to correlate the "move %s" with the "king".

With some hacks, it could be made to work transparently for the programmer. The idea would be to use the preprocessor to redefine "printf" and similar functions, and to have gettext("king") return an array of forms if one is available (again, one could use obnoxious hacks for this, like putting a structure pointer behind the \0 byte, or perhaps using a magic number in the string to indicate that it is actually a pointer to that same structure).

A slightly better solution would be to just replace all calls to the *printf functions with *printf(gettext_printf(format, parameters)), but this would still require hacks if we're to maintain some compatibility with programs that use: char *s=gettext("king"); printf(gettext("move %s"), s);
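As a rough sketch of the substitution step that something like gettext_printf would have to perform on the proposed %<i>s syntax, consider the following. The function expand_noun_forms and its signature are hypothetical, not part of gettext; the forms array is assumed to have been looked up already for the noun being inserted, and falling back to form 0 when form i is missing is an assumption as well.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy fmt into out, replacing every "%<i>s" with
 * forms[i]. A missing or out-of-range form falls back to forms[0], which
 * must therefore be non-NULL. Anything that does not match "%<digits>s"
 * is copied through unchanged. Returns 0 on success, -1 if out is too small.
 */
int expand_noun_forms(char *out, size_t outsz, const char *fmt,
                      const char *const *forms, size_t nforms)
{
    size_t used = 0;

    assert(out != NULL && outsz > 0 && fmt != NULL);
    assert(forms != NULL && nforms > 0 && forms[0] != NULL);

    while (*fmt != '\0') {
        const char *piece = fmt;  /* default: copy one literal character */
        size_t len = 1;
        size_t consumed = 1;

        if (fmt[0] == '%' && fmt[1] == '<') {
            const char *p = fmt + 2;
            size_t idx = 0;

            while (isdigit((unsigned char)*p))
                idx = idx * 10 + (size_t)(*p++ - '0');

            /* Accept only "%<digits>s" with at least one digit. */
            if (p > fmt + 2 && p[0] == '>' && p[1] == 's') {
                if (idx >= nforms || forms[idx] == NULL)
                    idx = 0;      /* assumed fallback to the base form */
                piece = forms[idx];
                len = strlen(piece);
                consumed = (size_t)(p + 2 - fmt);
            }
        }

        if (used + len >= outsz)  /* keep room for the terminating NUL */
            return -1;
        memcpy(out + used, piece, len);
        used += len;
        fmt += consumed;
    }
    out[used] = '\0';
    return 0;
}
```

Given the forms of "king" from the earlier example (index 0 "kralj", 3 "kralja", 5 "kraljem"), expanding "premesti %<3>s" produces "premesti kralja". Note that this only covers the easy half of the problem: the caller still has to know which noun's forms to pass in, which is exactly the correlation that plain printf(gettext(...), gettext(...)) loses.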

Of course, I admit that thorough changes would be best in terms of both applications and the interface. I'll forward a message that <jmaiorana at idirect.net> sent to the linux-utf8 mailing list, containing one such proposal, which would hold the context in a single variable.

Before diving into gettext code, it'd be nice to hear if this kind of approach would work for any language other than Serbian (I repeat, I find it likely to work for Slavic languages, and German, those being the languages I'm at least a bit familiar with).

It looks general enough to work for any language (if you define enough declinations), but I'm not sure this is the way to solve this. Doing the declination in my head is just too much work :-)

You don't have to do all the declinations. Translations usually require two or three of the seven available in Serbian (for instance) -- I guess it would be similar for other languages. In cases like that, I could even define "number-of-declinations" to be 3, and use them according to how common they are. The important thing is that there is an opportunity for translators to fix things.

Still, I'm not sure it would work for "any" language: we're still talking in terms of Slavic languages, right? (Czech, Serbian,...) Almost no one else has commented on this with regard to other, dissimilar languages.

The approach seems to easily lend itself to creating a single word-form database; then you'd want a database of which declinations are used in which verb forms, and in a few months gettext might be trying to do universal machine translation. But then again, maybe gettext maintainers want it to do that.

Well, this kind of approach would certainly be helped by a word database, but I don't see it as a requirement.

And just to be clear, I am not involved with gettext maintainers, so don't blame them for any of my brain-dump :-)

What I'd like to see and what I think would go some way towards helping these problems is integrated support for context markers. E.g. in nautilus, we have strings like "[files that are] named [README]" which is much better than just "named". Currently, every program does this differently (nautilus, KDE, gnucash at least).

Unfortunately, this doesn't work quite well. In fact, Nautilus is not the example one should be proud of (in terms of l10n).

There were numerous issues with plural-forms themselves in 2.2.x releases (guess they're fixed in 2.3.x), and the solution used by Nautilus would solve one problem (that of having the correct form for "named"), but would still solve no problem for "[files that are]".

Here's a particular example from the Nautilus translation (I'll use English strings to describe the problem):

#: libnautilus-private/nautilus-search-uri.c:325
msgid "[Items ]modified today"
msgstr "modified today"

The problem here is that in Serbian (I did the Nautilus translation, so I know what I'm talking about :-)), the correct (or at least a much better) form would be "Today modified items", instead of "Items modified today". Or, it could also be "Items that are modified today", which doesn't follow the pattern, and should be composed like some other strings ("[Items that are ]named[ README]"). If a translator translated it as "that are modified today", it might work for this particular example, but the string might be used in inappropriate ways (s)he doesn't know about.

So, printf format strings would be much more appropriate here, because the order could be reversed and manipulated in "free style". The approach with [context] markers instead of format strings might work for many languages, but it wouldn't work for all -- actually, it would be wrong in some. So, I believe this kind of context information belongs in comments-to-translators, which xgettext also extracts without problems.
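To illustrate the reordering point, here is a small sketch using numbered argument specifiers (%1$s, %2$s). These are a POSIX extension to printf rather than ISO C, so this assumes a libc that supports them, as glibc does. compose_phrase is a made-up helper, and the English format strings merely stand in for real translations, as in the example above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: build a phrase from two string arguments using a
 * (translated) format. Because the format may use numbered specifiers
 * such as "%2$s ... %1$s", the translator controls the word order while
 * the calling code stays the same.
 * Returns 0 on success, -1 on a formatting error or truncation.
 */
int compose_phrase(char *out, size_t outsz, const char *fmt,
                   const char *what, const char *when)
{
    int n;

    assert(out != NULL && fmt != NULL);
    n = snprintf(out, outsz, fmt, what, when);
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

Called with "items" and "today", the format "%1$s modified %2$s" gives "items modified today", while a translation that prefers the other order can supply "%2$s modified %1$s" and get "today modified items" from the very same call.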

What my approach is meant to solve is this: once context information is available (whether a translator ran the program in question and discovered how some strings are composed "incorrectly", or the programmer provided that kind of information on composition), the translator has the possibility to make it work for his or her own language. So, you provide declinations only when you know they're needed.

Cheers,
Danilo

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
