Thanks for the feedback, Ken. Please see my comments inline.
Ken Krugler wrote:
Hi Brian,
Sorry for not doing a quick, full review. Some issues I thought of
while quickly reading over the Wiki page:
1. Lucene has support (tokenizers, stemmers, etc) for various
languages, but you'd need to be able to include these (as needed), and
also "know" which language is being processed to decide which
language-specific plugins to apply.
Any specific under-the-hood Lucene questions would need to be answered
by Andi Vajda, who is the owner of PyLucene.
2. Related is the issue of using ICU to do searches inside of text,
versus indexed queries. I thought that was something you were going to
support in Chandler, right? Like I've got an email open, and I search
on some word.
If you're doing this, then you want language-specific, folded (e.g.
case insensitive) searching. ICU supports this, but it would require
additional work I think, similar to Lucene.
I believe that as long as the attributes have indexText=True, PyLucene
will handle this case without a problem. I have sent a mail to Andi to
confirm my assumption.
3. So along these lines, how do you "pick" the language, if it's not
specified? Sometimes you know the language from meta info (like on web
pages), but otherwise it seems like you'll probably just want to use
the user's OS language setting. There are other approaches that try to
detect the language, similar to charset detection, but that typically
isn't warranted for a general-purpose app like Chandler. Anyway I
think this should be called out as a design decision.
Yes, the locale set will come from the operating system. Although the
spec already mentioned this briefly, I have added an explicit section
detailing how the locale set is determined. Thanks for the suggestion.
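To make the idea concrete, here is a minimal sketch of deriving a locale set from the environment. It is not the Chandler I18nManager code: the function name os_locale_set and the final "en" default are assumptions made for illustration, and on Mac OS X or Windows the OS setting would have to come from a platform API rather than from environment variables.

```python
import os

def os_locale_set(environ=None):
    """Return an ordered list of locale names, most specific first.

    Hypothetical helper: reads the POSIX environment variables in the
    same order Python's gettext module checks them.
    """
    environ = os.environ if environ is None else environ
    locales = []
    for var in ("LANGUAGE", "LC_ALL", "LC_MESSAGES", "LANG"):
        value = environ.get(var)
        if not value:
            continue
        # LANGUAGE may hold a colon-separated list; the others hold one name.
        for name in value.split(":"):
            # Drop an encoding suffix (".UTF-8") and a modifier ("@euro").
            name = name.split(".")[0].split("@")[0]
            if not name or name in ("C", "POSIX"):
                continue
            # Add the full name, then the bare language as a fallback.
            for candidate in (name, name.split("_")[0]):
                if candidate not in locales:
                    locales.append(candidate)
        if locales:
            break
    # Assumption: fall back to English when nothing usable is set.
    return locales or ["en"]
```

With LANG set to fr_CA.UTF-8 and nothing else, this yields fr_CA followed by fr, which is the kind of ordered set a fallback mechanism can walk.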
4. To ensure smooth interoperability with ICU, I assume that
Chandler's Python will always be built using UTF-16, not UTF-32,
right? Otherwise it seems like you won't be able to leverage direct
copying of data between Python and ICU strings.
In the SWIG code for PyICU, Andi checks the Python unicode object's type
(UCS-2 or UCS-4) when converting to and from ICU UnicodeStrings.
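For anyone who wants to see the difference from the Python side, here is a small sketch. It is not the PyICU conversion code; the two helper names are made up for illustration. sys.maxunicode is the standard way to tell a narrow build from a wide one (Python 3.3 and later always report the wide value), and encoding to UTF-16 shows how many 16-bit units ICU would need for the same text.

```python
import sys

def is_narrow_build():
    # Narrow (UCS-2) builds report 0xFFFF; wide (UCS-4) builds report 0x10FFFF.
    return sys.maxunicode == 0xFFFF

def utf16_code_units(text):
    """Number of 16-bit code units needed to hold text as UTF-16."""
    # Each UTF-16 code unit is two bytes; no BOM is written for "utf-16-le".
    return len(text.encode("utf-16-le")) // 2
```

A character outside the Basic Multilingual Plane, such as U+1D11E, needs two UTF-16 code units, so on a wide build the Python length and the ICU length of the same string can differ.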
5. We'd talked about how big ICU code/data can be, and the need to
support installations of different language sets. Was that covered?
Yes, it is big. I added a note to the spec that the ICU size can be
reduced significantly by removing locale data files, such as Hebrew
and Arabic, which will not be supported in the Chandler 1.0 release.
6. I think somebody commented about the problems that can be caused by
translators messing up strings. You'd responded w/info about the ICU
message format. We'd talked about being able to do a consistency
check, comparing English to language X and validating that the
abstract structure of the message (number/type of parameters) hadn't
changed. Might be worth mentioning.
I added the consistency checker to the spec.
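As a rough sketch of what such a check could look like (this is not the checker described in the spec; the function names and the regular expression are assumptions), the following compares the argument indexes and types used in an English message with those in its translation. It only understands simple ICU-style arguments such as {0} or {1,number,integer}; nested choice formats and quoted braces are ignored.

```python
import re

# Matches simple ICU-style arguments such as {0} or {1,number,integer}.
# Nested formats and apostrophe-quoted braces are not handled.
_ARG = re.compile(r"\{(\d+)(?:,\s*(\w+))?[^{}]*\}")

def message_signature(message):
    """Return the set of (index, type) pairs used by a message."""
    return {(int(i), t or "string") for i, t in _ARG.findall(message)}

def check_translation(source, translated):
    """Return a list of problems; an empty list means the structures match."""
    src = message_signature(source)
    dst = message_signature(translated)
    problems = []
    for arg in sorted(src - dst):
        problems.append("missing or changed in translation: %r" % (arg,))
    for arg in sorted(dst - src):
        problems.append("unexpected in translation: %r" % (arg,))
    return problems
```

Reordering arguments in the translation passes the check, since only the set of arguments is compared, while dropping {1} or turning {0,number} into {0} is reported.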
7. For doing a programmatic localization, you mentioned "Potential
tests are double the size of the LocalizableString text or insert in
each LocalizableString translation a non-8bit surrogate character
pair". I'm not sure what you mean by a non-8bit surrogate character pair.
Some tests you can do are:
a. Replace vowels with vowel + umlaut (Motley Crüe localization).
Other substitutions are possible as well (C -> Ç, etc)
b. Replace ASCII with full-width ASCII
equivalents, so "help" becomes "ｈｅｌｐ", which also tests expanding
the width of text.
I updated the .6 spec to be clearer. When I wrote "non-8-bit surrogate
character pair", what I really meant was a combining character
sequence, where a single displayable glyph is represented by two or
more Unicode code points, such as the ü in your Motley Crüe example
above, which can be encoded as u + a combining umlaut.
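Both substitutions are easy to script. The sketch below is only an illustration with made-up function names, not anything from the spec. Note that it transforms every character it is given, so format arguments such as {0} or %s would have to be protected before running it over real message strings.

```python
# Vowels (and c/C) mapped to precomposed accented equivalents.
_UMLAUTS = str.maketrans("aeiouAEIOUcC", "äëïöüÄËÏÖÜçÇ")

def umlaut_pseudo(text):
    """Replace vowels with umlauted vowels and c/C with c-cedilla."""
    return text.translate(_UMLAUTS)

def fullwidth_pseudo(text):
    """Replace printable ASCII with its full-width equivalent."""
    out = []
    for ch in text:
        if ch == " ":
            # The space maps to the ideographic space U+3000.
            out.append("\u3000")
        elif "!" <= ch <= "~":
            # U+0021..U+007E map to U+FF01..U+FF5E at a fixed offset.
            out.append(chr(ord(ch) + 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)
```

The first variant keeps the text readable while forcing non-ASCII through every code path; the second roughly doubles the rendered width, which helps expose layouts that clip or truncate.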
8. Do you mention the issue of making gettext use native OS fallback
settings?
The OS locale set will be determined by the Chandler I18nManager. The
gettext API has built-in fallback support. Passing it a locale set array
is all gettext needs to perform the correct fallback behavior.
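For reference, this is roughly what that looks like with Python's standard gettext module. The wrapper function, the "chandler" domain name, and the "locale" directory are placeholders for illustration, not the actual Chandler values.

```python
import gettext

def load_translations(locale_set, domain="chandler", localedir="locale"):
    """Load catalogs for an ordered locale set (hypothetical wrapper).

    gettext looks for a catalog for each language in order; the first one
    found becomes the primary catalog and later ones are chained as
    fallbacks. With fallback=True, a NullTranslations object (which
    returns the original strings) is used when no catalog exists at all.
    """
    return gettext.translation(domain, localedir=localedir,
                               languages=locale_set, fallback=True)
```

Calling load_translations(["fr_CA", "fr", "en"]).gettext("Hello") returns the fr_CA string if that catalog has one, then tries fr and en, and finally returns "Hello" unchanged.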
Related to this might be noting that using .po files might preclude
some Mac OS X localization customization by end users, since the
file/structure won't match what's standard for Mac apps.
Added a footnote to the spec addressing this point.
Anyway, it's 9pm so I'm off to put my daughter to bed. Hope this helps...
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
--
Brian Kirsch - Email Framework Engineer
Open Source Applications Foundation
543 Howard St. 5th Floor
San Francisco, CA 94105
(415) 946-3056
http://www.osafoundation.org
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "Dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/dev