Thanks for the feedback, Ken. Please see my comments inline.

Ken Krugler wrote:

Hi Brian,

Sorry for not doing a quick, full review. Some issues I thought of while quickly reading over the Wiki page:

1. Lucene has support (tokenizers, stemmers, etc) for various languages, but you'd need to be able to include these (as needed), and also "know" which language is being processed to decide which language-specific plugins to apply.

Any specific under-the-hood Lucene questions would need to be answered by Andi Vajda, who is the owner of PyLucene.



2. Related is the issue of using ICU to do searches inside of text, versus indexed queries. I thought that was something you were going to support in Chandler, right? Like I've got an email open, and I search on some word.

If you're doing this, then you want language-specific, folded (e.g. case insensitive) searching. ICU supports this, but it would require additional work I think, similar to Lucene.


I believe that as long as the attributes have indexText=True, PyLucene will handle this case without a problem. I have sent mail to Andi to confirm my assumption.


3. So along these lines, how do you "pick" the language, if it's not specified? Sometimes you know the language from meta info (like on web pages), but otherwise it seems like you'll probably just want to use the user's OS language setting. There are other approaches that try to detect the language, similar to charset detection, but that typically isn't warranted for a general-purpose app like Chandler. Anyway I think this should be called out as a design decision.

Yes, the locale set will come from the operating system. Although this is already mentioned briefly in the spec, I have added an explicit section detailing how the locale set is determined. Thanks for the suggestion.


4. To ensure smooth interoperability with ICU, I assume that Chandler's Python will always be built using UTF-16, not UTF-32, right? Otherwise it seems like you won't be able to leverage direct copying of data between Python and ICU strings.

In the SWIG code for PyICU, Andi checks the Python unicode object's type (UCS-2 or UCS-4) when converting to and from ICU UnicodeStrings.
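
To make the UCS-2 versus UCS-4 distinction concrete, here is a minimal Python sketch. It is only an illustration, not PyICU's actual conversion code, and the names IS_WIDE_BUILD and utf16_code_units are made up for this example. Note that on Python 3.3 and later the narrow/wide build distinction no longer exists and sys.maxunicode is always 0x10FFFF.

```python
import sys

# Older narrow (UCS-2) builds reported 0xFFFF here; wide (UCS-4) builds
# and every Python 3.3+ interpreter report 0x10FFFF.
IS_WIDE_BUILD = sys.maxunicode > 0xFFFF


def utf16_code_units(text):
    """Return how many 16-bit code units the text needs when stored as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# A BMP character needs one code unit; a character above U+FFFF needs a
# surrogate pair, so it cannot be copied one-to-one from UCS-4 storage.
bmp_units = utf16_code_units("A")
astral_units = utf16_code_units("\U0001D11E")  # MUSICAL SYMBOL G CLEF
```

The second count is the case where a straight copy between UCS-4 storage and a UTF-16 ICU string would not line up unit for unit.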



5. We'd talked about how big ICU code/data can be, and the need to support installations of different language sets. Was that covered?

Yes, it is big. I added a note to the spec that the ICU size can be reduced significantly by removing locale data files, such as those for Hebrew and Arabic, which will not be supported in the Chandler 1.0 release.


6. I think somebody commented about the problems that can be caused by translators messing up strings. You'd responded w/info about the ICU message format. We'd talked about being able to do a consistency check, comparing English to language X and validating that the abstract structure of the message (number/type of parameters) hadn't changed. Might be worth mentioning.

I added the consistency checker to the spec.
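
As a rough illustration of the kind of check discussed above, the sketch below compares the arguments of an English message with those of its translation. It is not the checker from the spec. The function names are hypothetical, and it assumes simple ICU MessageFormat arguments such as {0} or {1,number}; nested plural/select arguments and apostrophe quoting are not handled.

```python
import re

# Matches simple arguments like {0}, {1,number} or {when, date, short}.
# Assumption: no nested braces and no apostrophe-quoted text.
_ARG = re.compile(r"\{\s*(\w+)\s*(?:,\s*(\w+)\s*)?(?:,[^{}]*)?\}")


def message_signature(pattern):
    """Return the set of (argument, type) pairs used by a message pattern."""
    return {(name, kind or "string") for name, kind in _ARG.findall(pattern)}


def check_translation(english, translated):
    """Return (missing, unexpected) argument sets for a translated message."""
    expected = message_signature(english)
    actual = message_signature(translated)
    return expected - actual, actual - expected


english = "{0} sent you {1,number} messages"
good = "{1,number} Nachrichten von {0}"                # reordered, still valid
bad = "{0} hat Ihnen {2,number} Nachrichten gesendet"  # wrong argument index
```

Comparing sets rather than sequences lets a translator reorder arguments freely, while a changed index or a changed type still shows up as a mismatch.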


7. For doing a programmatic localization, you mentioned "Potential tests are double the size of the LocalizableString text or insert in each LocalizableString translation a non-8bit surrogate character pair". I'm not sure what you mean by a non-8bit surrogate character pair.

Some tests you can do are:

a. Replace vowels with vowel + umlaut (Motley Crüe localization). Other substitutions are possible as well (C -> Ç, etc)

b. Replace ASCII with full-width ASCII equivalents, so "help" becomes "ｈｅｌｐ", which also tests expanding the width of text.


Updated the .6 spec to be clearer. When I said "non-8bit surrogate character pair," what I really meant was a Unicode surrogate character pair, where a single displayable glyph is represented by two or more Unicode code points, such as your example above of Motley Crüe, where the ü is a u + an umlaut.
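
The two substitutions in (a) and (b) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names accent and full_width are hypothetical, and the vowel table is deliberately small.

```python
# Illustrative vowel table only; other substitutions (C -> Ç, etc.) could be added.
ACCENTED = str.maketrans("aeiouAEIOU", "äëïöüÄËÏÖÜ")


def accent(text):
    """Replace plain vowels with their umlaut forms."""
    return text.translate(ACCENTED)


def full_width(text):
    """Replace printable ASCII with its full-width equivalents."""
    out = []
    for ch in text:
        if ch == " ":
            out.append("\u3000")  # IDEOGRAPHIC SPACE
        elif "!" <= ch <= "~":
            # U+0021..U+007E map to U+FF01..U+FF5E at a fixed offset.
            out.append(chr(ord(ch) + 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


pseudo_accented = accent("help")
pseudo_wide = full_width("help")
```

Both helpers emit single code points inside the Basic Multilingual Plane. A test that needs to exercise surrogate pairs would have to inject characters above U+FFFF instead.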

8. Do you mention the issue of making gettext use native OS fallback settings?


The OS locale set will be determined by the Chandler I18nManager. The gettext API has built-in fallback support. Passing it a locale set array is all gettext needs to perform the correct fallback behavior.

Related to this might be noting that using .po files might preclude some Mac OS X localization customization by end users, since the file/structure won't match what's standard for Mac apps.


Added a footnote to the spec addressing this point.


Anyway, it's 9pm so I'm off to put my daughter to bed. Hope this helps...

-- Ken

--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



--
Brian Kirsch - Email Framework Engineer
Open Source Applications Foundation
543 Howard St. 5th Floor
San Francisco, CA 94105
(415) 946-3056
http://www.osafoundation.org

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Open Source Applications Foundation "Dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/dev
