On Friday 01 December 2006 16:46, Jason Tackaberry wrote:
> On Fri, 2006-12-01 at 16:36 +0100, Duncan Webb wrote:
> > First, when a string is a Unicode string does this mean that every
> > character is 2 or 4 bytes wide?
>
> Not necessarily. Depends on the encoding. This isn't the case for
> latin1 and UTF8.
But Duncan asked about Unicode strings; how can those be latin1 or utf-8?
AFAIK, u"Hans" will always be a 16-bit string. Let's see... interesting:
http://docs.python.org/api/unicodeObjects.html says:
Python's default builds use a 16-bit type for Py_UNICODE and store Unicode
values internally as UCS2. It is also possible to build a UCS4 version of
Python (most recent Linux distributions come with UCS4 builds of Python).
These builds then use a 32-bit type for Py_UNICODE and store Unicode data
internally as UCS4. On platforms where wchar_t is available and compatible
with the chosen Python Unicode build variant, Py_UNICODE is a typedef alias
for wchar_t to enhance native platform compatibility. On all other platforms,
Py_UNICODE is a typedef alias for either unsigned short (UCS2) or unsigned
long (UCS4).
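A quick way to check which variant a given interpreter reports is
sys.maxunicode. This is only a sketch; the helper name unicode_build() is
made up for illustration and is not part of Python or kaa:

```python
import sys

def unicode_build():
    """Label the interpreter's code point range (hypothetical helper)."""
    # 0xFFFF indicates a narrow (UCS2) build, 0x10FFFF a wide (UCS4) one.
    # Python 3.3 and later always report 0x10FFFF, whatever the build.
    if sys.maxunicode == 0xFFFF:
        return "narrow (UCS2)"
    return "wide (UCS4 range)"

print(unicode_build())
```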
You can encode Unicode strings down to 8-bit strings with the "encode"
method: u"Hans".encode("utf-8") turns it into a plain 8-bit string again,
but that string has no property which says that it's in UTF-8 encoding, which
is why using Unicode objects where possible is best for i18n.
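To illustrate why the missing encoding label matters, here is a small
sketch. The sample text u"H\xe4ns" ("Hans" with an a-umlaut) is my own
choice, since a pure-ASCII string like u"Hans" would encode to the same
bytes in both codecs:

```python
text = u"H\xe4ns"  # non-ASCII sample, written with an escape

utf8_bytes = text.encode("utf-8")      # 5 bytes: the umlaut takes two
latin1_bytes = text.encode("latin-1")  # 4 bytes: the umlaut takes one

# Same text, different bytes, and neither result records its encoding.
assert utf8_bytes != latin1_bytes

# Decoding with the wrong codec silently produces garbage ...
wrong = utf8_bytes.decode("latin-1")
assert wrong != text

# ... or fails outright, depending on the byte values involved.
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed
```

Only the code that produced the bytes knows which codec to use for the way
back, so the encoding has to be tracked by hand.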
> > Second, file names from a fat system seem to be in latin1 but on the
> > ext2/3 are in utf8. How can they be processed in a safe way without
> > causing UnicodeErrors?
>
> Firstly, the encoding type is not always utf8 on ext3. The filesystem
> encoding can be gotten via sys.getfilesystemencoding(), but that doesn't
> mean a filename isn't encoded latin1 anyway. Consequently, you must
> never use unicode for storing filenames, and always keep them as str
> objects.
I agree with this part of your answer, but...
> For purposes of displaying a filename you can then convert to unicode
> for proper display. kaa.strutils.str_to_unicode attempts to do the
> right thing when you don't know whether a string is encoded latin1 or
> utf8. (kaa.strutils is in kaa.base, you can just copy that function
> into the 1.x tree if you need it.)
...I find this misleading, since with a properly set up system, you *should*
know which encoding the filename has. I am used to Qt, which has
QFile.encodeName() and QFile.decodeName() for proper 8bit<->Unicode
conversions. What a pity that Python lacks such useful functions.
But as I see now, str_to_unicode properly tries the user's locale first.
I wonder if there should be an additional filename_to_unicode function which
uses sys.getfilesystemencoding() instead of strutils.ENCODING?
import sys

def path_to_unicode(s):
    """
    Attempts to convert a local filesystem path to a unicode string.
    First it tries to decode the string based on
    sys.getfilesystemencoding().  If that fails, it uses
    str_to_unicode() as a fallback (which in turn tries the
    locale's preferred encoding, UTF-8, and latin-1 in order).
    """
    # Anything that is not a byte string (e.g. already unicode) passes through.
    if not isinstance(s, str):
        return s
    try:
        return s.decode(sys.getfilesystemencoding())
    except UnicodeDecodeError:
        pass
    # str_to_unicode() is the existing helper in kaa.strutils.
    return str_to_unicode(s)
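
For trying the idea outside of kaa, here is a self-contained sketch of the
same fallback chain. It is an assumption-laden stand-in, not the kaa code:
decode_path_sketch() is a made-up name, and the loop only approximates what
str_to_unicode() does, based on the description in the docstring above.

```python
import locale
import sys

def decode_path_sketch(raw):
    """Hypothetical, self-contained variant of path_to_unicode()."""
    # Only byte strings need decoding; everything else passes through.
    if not isinstance(raw, bytes):
        return raw
    candidates = [
        sys.getfilesystemencoding(),    # what the system claims to use
        locale.getpreferredencoding(),  # the user's locale
        "utf-8",
    ]
    for encoding in candidates:
        if not encoding:
            continue  # older Pythons may return None here
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1")

print(repr(decode_path_sketch(b"/tmp/music")))
```

What a non-ASCII name decodes to still depends on the system and locale
settings, so the result is only suitable for display; the original byte
string should be kept for actually opening the file.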
--
Ciao, / / .o.
/--/ ..o
/ / ANS ooo
