On 03/22/2011 03:06 AM, David Malcolm wrote:
[CCing Panu Matilainen, the maintainer of rpm, or, at least rpm 4.*,
which is what all major distributions are using AIUI]
On Mon, 2011-03-21 at 10:50 +0100, "Martin v. Löwis" wrote:
On 21.03.2011 07:37, Prashant Kumar wrote:
Hello,
My name is Prashant Kumar. I've worked on porting a few Python
libraries (distutils2, configobj), and I've been looking at the GSoC
ideas list for a project related to porting.
I came across [1] and found it interesting. It mentions that some
Hi Prashant! Thanks for the interest.
Panu: [1] is http://wiki.python.org/moin/RPMOnPython3 , a Google Summer
of Code proposal to work on the Python 3 bindings to RPM.
of the work has already been done; I would like to look at the code
repository. Could someone provide me with the link?
Not so much the code but the person who did the porting. This was Dave
Malcolm (CC'ed); please get in touch with him. Please familiarize
yourself with the existing Python bindings (in the latest RPM 4 release
from rpm.org). You'll notice that this already has Python 3 support;
not sure whether that's the most recent code, though.
Panu Matilainen also worked on the python 3 port of the librpm python
bindings.
For the rpm source code, see: http://rpm.org/wiki/GetSource (the python
bindings are in a subdirectory of the main source tree).
My initial patchbomb landed on the mailing list here:
http://lists.rpm.org/pipermail/rpm-maint/2009-October/002528.html
and Panu committed and fixed up the patches around then.
My understanding is that the current status is that the bindings work,
but all values that were formerly exposed to Python 2 as "str" are now
exposed to Python 3 as "bytes", which would require changing all
consumers of the code.
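
To illustrate why consumers would need changing, here is a minimal sketch in plain Python 3. It does not use the rpm bindings at all: `name` is a hypothetical stand-in for a header value that the bindings would now return as "bytes".

```python
# Hypothetical stand-in for a tag value handed back by the bindings as bytes.
name = b"bash"

# Code written for Python 2 that compares against a str literal
# silently stops matching: bytes and str never compare equal in Python 3.
matches_str = (name == "bash")
matches_bytes = (name == b"bash")

# A consumer has to decode explicitly (ASCII is assumed here for illustration).
decoded = name.decode("ascii")
```

The comparison does not raise by default, so such mismatches can go unnoticed unless every call site is audited.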
That's more or less where it stands.
I believe Panu has also been working on a rewrite of the Python
bindings, since the existing code is a little messy.
Panu, am I remembering this correctly?
The python binding rewrite was abandoned (it just didn't work out for
various reasons) and usable bits merged into the existing bindings.
So yes you're correct - there /was/ such a thing but any new work should
go to the bindings that exist in the main rpm source tree.
The idea is that these types are fundamentally string-like, but
unfortunately rpm has always been a bit loose in its interpretation of
the encoding of byte values in package files and package databases.
There are millions of rpm files out there, and millions of rpm
databases, and all of these are in _some_ encoding. I have seen
specfiles in which parts of the file were encoded in UTF-8 and other
parts were encoded in Latin-1 (this broke one of my python scripts
horribly).
More precisely, it's not being "a bit loose" about encoding: rpm simply
doesn't know diddly about encodings and does not make any assumptions or
interpretations about them. A string in rpm is just a sequence of
arbitrary non-zero byte values terminated with \0.
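
As a rough illustration of what "a sequence of non-zero bytes terminated with \0" means for a Python 3 consumer, the sketch below pulls two NUL-terminated strings out of a buffer. The helper `c_string` and the buffer contents are hypothetical and are not part of rpm or its bindings.

```python
def c_string(buf: bytes, offset: int = 0) -> bytes:
    """Return the bytes from offset up to (not including) the next NUL."""
    end = buf.index(b"\0", offset)
    return buf[offset:end]

# Hypothetical buffer: one UTF-8 string followed by one Latin-1 string.
buf = "café".encode("utf-8") + b"\0" + "café".encode("latin-1") + b"\0"

first = c_string(buf)
second = c_string(buf, len(first) + 1)

# The first string happens to be valid UTF-8 ...
first_text = first.decode("utf-8")

# ... but nothing guarantees that for the second one.
try:
    second.decode("utf-8")
    second_is_utf8 = True
except UnicodeDecodeError:
    second_is_utf8 = False
```

Both strings are equally valid as far as the byte-level definition goes; only the strict decode distinguishes them.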
Martin and I discussed this last week at PyCon. I believe the proposal
that we came up with was:
- try to interpret bytes as UTF-8, using the "surrogateescape"
mechanism, so that if it fails, we can at least preserve the exact bytes
and round-trip
Right, based on a quick skim of the surrogateescape PEP, that seems like
a reasonable approach (rpm is much like the traditional POSIX interfaces
which simply do not deal with encoding at all)
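
The round-trip property of the "surrogateescape" error handler (PEP 383) can be sketched as follows. This is plain Python 3 with made-up byte values, not code from the bindings.

```python
# Latin-1 encoded bytes: not valid UTF-8.
raw = b"caf\xe9"

# The undecodable byte 0xE9 is mapped to the lone surrogate U+DCE9
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# Valid UTF-8 input is decoded normally and is unaffected by the handler.
clean = "café".encode("utf-8").decode("utf-8", errors="surrogateescape")
```

Note that `text` contains a lone surrogate, so encoding it with the default strict handler would raise UnicodeEncodeError. Consumers that print or store such strings would need to be aware of that.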
Ultimately, this does mean trying to impose some kind of encoding
standard on rpm files and rpm databases, which I think would be a Good
Thing, but is perhaps something of scope creep compared to what the
proposal at [1] says. See e.g. http://rpm.org/ticket/30
Note that any rpm-forced encoding standard could only affect new
packages, but rpm and the bindings need to be able to deal with all the
junk out in the wild pretty much forever.
Other ideas that occur:
- does rpmlint check for encoding yet?
IIRC rpmlint can (depending on config probably) check the encoding of
the paths and of the spec itself. However, this still doesn't guarantee
that all the string data in the header is UTF-8, as practically any
part(s) of the data can come from macros, which are not encoding-aware
either.
- what to do e.g. about canonicalization? What happens if one rpm
provides a feature named "café" (where the "é" is U+00E9) and another rpm
requires a feature named "café" (where the "é" is U+0065 LATIN SMALL
LETTER E + U+0301 COMBINING ACUTE ACCENT)? IIRC we ruled that rpms in
Fedora had to have ASCII names, and I'm guessing this applies to
metadata, but we do allow UTF-8 filenames within package payloads
(again, IIRC)
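
The mismatch described in the canonicalization question can be reproduced with the standard `unicodedata` module. This sketch only shows the comparison problem on two Python strings; it says nothing about how rpm itself compares dependency names.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as a single code point, U+00E9
decomposed = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two spellings render identically but differ both as code points
# and as UTF-8 bytes.
same_text = (composed == decomposed)
same_bytes = (composed.encode("utf-8") == decomposed.encode("utf-8"))

# Normalizing both sides to one form (NFC here) makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```

A byte-wise comparison, which is all an encoding-unaware tool can do, would treat the "provides" and the "requires" as different names.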
Ouch. Did I already mention that UTF and the encoding business makes my
head hurt? I guess I didn't, can't think straight because by now I have
that headache...
Anyway, pretty much all rules in this area are distro specific, as rpm
doesn't enforce anything wrt encoding.
The bindings cannot go changing header contents to their liking, so any
canonicalization would have to go into rpm proper, the build-side of
things to be exact so the runtime doesn't have to care. Requiring rpm to
fiddle with encodings + canonicalization for every single string it
processes at runtime would require enormous changes throughout rpm, and
presumably at a massive performance cost too.
- Panu -
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev