Re: [Python-Dev] [GSoC] Porting on RPM3

Panu Matilainen Tue, 22 Mar 2011 05:21:57 -0700

On 03/22/2011 03:06 AM, David Malcolm wrote:

[CCing Panu Matilainen, the maintainer of rpm, or, at least rpm 4.*,
which is what all major distributions are using AIUI]


On Mon, 2011-03-21 at 10:50 +0100, "Martin v. Löwis" wrote:

Am 21.03.2011 07:37, schrieb Prashant Kumar:

Hello,
     My name is Prashant Kumar and I've worked on porting a few Python
libraries (distutils2, configobj) and I've been looking at the ideas
list for GSoC for a project related to porting.

     I came across [1]  and found it interesting. It mentions that some


Hi Prashant!  Thanks for the interest.

Panu: [1] is http://wiki.python.org/moin/RPMOnPython3 , a Google Summer
of Code proposal to work on the Python 3 bindings to RPM.

of the work has already been done; I would like to look at the code
repository for the same, could someone provide me the link for the
same?

Not so much the code but the person who did the porting. This was Dave
Malcolm (CC'ed); please get in touch with him. Please familiarize
yourself with the existing Python bindings (in the latest RPM 4 release
from rpm.org). You'll notice that this already has Python 3 support;
not sure whether that's the most recent code, though.


Panu Matilainen also worked on the python 3 port of the librpm python
bindings.

For the rpm source code, see: http://rpm.org/wiki/GetSource  (the python
bindings are in a subdirectory of the main source tree).

My initial patchbomb landed on the mailing list here:
   http://lists.rpm.org/pipermail/rpm-maint/2009-October/002528.html
and Panu committed and fixed up the patches around then.

My understanding is that the current status is that the bindings work,
but all values that were formerly exposed to Python 2 as "str" are now
exposed to Python 3 as "bytes", which would require changing all
consumers of the code.


That's more or less where it stands.

I believe Panu has also been working on a rewrite of the Python
bindings, since the existing code is a little messy.
Panu, am I remembering this correctly?

The python binding rewrite was abandoned (it just didn't work out for
various reasons) and usable bits merged into the existing bindings.
So yes you're correct - there /was/ such a thing but any new work should
go to the bindings that exist in the main rpm source tree.

The idea is that these types are fundamentally string-like, but
unfortunately rpm has always been a bit loose in its interpretation of
the encoding of byte values in package files and package databases.
There are millions of rpm files out there, and millions of rpm
databases, and all of these are in _some_ encoding.  I have seen
specfiles in which parts of the file were encoded in UTF-8 and other
parts were encoded in Latin-1 (this broke one of my python scripts
horribly).

More precisely, it's not being "a bit loose" about encoding, rpm simply
doesn't know diddly about encodings and does not make any assumptions or
interpretations about them. A string in rpm is just a sequence of
arbitrary non-zero byte values terminated with \0.
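
As a rough illustration of why that matters on the Python 3 side (this is
not rpm code; the sample bytes and the try_decode helper are made up for
the example), a value that mixes UTF-8 and Latin-1 cannot be decoded
cleanly by any single codec:

```python
# Hypothetical header value: "café" in UTF-8, then "café" in Latin-1.
# Like any rpm string, it is just non-zero bytes with no encoding attached.
raw = "café".encode("utf-8") + b" / " + "café".encode("latin-1")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Strict UTF-8 rejects the value: the trailing 0xE9 byte is not valid UTF-8.
as_utf8 = try_decode(raw, "utf-8")

# Latin-1 accepts any byte sequence, but turns the UTF-8 half into mojibake.
as_latin1 = try_decode(raw, "latin-1")
```

Neither result is usable as-is, which is the situation the bindings face
with data already out in the wild.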

Martin and I discussed this last week at PyCon.  I believe the proposal
that we came up with was:
   - try to interpret bytes as UTF-8, using the "surrogateescape"
mechanism, so that if it fails, we can at least preserve the exact bytes
and round-trip

Right, based on a quick skim of the surrogateescape PEP, that seems like
a reasonable approach (rpm is much like the traditional POSIX interfaces
which simply do not deal with encoding at all)
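
A minimal sketch of that proposal in plain Python 3, assuming the bindings
would do the equivalent at the C level (header_to_str and str_to_header are
hypothetical names, not part of the rpm bindings):

```python
def header_to_str(raw):
    """Decode header bytes as UTF-8, smuggling invalid bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def str_to_header(text):
    """Inverse of header_to_str: restores the exact original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Made-up value: valid UTF-8 "café" followed by a Latin-1 encoded "café".
raw = b"caf\xc3\xa9 / caf\xe9"

text = header_to_str(raw)          # the stray 0xE9 byte becomes U+DCE9
round_tripped = str_to_header(text)

# A strict encode refuses the escaped surrogate, so consumers that write
# the string elsewhere still have to opt in to surrogateescape.
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

Well-formed UTF-8 comes out as ordinary str, and anything else survives the
round trip byte for byte.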

Ultimately, this does mean trying to impose some kind of encoding
standard on rpm files and rpm databases, which I think would be a Good
Thing, but is perhaps something of scope creep compared to what the
proposal at [1] says.  See e.g. http://rpm.org/ticket/30

Note that any rpm-forced encoding standard could only affect new
packages, but rpm and the bindings need to be able to deal with all the
junk out in the wild pretty much forever.


Other ideas that occur:
   - does rpmlint check for encoding yet?

IIRC rpmlint can (depending on config probably) check for encoding of
the paths and the spec itself. However this still doesn't guarantee all
the string-data in header to be utf, as practically any part(s) of the
data can come from macros, which are not encoding-aware either.

   - what to do e.g. about canonicalization?  What happens if one rpm
provides a feature named "café" (where the "é" is U+00E9) and another rpm
requires a feature named "café" (where the "é" is U+0065 LATIN SMALL
LETTER E + U+0301 COMBINING ACUTE ACCENT)?  IIRC we ruled that rpms in
Fedora had to have ASCII names, and I'm guessing this applies to
metadata, but we do allow UTF-8 filenames within package payloads
(again, IIRC)
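
To make the canonicalization question concrete, here is a small sketch
using Python's standard unicodedata module (the variable names are only
for illustration; nothing here reflects what rpm actually does when it
compares dependency names):

```python
import unicodedata

provided = "caf\u00e9"    # "é" as the single code point U+00E9
required = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render identically but differ as code points, and so do
# their UTF-8 encodings, so a byte-wise comparison sees two different names.
same_text = provided == required
same_bytes = provided.encode("utf-8") == required.encode("utf-8")

# Normalizing both sides to one form (NFC here) makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", provided)
    == unicodedata.normalize("NFC", required)
)
```

Here same_text and same_bytes are False while same_after_nfc is True, so
without some normalization step the "requires" would not match the
"provides".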

Ouch. Did I already mention that UTF and the encoding business makes my
head hurt? I guess I didn't, can't think straight because by now I have
that headache...

Anyway, pretty much all rules in this area are distro specific, as rpm
doesn't enforce anything wrt encoding.

The bindings cannot go changing header contents to their liking, so any
canonicalization would have to go into rpm proper, the build-side of
things to be exact so the runtime doesn't have to care. Requiring rpm to
fiddle with encodings + canonicalization for every single string it
processes at runtime would require enormous changes throughout rpm, and
presumably at a massive performance cost too.


        - Panu -

