Hi Nir,

I think latin1. How do you think we should handle latin1 errors then? Replace on latin1 or replace on utf-8?
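
One point that bears on the question above: Python's latin1 codec maps each of the 256 byte values to a code point, so a strict latin1 decode never raises, and an errors="replace" step placed after it would never run. A minimal sketch to check this (the names `all_bytes` and `latin1_is_lossless` are made up for illustration and are not part of the patch):

```python
# Check that latin1 decoding is total: every byte value maps to one code point.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin1")  # strict mode, yet this never raises


def latin1_is_lossless(raw: bytes) -> bool:
    """Return True if decoding and re-encoding as latin1 gives back raw."""
    return raw.decode("latin1").encode("latin1") == raw


print(len(decoded))                   # number of code points produced
print(latin1_is_lossless(all_bytes))  # round-trip check over all byte values
```

So the practical choice is between "utf-8 with replace" (invalid bytes become U+FFFD) and "latin1 fallback" (invalid bytes become some latin1 character, which may or may not be the intended one).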

    def decode_guest_string(b):
        # Try strict decoding with each codec in order.
        for codec in ("utf8", "latin1"):
            try:
                return b.decode(codec)
            except UnicodeDecodeError:
                pass
        # Last resort: utf8 with the "replace" error handler.
        return b.decode("utf8", errors="replace")

(Pseudocode; this will be implemented in C.)

On Thu, Apr 23, 2020, 21:34 Nir Soffer <nsof...@redhat.com> wrote:

> On Mon, Apr 20, 2020 at 3:38 PM Sam Eiderman <sam...@google.com> wrote:
> >
> > The python3 bindings create unicode objects from application strings
> > on the guest (i.e. installed rpm, deb packages).
> > It is documented that rpm package fields such as description should be
> > utf8 encoded - however in some cases they are not a valid unicode
> > string,
>
> So what are they? latin1 maybe?
>
> Maybe use:
>
>     try:
>         value.decode("utf-8")
>     except UnicodeDecodeError:
>         value.decode("latin1")
>
> This will always succeed, producing possibly garbage output but so is
> errors='replace'.
>
> > on SLES11 SP4 the following packages fail to be converted to
> > unicode using guestfs_int_py_fromstring() (which invokes
> > PyUnicode_FromString()):
> >
> > PackageKit
> > aaa_base
> > coreutils
> > dejavu
> > desktop-data-SLED
> > gnome-utils
> > hunspell
> > hunspell-32bit
> > hunspell-tools
> > libblocxx6
> > libexif
> > libgphoto2
> > libgtksourceview-2_0-0
> > libmpfr1
> > libopensc2
> > libopensc2-32bit
> > liborc-0_4-0
> > libpackagekit-glib10
> > libpixman-1-0
> > libpixman-1-0-32bit
> > libpoppler-glib4
> > libpoppler5
> > libsensors3
> > libtelepathy-glib0
> > m4
> > opensc
> > opensc-32bit
> > permissions
> > pinentry
> > poppler-tools
> > python-gtksourceview
> > splashy
> > syslog-ng
> > tar
> > tightvnc
> > xorg-x11
> > xorg-x11-xauth
> > yast2-mouse
> >
> > Fix this by globally changing guestfs_int_py_fromstring()
> > and guestfs_int_py_fromstringsize() to decode utf-8 with the "replace"
> > error handler:
> >
> > https://docs.python.org/3/library/codecs.html#error-handlers
> >
> > For example, this will decode PackageKit's description on SLES4 the
> > following way:
> >
> > Backend: pisi
> > S.�ağlar Onur <cag...@pardus.org.tr>
>
> What is the original text?
>
> Nir
>
> > Signed-off-by: Sam Eiderman <sam...@google.com>
> > ---
> >  python/handle.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/python/handle.c b/python/handle.c
> > index 2fb8c18f0..427424707 100644
> > --- a/python/handle.c
> > +++ b/python/handle.c
> > @@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
> >  #if PY_MAJOR_VERSION < 3
> >    return PyString_FromString (str);
> >  #else
> > -  return PyUnicode_FromString (str);
> > +  return PyUnicode_Decode(str, strlen(str), "utf-8", "replace");
> >  #endif
> >  }
> >
> > @@ -397,7 +397,7 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
> >  #if PY_MAJOR_VERSION < 3
> >    return PyString_FromStringAndSize (str, size);
> >  #else
> > -  return PyUnicode_FromStringAndSize (str, size);
> > +  return PyUnicode_Decode(str, size, "utf-8", "replace");
> >  #endif
> >  }
> >
> > --
> > 2.26.1.301.g55bc3eb7cb9-goog
> >
> >
> > _______________________________________________
> > Libguestfs mailing list
> > Libguestfs@redhat.com
> > https://www.redhat.com/mailman/listinfo/libguestfs
> >
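
To make the effect of the quoted patch concrete, here is a rough Python-level sketch of the behaviour before and after the change. It assumes that bytes.decode("utf-8") behaves like the strict decoding done by PyUnicode_FromString(), and that bytes.decode("utf-8", errors="replace") behaves like PyUnicode_Decode(..., "utf-8", "replace"). The sample bytes are hypothetical and are not taken from any of the SLES packages listed above.

```python
def decode_strict(raw: bytes) -> str:
    """Strict UTF-8: raises UnicodeDecodeError on invalid input."""
    return raw.decode("utf-8")


def decode_replace(raw: bytes) -> str:
    """UTF-8 with the "replace" handler: invalid bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical sample: "café" encoded as latin1, which is not valid UTF-8.
sample = b"caf\xe9"

try:
    decode_strict(sample)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = decode_replace(sample)     # the invalid byte is turned into U+FFFD
as_latin1 = sample.decode("latin1")   # the invalid byte is read as U+00E9

print(strict_failed, ascii(replaced), ascii(as_latin1))
```

With "replace" the reader can see that a byte was undecodable; with a latin1 fallback the output looks clean but is only correct if the source really was latin1.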