Re: [Bug-gnupod] Encoding of non-ascii characters in GNUtunesDB.xml

H. Langos Mon, 14 Apr 2008 16:27:03 -0700

Patch of the patch ... performance is better but could still be improved,
I guess.


Instead of making 4 UTF-8 conversions and 3 substring operations on each
character, we are down to one ord() and one substr() per character. Still
bad, but way better than before.
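
For comparison, here is a minimal sketch of the same escaping done in a
single pass with core Perl only. It is not what the patch does: it assumes
the input is a string of UTF-8 octets, it uses Encode::decode instead of
Unicode::String, and the name ascii_xml_escape is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of the patch): replace every character
# outside the ASCII range with a decimal XML character reference.
sub ascii_xml_escape {
    my ($octets) = @_;
    # Assumption: the caller passes UTF-8 encoded octets.
    my $chars = decode('UTF-8', $octets);
    # One regex pass; ord() is called only for non-ASCII characters.
    $chars =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $chars;
}

# U+015B (Latin Small Letter s with Acute) is 0xC5 0x9B in UTF-8.
print ascii_xml_escape("\xC5\x9B"), "\n";   # prints &#347;
```

Whether this is actually faster than the Unicode::String loop would need
to be measured on a real GNUtunesDB.xml.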

-henrik

PS: Anybody interested in getting complete usable files instead of
patches?

On Mon, Apr 14, 2008 at 08:12:30PM +0200, H. Langos wrote:
> 
> Ok, here's the patch ...
> 
> Took longer than I thought because UTF8 in perl is a major pain.
> 
> cheers
> -henrik
> 
> PS: The line "$xutf =~ tr/\000-\037//d;" is not without problems. It
> will reduce all control characters to nothing, including TAB, LF,
> and CR, even though they are valid XML characters.
> 
> Could somebody check out how iTunes handles those? Does it also remove 
> those characters or does it convert them into &#9; and so on?
> 
> 
> On Mon, Apr 14, 2008 at 02:14:18PM +0200, H. Langos wrote:
> > Hi there,
> > 
> > I wonder if anybody else has the occasional problem with editing her/his
> > GNUtunesDB.xml.
> > 
> > Since it is XML and the encoding is UTF-8, you don't have any problem as
> > long as your system is completely UTF-8 compliant. I, however, have a
> > mixed iso-8859-1, iso-8859-15 and UTF-8 mess, and some of the editors
> > that I like to use are not very smart about handling the character
> > encoding.
> > 
> > It would be very easy to convert everything outside the ascii range to
> > the XML escaped version. So say, instead of some garbage you'd see
> > "&#347;" where a "Latin Small Letter s with Acute" is.
> > 
> > Pro: GNUtunesDB.xml becomes a pure ascii file. No more editor/viewer 
> >   issues.
> > 
> > Contra: The GNUtunesDB.xml becomes slightly bigger and for people with a
> >   clean UTF-8 toolchain it becomes a little less readable. (Note: You can
> >   still edit the file and insert native UTF-8 as you please.)
> > 
> > Any thoughts?
> > 
> > cheers
> > -henrik
> > 
> > 
> > 
> > _______________________________________________
> > Bug-gnupod mailing list
> > [email protected]
> > http://lists.nongnu.org/mailman/listinfo/bug-gnupod

> commit 5ce6a9e9173dce95287ff4b15deda67b569dd365
> Author: Heinrich Langos <[EMAIL PROTECTED]>
> Date:   Mon Apr 14 19:49:54 2008 +0200
> 
>     Changed encoding of unicode characters outside of ascii range to XML notation.
>     
>     This change will make your GNUtunesDB.xml into a pure ascii file. Making it
>     easier to view and manipulate on non-utf8 capable systems.
>     
>     Note: "xescaped()" is not only called for attribute values but also for
>     element names and attribute names. So if somebody comes up with non-ascii
>     element names or attribute names we would have to treat those differently.
> 
> diff --git a/src/ext/XMLhelper.pm b/src/ext/XMLhelper.pm
> index 5eaeb48..2a230a3 100755
> --- a/src/ext/XMLhelper.pm
> +++ b/src/ext/XMLhelper.pm
> @@ -124,8 +124,15 @@ sub xescaped {
>       my $xutf = Unicode::String::utf8($ret)->utf8;
>       #Remove 0x00 - 0x1f chars (we don't need them)
>       $xutf =~ tr/\000-\037//d;
> -     
> -     return $xutf;
> +     my $out = Unicode::String::utf8("")->utf8;
> +     for (my $i = 0 ; $i < Unicode::String::utf8($xutf)->length ; $i++) {
> +             if (Unicode::String::utf8($xutf)->substr($i,1)->ord > 127) {
> +                     $out .= '&#' . Unicode::String::utf8($xutf)->substr($i,1)->ord . ';';
> +             } else {
> +                     $out .= Unicode::String::utf8($xutf)->substr($i,1) ;
> +             }
> +     }
> +     return $out;
>  }
>  
>  

commit 1ace27099b20dfc5bb08fe764b30c9c2276729a9
Author: Heinrich Langos <[EMAIL PROTECTED]>
Date:   Tue Apr 15 01:15:21 2008 +0200

    Improved performance of utf8 to ascii encoding.

diff --git a/src/ext/XMLhelper.pm b/src/ext/XMLhelper.pm
index 2a230a3..b1ab134 100755
--- a/src/ext/XMLhelper.pm
+++ b/src/ext/XMLhelper.pm
@@ -124,12 +124,14 @@ sub xescaped {
 	my $xutf = Unicode::String::utf8($ret)->utf8;
 	#Remove 0x00 - 0x1f chars (we don't need them)
 	$xutf =~ tr/\000-\037//d;
-	my $out = Unicode::String::utf8("")->utf8;
-	for (my $i = 0 ; $i < Unicode::String::utf8($xutf)->length ; $i++) {
-		if (Unicode::String::utf8($xutf)->substr($i,1)->ord > 127) {
-			$out .= '&#' . Unicode::String::utf8($xutf)->substr($i,1)->ord . ';';
+	my $u16 = Unicode::String::utf8($xutf);
+	my $out = ""; #pure ascii
+	for (my $i = 0 ; $i < Unicode::String::length($u16); $i++) {
+		my $ccode = Unicode::String::substr($u16,$i,1)->ord;
+		if ($ccode > 127) {
+			$out .= '&#' . $ccode . ';';
 		} else {
-			$out .= Unicode::String::utf8($xutf)->substr($i,1) ;
+			$out .= chr($ccode) ;
 		}
 	}
 	return $out;
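
On the control-character question from the quoted PS: a minimal sketch of
a tr/// variant that still deletes the other 0x00 - 0x1f characters but
keeps TAB, LF and CR, which XML 1.0 allows. This is an assumption about
what might be wanted, not something either patch does, and the helper name
strip_invalid_controls is made up for illustration. Whether the kept
characters should then be written literally or as &#9; and so on is still
the open question.

```perl
use strict;
use warnings;

# Hypothetical variant of the tr/// line in xescaped(): delete the
# control characters 0x00 - 0x1f except TAB (\011), LF (\012), CR (\015).
sub strip_invalid_controls {
    my ($s) = @_;
    $s =~ tr/\000-\010\013\014\016-\037//d;
    return $s;
}

# The 0x01 byte is dropped; the TAB and the LF survive.
my $cleaned = strip_invalid_controls("a\tb\x01c\n");
print $cleaned eq "a\tbc\n" ? "ok\n" : "not ok\n";
```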

