Patch of the patch ... performance is better but could still be improved,
I guess.
Instead of making 4 utf8 conversions and 3 substring operations on each
character we are down to one ord() and one substr() per character. Still
bad but way better than before.
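
For comparison, the per-character loop could in principle be replaced by a
single substitution in core Perl. The following is only a minimal sketch, not
the code from the patch: the name ascii_escape is hypothetical, it uses the
core Encode module instead of Unicode::String, and it assumes the input is a
valid UTF-8 byte string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of XMLhelper.pm: takes UTF-8 encoded bytes
# and returns a pure-ASCII string with numeric character references.
sub ascii_escape {
    my ($bytes) = @_;
    # Turn the byte string into a Perl character string (one decode per call).
    my $chars = decode('UTF-8', $bytes);
    # Replace every character above 127 with &#NNN; in a single pass.
    $chars =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $chars;
}

# "\xC5\x9B" is the UTF-8 encoding of U+015B (Latin Small Letter s with Acute).
print ascii_escape("caf\xC5\x9B"), "\n";
```

Whether this is actually faster than the ord()/substr() loop on real
GNUtunesDB.xml data would have to be measured.
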
-henrik
PS: Anybody interested in getting complete usable files instead of
patches?
On Mon, Apr 14, 2008 at 08:12:30PM +0200, H. Langos wrote:
>
> Ok, here's the patch ...
>
> Took longer than I thought because UTF8 in perl is a major pain.
>
> cheers
> -henrik
>
> PS: The line "$xutf =~ tr/\000-\037//d;" is not without problems. It
> deletes all control characters, including TAB, LF, and CR, even though
> they are valid XML characters.
>
> Could somebody check out how iTunes handles those? Does it also remove
> those characters or does it convert them into &#9; and so on?
>
>
> On Mon, Apr 14, 2008 at 02:14:18PM +0200, H. Langos wrote:
> > Hi there,
> >
> > I wonder if anybody else has the occasional problem with editing her/his
> > GNUtunesDB.xml.
> >
> > Since it is XML and the encoding is UTF-8 you don't have any problem as
> > long as your system is completely UTF-8 compliant. I however have a
> > mixed iso-8859-1, iso-8859-15 and UTF-8 mess and some of the editors
> > that I like to use are not very smart about handling the character
> > encoding.
> >
> > It would be very easy to convert everything outside the ASCII range to
> > the XML escaped version. So say, instead of some garbage you'd see
> > "&#347;" where a "Latin Small Letter s with Acute" is.
> >
> > Pro: GNUtunesDB.xml becomes a pure ascii file. No more editor/viewer
> > issues.
> >
> > Contra: The GNUtunesDB.xml becomes slightly bigger and for people with a
> > clean UTF-8 toolchain it becomes a little less readable. (Note: You can
> > still edit the file and insert native UTF-8 as you please.)
> >
> > Any thoughts?
> >
> > cheers
> > -henrik
> >
> >
> >
> > _______________________________________________
> > Bug-gnupod mailing list
> > [email protected]
> > http://lists.nongnu.org/mailman/listinfo/bug-gnupod
> commit 5ce6a9e9173dce95287ff4b15deda67b569dd365
> Author: Heinrich Langos <[EMAIL PROTECTED]>
> Date: Mon Apr 14 19:49:54 2008 +0200
>
> Changed encoding of unicode characters outside of ascii range to XML
> notation.
>
> This change will make your GNUtunesDB.xml a pure ascii file, making it
> easier to view and manipulate on non-utf8 capable systems.
>
> Note: "xescaped()" is not only called for attribute values but also for
> element names and attribute names. So if somebody comes up with non-ascii
> element names or attribute names we would have to treat those differently.
>
> diff --git a/src/ext/XMLhelper.pm b/src/ext/XMLhelper.pm
> index 5eaeb48..2a230a3 100755
> --- a/src/ext/XMLhelper.pm
> +++ b/src/ext/XMLhelper.pm
> @@ -124,8 +124,15 @@ sub xescaped {
> my $xutf = Unicode::String::utf8($ret)->utf8;
> #Remove 0x00 - 0x1f chars (we don't need them)
> $xutf =~ tr/\000-\037//d;
> -
> - return $xutf;
> + my $out = Unicode::String::utf8("")->utf8;
> + for (my $i = 0 ; $i < Unicode::String::utf8($xutf)->length ; $i++) {
> + if (Unicode::String::utf8($xutf)->substr($i,1)->ord > 127) {
> + $out .= '&#' . Unicode::String::utf8($xutf)->substr($i,1)->ord . ';';
> + } else {
> + $out .= Unicode::String::utf8($xutf)->substr($i,1) ;
> + }
> + }
> + return $out;
> }
>
>
commit 1ace27099b20dfc5bb08fe764b30c9c2276729a9
Author: Heinrich Langos <[EMAIL PROTECTED]>
Date: Tue Apr 15 01:15:21 2008 +0200
Improved performance of utf8 to ascii encoding.
diff --git a/src/ext/XMLhelper.pm b/src/ext/XMLhelper.pm
index 2a230a3..b1ab134 100755
--- a/src/ext/XMLhelper.pm
+++ b/src/ext/XMLhelper.pm
@@ -124,12 +124,14 @@ sub xescaped {
my $xutf = Unicode::String::utf8($ret)->utf8;
#Remove 0x00 - 0x1f chars (we don't need them)
$xutf =~ tr/\000-\037//d;
- my $out = Unicode::String::utf8("")->utf8;
- for (my $i = 0 ; $i < Unicode::String::utf8($xutf)->length ; $i++) {
- if (Unicode::String::utf8($xutf)->substr($i,1)->ord > 127) {
- $out .= '&#' . Unicode::String::utf8($xutf)->substr($i,1)->ord . ';';
+ my $u16 = Unicode::String::utf8($xutf);
+ my $out = ""; #pure ascii
+ for (my $i = 0 ; $i < Unicode::String::length($u16); $i++) {
+ my $ccode = Unicode::String::substr($u16,$i,1)->ord;
+ if ($ccode > 127) {
+ $out .= '&#' . $ccode . ';';
} else {
- $out .= Unicode::String::utf8($xutf)->substr($i,1) ;
+ $out .= chr($ccode) ;
}
}
return $out;
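
Regarding the earlier PS about "tr/\000-\037//d" also deleting TAB, LF, and
CR: a narrower character range would keep those three. The sketch below is an
assumption about one possible fix, not something from the patch. Both helper
names are hypothetical, and how iTunes treats these characters is still
unverified.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the C0 control characters that XML 1.0 does
# not allow, but keep TAB (0x09), LF (0x0A) and CR (0x0D).
sub strip_invalid_controls {
    my ($s) = @_;
    $s =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $s;
}

# Hypothetical helper: write TAB, LF and CR as numeric character references
# so they survive attribute-value normalization in an XML parser.
sub escape_whitespace_controls {
    my ($s) = @_;
    $s =~ s/([\t\n\r])/'&#' . ord($1) . ';'/ge;
    return $s;
}

print escape_whitespace_controls(strip_invalid_controls("a\x01b\tc")), "\n";
```

The second helper only matters for attribute values, where an XML parser
would otherwise turn a literal TAB, LF, or CR into a space.
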