Bug#378411: Buffer overflow in XML::Parser::Expat triggered by utf8

2006-09-05 Thread Steinar H. Gunderson
On Mon, Aug 07, 2006 at 10:53:38AM +0200, Joris van Rantwijk wrote:
 PS. (and slightly off-topic) My personal opinion is that Perl has
 utterly messed up Unicode handling. The documentation uses the terms
 Unicode and UTF8 as if they were interchangable. In fact, and as we
 see with this bug, there is a very important conceptual difference
 between a string containing N raw utf8 bytes and a string containing
 M logical Unicode characters.

This isn't relevant for the bug report, but I think you got it wrong -- Perl
_does_ distinguish between them. In Perl, a scalar is either binary (in which
case it contains raw bytes), or it is text. In the latter case, it is
logically Unicode, but can be stored internally in iso8859-1 or UTF-8 as Perl
sees fit (and is converted transparently between the two). The fact that XS
modules also have to care about this is, of course, another matter :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#378411: Buffer overflow in XML::Parser::Expat triggered by utf8

2006-08-07 Thread Joris van Rantwijk
On Sat, 2006-08-05 at 14:12 -0400, Joey Hess wrote:
 Would just calling Encode::decode_utf8 on the input string in Expat.pm
 be the simplest fix?

I'm not sure, but I think not.
First of all, in the case I reported, the parser reads directly from an
input stream. The data is then not touched by Expat.pm, but handled
internally in Expat.xs.

It seems to me that the reported overflow can not be triggered in the
case where a string (as opposed to a stream) is passed to XML::Parser.
It also seems to me that XML parsing on a string will proceed correctly
regardless of whether the string is logical Unicode or raw UTF8, since
both kinds of strings are essentially the same at the level of Perl
internals.

Secondly, the cause of the reported stream parsing problem is not that
Expat does not handle UTF8 data; it handles that fine. The problem is
that it *expects* raw UTF8 bytes but, in my case, gets logical Unicode
characters instead. It breaks on that because of an invalid assumption
in the buffer management code.

I dived into Expat.xs again and believe I have a simple fix that stops
the buffer management from overflowing the heap. Due to Perl's identical
internal treatment of utf8 and Unicode, this should be all that is
necessary to enable correct parsing of Unicode streams.

My patch is attached.
Basic testing suggests that it works as intended. But I have very little
experience with Perl XS coding, so I would recommend that somebody
reviews this before it is applied anywhere.

Thanks for pushing this forward a bit; we should get it fixed.

Joris.

PS. (and slightly off-topic) My personal opinion is that Perl has
utterly messed up Unicode handling. The documentation uses the terms
Unicode and UTF8 as if they were interchangable. In fact, and as we
see with this bug, there is a very important conceptual difference
between a string containing N raw utf8 bytes and a string containing
M logical Unicode characters.

--- XML-Parser-2.34/Expat/Expat.xs.orig	2003-07-28 16:41:10.0 +0200
+++ XML-Parser-2.34/Expat/Expat.xs	2006-08-07 10:37:40.0 +0200
@@ -289,11 +289,10 @@
   SV *		tbuff;
   SV *		tsiz;
   char *	linebuff;
   STRLEN	lblen;
   STRLEN	br = 0;
-  int		buffsize;
   int		done = 0;
   int		ret = 1;
   char *	msg = NULL;
   CallbackVector * cbv;
   char		*buff = (char *) 0;
@@ -334,37 +333,31 @@
 	   strnEQ(++chk, cbv-delim + 1, cbv-delimlen - 1))
 	lblen -= cbv-delimlen + 1;
 }
 
 PUTBACK ;
-buffsize = lblen;
 done = lblen == 0;
   }
   else {
 tbuff = newSV(0);
 tsiz = newSViv(BUFSIZE);
-buffsize = BUFSIZE;
   }
 
   while (! done)
 {
-  char *buffer = XML_GetBuffer(parser, buffsize);
-
-  if (! buffer)
-	croak(Ran out of memory for input buffer);
+  char *buffer, *tb;
 
   SAVETMPS;
 
   if (cbv-delim) {
-	Copy(linebuff, buffer, lblen, char);
+	tb = linebuff;
 	br = lblen;
 	done = 1;
   }
   else {
 	int cnt;
 	SV * rdres;
-	char * tb;
 
 	PUSHMARK(SP);
 	EXTEND(SP, 3);
 	PUSHs(ioref);
 	PUSHs(tbuff);
@@ -382,18 +375,26 @@
 
 	if (! SvOK(rdres))
 	  croak(read error);
 
 	tb = SvPV(tbuff, br);
-	if (br  0)
-	  Copy(tb, buffer, br, char);
-	else
+	/* br == number of bytes read from stream
+	   Note that it is possible that br  BUFSIZE if the input stream
+	   is decoding a non-ASCII source. */
+	if (br = 0)
 	  done = 1;
 
 	PUTBACK ;
   }
 
+  buffer = XML_GetBuffer(parser, br);
+  if (! buffer)
+	croak(Ran out of memory for input buffer);
+
+  if (br  0)
+Copy(tb, buffer, br, char);
+
   ret = XML_ParseBuffer(parser, br, done);
 
   SPAGAIN; /* resync local SP in case callbacks changed global stack */
 
   if (! ret)


Bug#378411: Buffer overflow in XML::Parser::Expat triggered by utf8

2006-08-05 Thread Joey Hess
I can't speak for the XML::Parser maintainers, but I would like to see a
patch for this, like you proposed creating. My application only missed
sending utf8 data in by luck.

Would just calling Encode::decode_utf8 on the input string in Expat.pm
be the simplest fix?

-- 
see shy jo


signature.asc
Description: Digital signature


Bug#378411: Buffer overflow in XML::Parser::Expat triggered by utf8

2006-07-16 Thread Joris van Rantwijk
Package: libxml-parser-perl
Version: 2.34-4
Severity: grave

A heap overflow can be triggered in the Expat library wrapper
when running on an input stream in non-raw mode. This bug has
also been reported at CPAN:
  http://rt.cpan.org/Ticket/Display.html?id=19859

The following example program will crash with a segmentation fault
on certain input:
--
use strict;
use encoding 'utf8';
use XML::Parser;
my $parser = XML::Parser-new();
$parser-parse(\*STDIN);
--

The following program generates example input on which the above
program crashes:
--
binmode(STDOUT, ':bytes');
print s\n;
for (my $i = 0; $i  4; $i++) { print chr(0xc3) . chr(0xa9); }
print \n/s\n;
--

The overflow occurs in libxml-parser-perl-2.34/Expat/Expat.xs, line 388:
  Copy(tb, buffer, br, char)

At this point, the Expat wrapper assumes that the number of bytes
copied (br), can not exceed the number of characters read from the
input (buffsize). This assumption is incorrect if the input stream is
in a non-raw mode.

The best solution is to have the Perl programmer set the stream
to raw mode, since libexpat expects raw bytes anyway. In the example
program above, this could be accomplished either by removing the
statement use encoding 'utf8' or by adding the statement
binmode(STDIN,':bytes').

I think, however, that a segmentation fault is not a good way
to inform a Perl programmer that he made a mistake. So this
buffer overflow must still be fixed.

Since it involves an input-triggered heap overflow, this is
technically a security vulnerability.

Joris.


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]