RE: Parser passes garbage to characters() callback for XML containing character entities

Thomas Schleu Wed, 27 Jan 2010 05:21:04 -0800

Michael,

I know that the body text arrives in pieces. That is why I only look at the
start of the characters while the accumulated text buffer (sb) is still
empty.
I also only check while I am inside the "item" element.
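
For comparison, a check that does not depend on how the text is split across characters() calls could buffer everything and compare the complete body in endElement(). The following is only a minimal sketch of that idea. The class name ItemTextChecker, its fields and the check() helper are hypothetical names chosen for illustration, and it assumes every "item" element is expected to carry the same text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not from the original mail): buffers all characters()
// chunks of an "item" element and compares the complete text in endElement().
class ItemTextChecker extends DefaultHandler {
    private final String expected;
    private final StringBuilder sb = new StringBuilder();
    private boolean inItem = false;
    int items = 0;       // number of "item" elements seen
    int mismatches = 0;  // number of items whose full text differed

    ItemTextChecker(String expected) {
        this.expected = expected;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("item".equals(localName)) {
            inItem = true;
            sb.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inItem) {
            sb.append(ch, start, length); // chunks may arrive in any split
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("item".equals(localName)) {
            items++;
            if (!expected.equals(sb.toString())) {
                mismatches++;
            }
            inItem = false;
        }
    }

    // Parses the given XML text and returns the handler with its counters.
    static ItemTextChecker check(String xml, String expected) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            ItemTextChecker handler = new ItemTextChecker(expected);
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            factory.newSAXParser().parse(new ByteArrayInputStream(bytes), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A non-zero mismatches counter after parsing would then point at wrong text from the parser, regardless of where the chunk boundaries fall.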
The XML is very simple. It just repeats the same element over and over
again.
As I mentioned before, the error appears once the total size of the XML
exceeds 16 kB, and it occurs while parsing an element located after the
first 8 kB.
I had a brief look at the parser source and noticed that it uses an internal
buffer of 8 kB. That is why I assume the problem occurs when the buffer is
re-filled in the middle of, or just after, processing a character reference
such as "&#x19;".
Once I removed all those character references, the parser worked as expected.
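
To make the buffer theory a bit more concrete, here is a rough sketch of the arithmetic for the file produced by the generator in my original mail. It only computes byte offsets and does not touch the parser. The class name BufferBoundaryCheck and its helpers are hypothetical, and the 8192-byte buffer size is an assumption based on my brief reading of the parser source.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Xerces): locates the generated <ns:item>
// elements relative to an ASSUMED 8192-byte internal parser buffer.
class BufferBoundaryCheck {
    static final String HEADER =
        "<?xml version=\"1.1\" encoding=\"UTF-8\"?>\n<!DOCTYPE X>\n"
        + "<ns:X xmlns:ns=\"http://www.mycompany.com/ns/X/1.0\">\n";
    static final String ITEM =
        "<ns:item>abcdefghijklmnopqrstuvwxyz&#x19;</ns:item>\n";
    static final String FOOTER = "</ns:X>";
    static final int ITEM_COUNT = 314;
    static final int ASSUMED_BUFFER_SIZE = 8192; // assumption, see lead-in

    static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Total size of the generated file in bytes.
    static int totalSize() {
        return byteLength(HEADER) + ITEM_COUNT * byteLength(ITEM) + byteLength(FOOTER);
    }

    // Byte offset at which the item with the given zero-based index starts.
    static int itemStart(int index) {
        return byteLength(HEADER) + index * byteLength(ITEM);
    }

    // Zero-based index of the item that contains the given byte offset.
    static int itemIndexAt(int offset) {
        return (offset - byteLength(HEADER)) / byteLength(ITEM);
    }

    public static void main(String[] args) {
        int index = itemIndexAt(ASSUMED_BUFFER_SIZE);
        System.out.println("total size in bytes: " + totalSize());
        System.out.println("item containing byte " + ASSUMED_BUFFER_SIZE
            + ": number " + (index + 1)
            + ", starting at byte " + itemStart(index));
    }
}
```

With the strings above, the prolog is 104 bytes, each item is 52 bytes and the whole file is 16439 bytes, so it just exceeds 16 kB. Byte 8192 falls inside the 156th item, which is directly before the 157th element where I first see the garbage. This fits the buffer theory but does not prove it, since I do not know exactly how the parser fills its buffer.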


Any help you can give?
Thomas Schleu
Chief Technology Officer

Mail: mailto:tsch...@canto.com
Fon:  +49-30-390 485 0
Fax:  +49-30-390 485 55

Canto GmbH
Alt-Moabit 73
D-10555 Berlin
Germany
http://www.canto.com
Amtsgericht Berlin-Charlottenburg HRB 88566
Geschäftsführer: Hans-Dieter Schädel


> -----Original Message-----
> From: Gary Gregory [mailto:ggreg...@seagullsoftware.com]
> Sent: Friday, 22 January 2010 19:57
> To: j-users@xerces.apache.org; tsch...@canto.com
> Subject: RE: Parser passes garbage to characters() callback for XML
> containing character entities
> 
> For Xerces 2.9.1, did you add Xerces to your runtime through the Java
> endorsed mechanism [1]?
> 
> Gary
> 
> [1] http://java.sun.com/j2se/1.4.2/docs/guide/standards/
> 
> 
> > -----Original Message-----
> > From: Thomas Schleu [mailto:tsch...@canto.com]
> > Sent: Friday, January 22, 2010 05:29
> > To: j-users@xerces.apache.org
> > Subject: Parser passes garbage to characters() callback for XML
> > containing character entities
> >
> > I can reproduce a problem parsing certain XML 1.1 files that contain
> > lots of character entities (escaped control chars like "&#x19;").
> > At some point in the file the parser calls my characters() method
> > with garbage text.
> >
> > Here is the source code that generates such an XML file:
> >
> >     FileOutputStream fos = new FileOutputStream (new File ("C:/test.xml"));
> >     fos.write (("<?xml version=\"1.1\" encoding=\"UTF-8\"?>\n<!DOCTYPE X>\n"
> >         + "<ns:X xmlns:ns=\"http://www.mycompany.com/ns/X/1.0\">\n").getBytes ("UTF-8"));
> >     final byte[] bytes =
> >         ("<ns:item>abcdefghijklmnopqrstuvwxyz&#x19;</ns:item>\n").getBytes ("UTF-8");
> >     for (int i = 0; i < 314; i++)
> >     {
> >         fos.write(bytes);
> >     }
> >     fos.write ("</ns:X>".getBytes ("UTF-8"));
> >     fos.close ();
> >
> > The XML is very simple. It just contains lots of identical elements
> > with "&#x19;" in the body text.
> > The parsing code looks like the following:
> >
> >     FileInputStream fis = new FileInputStream (new File ("C:/test.xml"));
> >     final SAXParserFactory saxParserFactory = SAXParserFactory.newInstance ();
> >     saxParserFactory.setFeature ("http://xml.org/sax/features/namespaces", Boolean.TRUE);
> >     saxParserFactory.setFeature ("http://xml.org/sax/features/namespace-prefixes", Boolean.TRUE);
> >     final SAXParser parser = saxParserFactory.newSAXParser ();
> >     try
> >     {
> >         parser.parse (fis, new DefaultHandler()
> >         {
> >             StringBuilder sb = new StringBuilder ();
> >             String currentElement = null;
> >
> >             public void startElement (String uri, String localName,
> >                 String qName, Attributes attributes) throws SAXException
> >             {
> >                 currentElement = localName;
> >             }
> >             public void characters (char ch[], int start, int length)
> >                 throws SAXException
> >             {
> >                 if ("item".equals (currentElement))
> >                 {
> >                     String s = new String (ch, start, length);
> >                     if (sb.length () == 0 && !s.startsWith ("abc"))
> >                     {
> >                         // THE PARSER CALLS ME WITH GARBAGE!
> >                         System.out.println ("ERROR");
> >                     }
> >                     sb.append (s);
> >                 }
> >             }
> >             public void endElement (String uri, String localName,
> >                 String qName) throws SAXException
> >             {
> >                 if ("item".equals (localName))
> >                 {
> >                     sb.delete (0, sb.length ());
> >                     currentElement = null;
> >                 }
> >             }
> >         });
> >     }
> >     catch (Exception e)
> >     {
> >         e.printStackTrace ();
> >         System.out.println ("e = " + e);
> >     }
> >
> > My characters() method checks whether the body text is the expected
> > text starting with "abc".
> > After 156 elements with the correct body text, my method gets called
> > with the text "x19;<fghijklmnopqrstuvwxyz" as the starting body text
> > of the element.
> > The XML has to exceed 16kB to show this problem. It may be related
> > to the 8kB internal buffer of the parser.
> >
> > I tested with the parsers shipping with jdk1.5.0_19 and jdk1.6.0_14,
> > as well as with the separate xerces-2_9_1. All show the same behavior.
> > I cannot work around this as I don't have control over the XML input.
> >
> > Anyone who can help me here?
> >
> > Thanks in Advance
> > Thomas Schleu
> > Chief Technology Officer
> >
> > Mail: mailto:tsch...@canto.com
> > Fon:  +49-30-390 485 0
> > Fax:  +49-30-390 485 55
> >
> > Canto GmbH
> > Alt-Moabit 73
> > D-10555 Berlin
> > Germany
> > http://www.canto.com
> > Amtsgericht Berlin-Charlottenburg HRB 88566
> > Geschäftsführer: Hans-Dieter Schädel
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: j-users-unsubscr...@xerces.apache.org
> > For additional commands, e-mail: j-users-h...@xerces.apache.org



