RE: Parser passes garbage to characters() callback for XML containing character entities

Michael Glavassevich Wed, 27 Jan 2010 10:22:52 -0800

Thomas,

"Thomas Schleu" <[email protected]> wrote on 01/27/2010 07:47:09 AM:


> Michael,
>
> I know that the body text comes in pieces. That's why I check that the
> accumulated text buffer (sb) is empty when looking at the start of the
> characters.

The code you posted is assuming that the beginning of the first chunk will
start with "abc". There is no such guarantee. The text can be split
anywhere and when I ran your program I observed that for one of the
elements "abc" crosses a buffer boundary so on the first callback you only
get the first two characters: "ab". Your code needs to account for this. I
see no issue with Xerces.
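
A minimal sketch of one way to account for this, not a drop-in fix for your program: append every chunk to the buffer in characters() and only inspect the text in endElement(), once the element's content is complete. The class name ItemTextCollector and the parse() helper are hypothetical. The element name "item" and the buffer name sb are taken from your description. The sketch assumes "item" elements contain only text and are not nested.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the complete text of each "item" element.
class ItemTextCollector extends DefaultHandler {
    private final StringBuilder sb = new StringBuilder();
    private final List<String> items = new ArrayList<>();
    private boolean inItem;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("item".equals(qName)) {
            inItem = true;
            sb.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element; a chunk can end anywhere,
        // so only accumulate here and never inspect the content.
        if (inItem) {
            sb.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("item".equals(qName)) {
            inItem = false;
            items.add(sb.toString()); // the text is complete only at this point
        }
    }

    List<String> getItems() {
        return items;
    }

    // Convenience helper (hypothetical): parse a string, return the item texts.
    static List<String> parse(String xml) throws Exception {
        ItemTextCollector handler = new ItemTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getItems();
    }
}
```

A check such as startsWith("abc") can then be run on each string returned by getItems(), where it no longer depends on how the parser happened to split the text.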

> I also only check when I am inside the "item" element.
> The XML is very simple. It just repeats the same element over and over
> again.
> As I mentioned before the error comes when the XML total size exceeds
> 16kB and occurs when parsing the XML element that is behind the first 8kB.
> I looked at the parser source shortly and noticed that it uses an
> internal buffer of 8kB. That's why I assume the problem occurs when
> re-filling the buffer while in the middle of or after processing a
> character entity "&#x19;".

I'm not sure what source you're looking at. Xerces' default buffer size is
2 KB. It's been that size for a long time. Are you sure you're actually
using Apache Xerces and not some derivative like what Sun ships in their
JDK?
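
One quick way to check, sketched here under the assumption that you obtain the parser through JAXP: print the class name of the parser the factory returns. The class name ParserIdentity and the describe() method are hypothetical. With Apache Xerces on the classpath the name should fall under org.apache.xerces, while the copy bundled with Sun's JDK lives under com.sun.org.apache.xerces.internal.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper: reports which SAX parser implementation JAXP selected.
class ParserIdentity {
    static String describe() throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        return parser.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("SAX parser implementation: " + describe());
    }
}
```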

> Once I removed all those character entities the parser worked as
> expected.
>
> Any help you can give?
> Thomas Schleu
> Chief Technology Officer
>
> Mail: mailto:[email protected]
> Fon:  +49-30-390 485 0
> Fax:  +49-30-390 485 55
>
> Canto GmbH
> Alt-Moabit 73
> D-10555 Berlin
> Germany
> http://www.canto.com
> Amtsgericht Berlin-Charlottenburg HRB 88566
> Geschäftsführer: Hans-Dieter Schädel

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]
E-mail: [email protected]
