[ 
https://issues.apache.org/jira/browse/XERCESJ-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730172#action_12730172
 ] 

Richard Kelly commented on XERCESJ-1382:
----------------------------------------

I took a look at this today.

With the default settings, the problem XML file parsed in less than 200 ms, 
but parsing took over 30 seconds when I set the "cdata-sections" parameter to 
true beforehand.
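
For anyone who wants to reproduce the comparison, here is a minimal sketch of 
how the parameter can be toggled through the DOM Level 3 Load/Save API. The 
class name, the synthetic document and the line count are assumptions made for 
illustration; this is not the attached xerces_performance_problem.xml, and the 
timings will depend on which parser implementation is on the classpath.

```java
import java.io.StringReader;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical helper, not part of Xerces.
final class CdataParseTiming {

    // Builds a synthetic document: one CDATA section holding many short lines.
    static String buildXml(int lines) {
        StringBuilder sb = new StringBuilder("<root><![CDATA[");
        for (int i = 0; i < lines; i++) {
            sb.append("<item>").append(i).append("</item>\n");
        }
        return sb.append("]]></root>").toString();
    }

    // Parses the XML with the "cdata-sections" DOM configuration parameter
    // set explicitly to the given value.
    static Document parse(String xml, boolean cdataSections) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS ls =
                    (DOMImplementationLS) registry.getDOMImplementation("LS");
            LSParser parser = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
            parser.getDomConfig().setParameter("cdata-sections", Boolean.valueOf(cdataSections));
            LSInput input = ls.createLSInput();
            input.setCharacterStream(new StringReader(xml));
            return parser.parse(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM LS implementation available", e);
        }
    }

    public static void main(String[] args) {
        String xml = buildXml(20000); // assumed size, adjust as needed
        for (boolean flag : new boolean[] {false, true}) {
            long start = System.nanoTime();
            parse(xml, flag);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("cdata-sections=" + flag + ": " + ms + " ms");
        }
    }
}
```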

I first tried replacing the String concatenation in the 
CharacterDataImpl.appendData() method with a StringBuffer.  This reduced the 
parse time to around 10 seconds but that still seemed unacceptably slow.
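
To illustrate the idea (this is a sketch, not the actual change I made to 
CharacterDataImpl), the two approaches differ roughly as follows. The class 
and method names are made up for the example.

```java
// Hypothetical illustration, not Xerces code.
final class AppendDataSketch {

    // String concatenation: each append copies everything accumulated so far,
    // so the total work grows quadratically with the number of chunks.
    static String appendWithConcat(String[] chunks) {
        String data = "";
        for (String chunk : chunks) {
            data = data + chunk;
        }
        return data;
    }

    // StringBuffer: appends go into a growable buffer and the final String
    // is built once at the end.
    static String appendWithBuffer(String[] chunks) {
        StringBuffer buffer = new StringBuffer();
        for (String chunk : chunks) {
            buffer.append(chunk);
        }
        return buffer.toString();
    }
}
```

Both variants produce the same string; only the amount of copying differs, 
which is why the buffer helps but does not reduce the number of calls.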

After profiling, the problem appears to be that appendData() is called far too 
many times, because of the number of line breaks in the 
xerces_performance_problem.xml file.

The scanData() method in org.apache.xerces.impl.XMLEntityScanner splits 
character data at each line break rather than filling the whole buffer.  This 
produces very small chunks of character data, so appendData() gets called far 
more often than necessary.
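
The effect on the number of calls can be shown with a simplified model. This 
is not the scanData() implementation; the class, the method and the buffer 
size used below are assumptions chosen only to show how the chunk count 
changes when a scan stops at every line break.

```java
// Simplified model of chunked scanning, not the Xerces implementation.
final class ChunkCountModel {

    // Counts how many chunks the data is delivered in. Each chunk ends when
    // the buffer is full or, if breakOnNewline is set, right after the next
    // line break, whichever comes first.
    static int countChunks(String data, int bufferSize, boolean breakOnNewline) {
        int chunks = 0;
        int pos = 0;
        while (pos < data.length()) {
            int end = Math.min(pos + bufferSize, data.length());
            if (breakOnNewline) {
                int newline = data.indexOf('\n', pos);
                if (newline >= 0 && newline < end) {
                    end = newline + 1;
                }
            }
            chunks++;
            pos = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("x\n"); // 1000 very short lines, 2000 chars in total
        }
        String data = sb.toString();
        System.out.println(countChunks(data, 2048, true));  // 1000
        System.out.println(countChunks(data, 2048, false)); // 1
    }
}
```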

I tried instructing the scanData() method to ignore line breaks in character 
data, by commenting out these four lines:

           else if (c == '\n' || (external && c == '\r')) {
               fCurrentEntity.position--;
               break;
           }

With this change the full buffer is used, which fixed the performance problem 
(the problem XML file parsed in well under 1 second).  Unfortunately, removing 
this code has the side effect of breaking some other things (e.g. the Locator), 
but perhaps someone more familiar with the XMLEntityScanner can suggest a way 
to fix this.

In the meantime, a temporary workaround would be to remove line breaks from 
the CDATA sections in your problematic XML files before parsing.
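
A rough sketch of such a preprocessing step is shown below. The class and 
method names are hypothetical, and the regex is deliberately naive: it treats 
every "<![CDATA[" ... "]]>" span as a CDATA section, even if it appears inside 
a comment. It also changes the content of the sections, so it is only usable 
when the line breaks inside the CDATA are not significant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical preprocessing helper, not part of Xerces.
final class CdataLineBreakStripper {

    // Non-greedy match of one CDATA section; DOTALL lets '.' match line breaks.
    private static final Pattern CDATA =
            Pattern.compile("<!\\[CDATA\\[(.*?)\\]\\]>", Pattern.DOTALL);

    // Removes \r and \n inside CDATA sections and leaves the rest untouched.
    static String stripLineBreaksInCdata(String xml) {
        Matcher m = CDATA.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1).replaceAll("[\\r\\n]", "");
            // quoteReplacement keeps '$' and '\' in the body literal.
            m.appendReplacement(out, Matcher.quoteReplacement("<![CDATA[" + body + "]]>"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```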

> Performance problem in org.apache.xerces.dom.CharacterDataImpl appendData
> -------------------------------------------------------------------------
>
>                 Key: XERCESJ-1382
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1382
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: DOM (Level 3 Core)
>    Affects Versions: 2.6.0, 2.9.1
>         Environment: Windows XP SP2; JRE 1.6.0_13; Xerces2 Java Parser 2.9.1 
> Release (Xerces-J-bin.2.9.1.zip)
>            Reporter: Bene
>            Priority: Critical
>         Attachments: xerces_performance_problem.png, 
> xerces_performance_problem.xml
>
>
> It takes too long to parse a large XML document if the document contains 
> CDATA sections that contain embedded XML.
> The problem initially occurred with Xerces 2.6.0, where it took about 30 
> seconds (!) to parse an XML document of about 250 KB.
> So we upgraded to Xerces 2.9.1, which improves parse time to about 5 seconds. 
> Unfortunately this is still much too slow!
> I tried to find similar bug reports and there are many:
> XERCESJ-102
> XERCESJ-1268
> XALANJ-2398
> Unfortunately the issue is still not fixed, so I decided to create this 
> report.


