[ http://issues.apache.org/jira/browse/XERCESC-1556?page=all ]
Alberto Massari resolved XERCESC-1556:
--------------------------------------
Resolution: Duplicate
Given your description, this seems to be a duplicate of XERCESC-1542; try
getting the latest version of SchemaInfo.* from SVN and see if the fix for that
bug also applies to you.
Alberto
> Severe performance problem validating with large schema
> -------------------------------------------------------
>
> Key: XERCESC-1556
> URL: http://issues.apache.org/jira/browse/XERCESC-1556
> Project: Xerces-C++
> Type: Bug
> Components: Validating Parser (Schema) (Xerces 1.5 or up only)
> Versions: 2.4.0, 2.7.0
> Environment: HP-UX 11.11 on PA-RISC (HP 9000/800); C++ compiler is aCC
> vA.03.45.
> Reporter: Larry West
> Attachments: xerces-c-gprof-analysis.txt, xerces-c-gprof-analysis.txt,
> xerces-c-gprof-out.zip, xerces-test-pseudocode.txt
>
> (I will try to attach a separate file with the C++ application pseudo-code
> that experiences the performance problem: xerces-test-pseudocode.txt.)
> The problem was observed against both the 2.4.0 and 2.7.0 versions of
> Xerces-C, running in a single-threaded application on an unloaded server.
> The schema we are validating against is huge, but publicly available, so
> I'll just provide a URL. There are actually several very similar versions of
> this, with names such as "2004v3.0" and "2005v1.2". There are about 536 files
> in the 2005v2.0 version, about 4.75MB, though I don't know how much of that is
> actively in use (a lot of it is, though). All recent versions see the same
> performance problems.
> A general page is at: http://www.irs.gov/efile/article/0,,id=128360,00.html
> The schema giving us problems is contained in the Zip file
> efile1120x_2005v2.0.zip, URL=
> http://www.irs.gov/pub/irs-schema/efile1120x_2005v2.0.zip
> When you expand this, the directory structure is, of course, important. The
> "2005v2.0" directory tree contains the top-level schema in question (for the
> 1120 business returns) at:
> 2005v2.0/CorporateIncomeTax/Corp1120/Return1120.xsd
> The data files (business income tax returns) that are validated against this
> can be over a megabyte in size, though I don't know how much that affects the
> time to validate (that is, I assume the time does depend on the size, but I
> haven't measured the relation between the two).
> The problem:
> The problem is that it takes 2-4 hours to validate against this schema on a fairly
> high-performing platform. For comparison, using Xerces-J v2.7.1 to do the
> same validation normally takes under a minute (though four times the memory).
> I believe I have identified the areas causing the problem, which are repeated
> sequential lookups through lists that have 2000+ elements. And in most
> cases, my testing shows that there is never a match to any of these lookups.
> I was planning on introducing a hash-map to cache the results of the first
> lookup, but using Xerces-J turned out to be a more practical approach in my
> case.
> So, what follows are my notes from the debugging and performance
> instrumentation I've done.
> Apparent key point: the "higher-level" (4-param)
> SchemaInfo::getTopLevelComponent() is called 4920 times, but it calls the
> "lower-level" (3-param) one 1.78M times. Here is pseudo code for the
> 4-param version:
> //== get here 4920 times
> DOMElement* child = getTopLevelComponent(compCategory, compName, name);
> if (child == 0)
> {   //== get here 4159 times
>     listSize = fIncludeInfoList->size();
>     //== listSize always 427 --> number of include files
>     for (i = 0; i < listSize; ++i) {
>         SchemaInfo *ptr = fIncludeInfoList[i];
>         child = ptr->getTopLevelComponent(compCategory, compName, name);
>         //== the above NEVER succeeds. It's called 4159*427 (1.78M) times.
>     }
> }
> Part of my investigation involved using gprof; I will try to attach my
> conclusions from that as a separate attachment
> ("xerces-c-gprof-analysis.txt"), and the gprof output (which is large, hence
> zipped) as a 2nd attachment, "xerces-c-gprof-out.zip".
> Other notes:
> From casual observation, it appears that very little of the time is spent
> doing I/O. It appears that the schema (all its files) is read in once.
> I'm not sure, though, whether that happens very quickly at the beginning, or
> whether it's spread out over the 2 hour run.
> Also, the memory usage rises up to about 64MB reasonably early in the process
> (matter of minutes), then stays flat... which also suggests to me that it has
> finished parsing the schema files early on. [As I stated earlier, Xerces-J
> takes under a minute to do this. It grows to ~256MB early on and stays flat
> after that.]
> If a sample data file is needed for investigation, let me know and I'll get
> one.
> Larry West
> Intuit, Inc
> Consumer Tax Group
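
The hash-map cache mentioned in the report (remembering the result of the first lookup so that repeated misses do not rescan every included schema) might look roughly like the sketch below. This is not Xerces-C++ code and not the fix applied for XERCESC-1542: IncludedSchema, CachedLookup and totalScans are hypothetical names, and the key is simplified to a single name string, whereas the real lookup also takes a component category and component name.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one included schema document.
// It is NOT the Xerces-C++ SchemaInfo class.
struct IncludedSchema {
    std::vector<std::string> componentNames;  // top-level component names
    mutable std::size_t scans = 0;            // number of lookups performed

    // Sequential lookup, playing the role of the "lower-level" call
    // described in the report.
    const std::string* find(const std::string& name) const {
        ++scans;
        for (const std::string& n : componentNames) {
            if (n == name) return &n;
        }
        return nullptr;
    }
};

// Remembers the outcome of the first search over all included schemas,
// misses included, so a repeated query costs one hash lookup.
// Assumption: the include list is not modified after construction,
// so the cached pointers stay valid.
class CachedLookup {
public:
    explicit CachedLookup(const std::vector<IncludedSchema>& includes)
        : includes_(includes) {}

    const std::string* find(const std::string& name) {
        auto it = cache_.find(name);
        if (it != cache_.end()) return it->second;  // cached hit or cached miss

        const std::string* result = nullptr;
        for (const IncludedSchema& schema : includes_) {
            result = schema.find(name);
            if (result != nullptr) break;           // stop at the first match
        }
        cache_.emplace(name, result);               // a nullptr records a miss
        return result;
    }

private:
    const std::vector<IncludedSchema>& includes_;
    std::unordered_map<std::string, const std::string*> cache_;
};

// Sums the scan counters, to compare the cost with and without the cache.
std::size_t totalScans(const std::vector<IncludedSchema>& includes) {
    std::size_t total = 0;
    for (const IncludedSchema& schema : includes) total += schema.scans;
    return total;
}
```

With 427 included schemas, asking for the same missing name many times scans each schema once instead of once per request. How much of the 4159 x 427 lower-level calls such a cache would remove depends on how often the same key recurs, which the report does not state.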
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira