Severe performance problem validating with large schema
-------------------------------------------------------
Key: XERCESC-1556
URL: http://issues.apache.org/jira/browse/XERCESC-1556
Project: Xerces-C++
Type: Bug
Components: Validating Parser (Schema) (Xerces 1.5 or up only)
Versions: 2.4.0, 2.7.0
Environment: HP-UX 11.11 on PA-RISC (HP 9000/800); C++ compiler is aCC
vA.03.45.
Reporter: Larry West
(I will try to attach a separate file with the C++ application pseudo-code that
experiences the performance problem: xerces-test-pseudocode.txt.)
The problem was observed against both the 2.4.0 and 2.7.0 versions of Xerces-C,
running in a single-threaded application on an unloaded server.
The schema we are validating against is huge, but publicly available, so I'll
just provide a URL. There are actually several very similar versions of it,
with names such as "2004v3.0" and "2005v1.2". There are about 536 files in the
2005v2.0 version, about 4.75MB, though I don't know how much of that is
actively in use (a lot of it is, though). All recent versions see the same
performance problems.
A general page is at: http://www.irs.gov/efile/article/0,,id=128360,00.html
The schema giving us problems is contained in the Zip file
efile1120x_2005v2.0.zip, URL=
http://www.irs.gov/pub/irs-schema/efile1120x_2005v2.0.zip
When you expand this, the directory structure is, of course, important. The
"2005v2.0" directory tree contains the top-level schema in question (for the
1120 business returns) at:
2005v2.0/CorporateIncomeTax/Corp1120/Return1120.xsd
The data files (business income tax returns) that are validated against this
can be over a megabyte in size, though I don't know how much that affects the
time to validate (that is, I assume the time does depend on the size, but I
haven't measured the relation between the two).
The problem:
The problem is that it takes 2-4 hours to validate against this schema on a
fairly high-performing platform. For comparison, using Xerces-J v2.7.1 to do
the same validation normally takes under a minute (though with four times the
memory).
I believe I have identified the areas causing the problem, which are repeated
sequential lookups through lists that have 2000+ elements. And in most cases,
my testing shows that there is never a match to any of these lookups. I was
planning on introducing a hash-map to cache the results of the first lookup,
but using Xerces-J turned out to be a more practical approach in my case.
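For reference, the kind of cache I had in mind could look roughly like the
sketch below. It is not Xerces-C code: Component, CachedLookup and the single
string key are hypothetical stand-ins (a real key would have to combine the
component category, component name and name), and it assumes the underlying
lists do not change once a result has been cached. The point is that "not
found" results are remembered too, since those are the common case here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a schema component; not a Xerces-C type.
struct Component {
    std::string name;
};

// Memoizes a slow lookup. Misses are cached as well (stored as nullptr),
// so a key that is never found costs one slow scan instead of many.
class CachedLookup {
public:
    using SlowLookup = std::function<const Component*(const std::string&)>;

    explicit CachedLookup(SlowLookup slow) : slow_(std::move(slow)) {
        assert(slow_ && "a lookup function is required");
    }

    const Component* find(const std::string& key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            return it->second;              // cached hit or cached miss
        }
        const Component* result = slow_(key);  // the expensive sequential scan
        ++slowCalls_;
        cache_.emplace(key, result);
        return result;
    }

    // Number of times the slow lookup was actually invoked.
    std::size_t slowCalls() const { return slowCalls_; }

private:
    SlowLookup slow_;
    std::unordered_map<std::string, const Component*> cache_;
    std::size_t slowCalls_ = 0;
};
```

Wrapping the existing sequential search in something like this would leave the
search code itself untouched; whether the extra memory is acceptable depends on
how many distinct keys are looked up.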
So, what follows are my notes from the debugging and performance
instrumentation I've done.
The apparent key point: the "higher-level" (4-param)
SchemaInfo::getTopLevelComponent() is called 4920 times, but it calls the
"lower-level" (3-param) one 1.78M times, because (here's pseudo code for the
4-param version):
    //== get here 4920 times
    DOMElement* child = getTopLevelComponent(compCategory, compName, name);
    if (child == 0)
    {   //== get here 4159 times
        listSize = fIncludeInfoList->size();
        //== listSize always 427 --> number of include files
        for (i = 0; i < listSize; ++i) {
            // fIncludeInfoList is a pointer, so index via elementAt()
            SchemaInfo *ptr = fIncludeInfoList->elementAt(i);
            child = ptr->getTopLevelComponent(compCategory, compName, name);
            //== the above NEVER succeeds. It's called 4159*427 (1.78M) times.
        }
    }
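To double-check the arithmetic, here is a small stand-alone model of the call
pattern above. ModelSchema, Element and countLowLevelCalls are made-up names
for illustration only, every lookup is assumed to miss (as observed), and the
loop is assumed to stop as soon as something is found. With 4159 misses and
427 includes, the fan-out alone accounts for 4159 * 427 = 1,775,893 low-level
calls, on top of the 4159 initial local lookups.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element type; the model never actually produces one.
struct Element {};

// Hypothetical stand-in for SchemaInfo in which every lookup misses.
struct ModelSchema {
    std::vector<const ModelSchema*> includes;

    // Models the 3-param lookup: counts the call and finds nothing.
    const Element* lookupLocal(std::size_t& calls) const {
        ++calls;
        return nullptr;
    }

    // Models the 4-param lookup: local lookup first, then every include
    // in turn until something is found (which never happens here).
    const Element* lookup(std::size_t& calls) const {
        const Element* child = lookupLocal(calls);
        for (std::size_t i = 0; child == nullptr && i < includes.size(); ++i) {
            child = includes[i]->lookupLocal(calls);
        }
        return child;
    }
};

// Runs `misses` failing top-level lookups against `includeCount` included
// schemas and returns the total number of low-level lookups performed.
std::size_t countLowLevelCalls(std::size_t misses, std::size_t includeCount) {
    std::vector<ModelSchema> included(includeCount);
    ModelSchema top;
    for (const ModelSchema& s : included) {
        top.includes.push_back(&s);
    }
    std::size_t calls = 0;
    for (std::size_t n = 0; n < misses; ++n) {
        top.lookup(calls);
    }
    return calls;
}
```

Calling countLowLevelCalls(4159, 427) reproduces the measured order of
magnitude, and each of those calls is itself a sequential search in the real
code, which is where the hours go.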
Part of my investigation involved using gprof; I will try to attach my
conclusions from that as a separate attachment ("xerces-c-gprof-analysis.txt"),
and the gprof output (which is large, hence zipped) as a 2nd attachment,
"xerces-c-gprof-out.zip".
Other notes:
From casual observation, it appears that very little of the time is spent
doing I/O. It appears that the schema (all of its files) is read in once. I'm
not sure, though, whether that happens very quickly at the beginning or
whether it's spread out over the 2 hour run.
Also, the memory usage rises up to about 64MB reasonably early in the process
(matter of minutes), then stays flat... which also suggests to me that it has
finished parsing the schema files early on. [As I stated earlier, Xerces-J
takes under a minute to do this. It grows to ~256MB early on and stays flat
after that.]
If a sample data file is needed for investigation, let me know and I'll get one.
Larry West
Intuit, Inc
Consumer Tax Group