[jira] Created: (XERCESC-1556) Severe performance problem validating with large schema

Larry West (JIRA) Mon, 23 Jan 2006 18:44:57 -0800

Severe performance problem validating with large schema
-------------------------------------------------------


         Key: XERCESC-1556
         URL: http://issues.apache.org/jira/browse/XERCESC-1556
     Project: Xerces-C++
        Type: Bug
  Components: Validating Parser (Schema) (Xerces 1.5 or up only)  
    Versions: 2.4.0, 2.7.0    
 Environment: HP-UX 11.11 on PA-RISC (HP 9000/800); C++ compiler is aCC vA.03.45.
    Reporter: Larry West


(I will try to attach a separate file with the C++ application pseudo-code that 
experiences the performance problem: xerces-test-pseudocode.txt.)


The problem was observed against both the 2.4.0 and 2.7.0 versions of Xerces-C, 
running in a single-threaded application on an unloaded server.

The schema we are validating against is huge but publicly available, so I'll 
just provide a URL.  There are actually several very similar versions of it, 
with names such as "2004v3.0" and "2005v1.2".  There are about 536 files in the 
2005v2.0 version, totaling about 4.75MB, though I don't know how much of that 
is actively in use (a lot of it is, though).  All recent versions see the same 
performance problems.

A general page is at: http://www.irs.gov/efile/article/0,,id=128360,00.html

The schema giving us problems is contained in the Zip file 
efile1120x_2005v2.0.zip, URL=
        http://www.irs.gov/pub/irs-schema/efile1120x_2005v2.0.zip
When you expand this, the directory structure is, of course, important.  The 
"2005v2.0" directory tree contains the top-level schema in question (for the 
1120 business returns) at:
        2005v2.0/CorporateIncomeTax/Corp1120/Return1120.xsd

The data files (business income tax returns) that are validated against this 
can be over a megabyte in size, though I don't know how much that affects the 
time to validate (that is, I assume the time does depend on the size, but I 
haven't measured the relation between the two).

The problem:

The problem is that it takes 2-4 hours to validate against this schema on a 
fairly high-performing platform.  For comparison, using Xerces-J v2.7.1 to do 
the same validation normally takes under a minute (though it uses four times 
the memory).


I believe I have identified the cause of the problem: repeated sequential 
lookups through lists that have 2000+ elements.  In most cases, my testing 
shows that these lookups never find a match.  I was planning to introduce a 
hash map to cache the result of the first lookup, but using Xerces-J turned 
out to be a more practical approach in my case.
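
The caching idea mentioned above could look roughly like the following.  This 
is only a sketch under stated assumptions, not Xerces-C code: Component, 
CachedLookup, and the "category:name" key format are hypothetical names made 
up for illustration, and it assumes a C++11 compiler.  The point it shows is 
that a miss is cached as well as a hit, so a name that is never found costs 
one sequential scan instead of one scan per request.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a top-level schema component (not a Xerces type).
struct Component {
    int category;
    std::string name;
};

// Remembers the outcome of each sequential search, including "not found",
// so every distinct (category, name) pair scans the list at most once.
class CachedLookup {
public:
    explicit CachedLookup(const std::vector<Component>& components)
        : components_(components), scans_(0) {}

    // Returns the matching component, or nullptr if there is none.
    const Component* find(int category, const std::string& name) {
        // Assumed key format: category and name joined by ':'.
        const std::string key = std::to_string(category) + ":" + name;
        auto hit = cache_.find(key);
        if (hit != cache_.end()) {
            return hit->second;  // may be nullptr: a remembered miss
        }
        ++scans_;
        const Component* found = nullptr;
        for (std::size_t i = 0; i < components_.size(); ++i) {
            if (components_[i].category == category &&
                components_[i].name == name) {
                found = &components_[i];
                break;
            }
        }
        cache_[key] = found;
        // Invariant: every sequential scan adds exactly one cache entry.
        assert(cache_.size() == scans_);
        return found;
    }

    // Number of sequential scans performed so far.
    std::size_t scans() const { return scans_; }

private:
    const std::vector<Component>& components_;  // must outlive this object
    std::unordered_map<std::string, const Component*> cache_;
    std::size_t scans_;
};
```

With a cache like this, asking twice for a name that does not exist performs 
a single scan; the second request is answered from the map.  Whether such a 
cache would be safe inside the real parser (for example, if the lists change 
while the schema is being loaded) is something I have not checked.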

So, what follows are my notes from the debugging and performance 
instrumentation I've done.


The apparent key point: the "higher-level" (4-param) 
SchemaInfo::getTopLevelComponent() is called 4920 times, but it calls the 
"lower-level" (3-param) one 1.78M times because (here's pseudo code for the 
4-param version):

    //== get here 4920 times
    DOMElement* child = getTopLevelComponent(compCategory, compName, name);
    if (child == 0)
    {   //== get here 4159 times
        unsigned int listSize = fIncludeInfoList->size();
        //== listSize is always 427 --> number of include files
        for (unsigned int i = 0; i < listSize; ++i) {
            SchemaInfo* ptr = fIncludeInfoList->elementAt(i);
            child = ptr->getTopLevelComponent(compCategory, compName, name);
            //== the above NEVER succeeds.  It's called 4159*427 (1.78M) times.
        }
    }
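
The arithmetic behind the 1.78M figure can be checked with a small model.  
This is a sketch, not Xerces-C code: LookupCounts and the two functions are 
hypothetical names, and the model assumes the loop over the include list 
always runs to the end, which matches the observation that the inner lookup 
never succeeds.

```cpp
#include <cassert>
#include <cstddef>

// Counts taken from the instrumentation described above.
struct LookupCounts {
    std::size_t topLevelCalls;  // calls to the 4-param version
    std::size_t directMisses;   // calls whose direct lookup returned 0
    std::size_t includeCount;   // entries in the include list
};

// 3-param calls made from inside the loop over the include list,
// assuming the loop never stops early.
std::size_t fanOutCalls(const LookupCounts& c) {
    assert(c.directMisses <= c.topLevelCalls);
    return c.directMisses * c.includeCount;
}

// All 3-param calls: one direct lookup per 4-param call, plus the fan-out.
std::size_t lowerLevelCalls(const LookupCounts& c) {
    return c.topLevelCalls + fanOutCalls(c);
}
```

With the measured values (4920 calls, 4159 misses, 427 includes), the 
fan-out is 4159 * 427 = 1,775,893 calls, and adding the 4920 direct lookups 
gives 1,780,813, which is where the roughly 1.78M comes from.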

Part of my investigation involved using gprof; I will try to attach my 
conclusions from that as a separate attachment ("xerces-c-gprof-analysis.txt"), 
and the gprof output (which is large, hence zipped) as a 2nd attachment, 
"xerces-c-gprof-out.zip".


Other notes:

From casual observation, it appears that very little of the time is spent 
doing I/O.  It appears that the schema (all of its files) is read in once.  I'm 
not sure, though, whether that happens very quickly at the beginning or 
whether it's spread out over the 2-hour run.

Also, the memory usage rises to about 64MB reasonably early in the process 
(within a matter of minutes), then stays flat, which also suggests to me that 
it has finished parsing the schema files early on.  [As I stated earlier, 
Xerces-J takes under a minute to do this.  It grows to ~256MB early on and 
stays flat after that.] 


If a sample data file is needed for investigation, let me know and I'll get one.


Larry West
Intuit, Inc
Consumer Tax Group


