[ http://issues.apache.org/jira/browse/XERCESC-1556?page=all ]
     
Alberto Massari resolved XERCESC-1556:
--------------------------------------

    Resolution: Duplicate

Given your description, this seems to be a duplicate of XERCESC-1542; try 
getting the latest version of SchemaInfo.* from SVN and see if the fix for that 
bug also applies to you.

Alberto

> Severe performance problem validating with large schema
> -------------------------------------------------------
>
>          Key: XERCESC-1556
>          URL: http://issues.apache.org/jira/browse/XERCESC-1556
>      Project: Xerces-C++
>         Type: Bug
>   Components: Validating Parser (Schema) (Xerces 1.5 or up only)
>     Versions: 2.4.0, 2.7.0
>  Environment: HP-UX 11.11 on PA-RISC (HP 9000/800); C++ compiler is aCC 
> vA.03.45.
>     Reporter: Larry West
>  Attachments: xerces-c-gprof-analysis.txt, xerces-c-gprof-analysis.txt, 
> xerces-c-gprof-out.zip, xerces-test-pseudocode.txt
>
> (I will try to attach a separate file with the C++ application pseudo-code 
> that experiences the performance problem: xerces-test-pseudocode.txt.)
> The problem was observed against both the 2.4.0 and 2.7.0 versions of 
> Xerces-C, running in a single-threaded application on an unloaded server.
> The schema we are validating against is huge, but publicly available, so 
> I'll just provide a URL.  There are actually several very similar versions of 
> this, with names such as "2004v3.0" and "2005v1.2".   There are about 536 
> files in the 2005v2.0 version, about 4.75MB, though I don't know how much of 
> that is actively in use (a lot of it is, though).  All recent versions see 
> the same performance problems.
> A general page is at: http://www.irs.gov/efile/article/0,,id=128360,00.html
> The schema giving us problems is contained in the Zip file 
> efile1120x_2005v2.0.zip, URL=
>       http://www.irs.gov/pub/irs-schema/efile1120x_2005v2.0.zip
> When you expand this, the directory structure is, of course, important.  The 
> "2005v2.0" directory tree contains the top-level schema in question (for the 
> 1120 business returns) at:
>       2005v2.0/CorporateIncomeTax/Corp1120/Return1120.xsd
> The data files (business income tax returns) that are validated against this 
> can be over a megabyte in size, though I don't know how much that affects the 
> time to validate (that is, I assume the time does depend on the size, but I 
> haven't measured the relation between the two).
> The problem:
> The problem is that it takes 2-4 hours to validate against this schema on a 
> fairly high-performing platform.  For comparison, using Xerces-J v2.7.1 to do 
> the same validation normally takes under a minute (though it uses about four 
> times the memory).
> I believe I have identified the areas causing the problem: repeated 
> sequential lookups through lists that have 2000+ elements.  In most cases, 
> my testing shows that there is never a match for any of these lookups.
> I was planning on introducing a hash-map to cache the results of the first 
> lookup, but using Xerces-J turned out to be a more practical approach in my 
> case.
> So, what follows are my notes from the debugging and performance 
> instrumentation I've done.
> The apparent key point: the "higher-level" (4-param) 
> SchemaInfo::getTopLevelComponent() is called 4920 times, but it calls the 
> "lower-level" (3-param) one 1.78M times because (here's pseudo-code for the 
> 4-param version):
>     //== get here 4920 times 
>     DOMElement* child = getTopLevelComponent(compCategory, compName, name);
>     if ( child == 0) 
>     {   //== get here 4159 times
>         listSize = fIncludeInfoList->size();
>         //== listSize always 427 --> number of include files
>         for ( i = 0 ; i < listSize ; ++i ) {
>             SchemaInfo *ptr = fIncludeInfoList->elementAt(i);
>             child = ptr->getTopLevelComponent(compCategory, compName, name);
>             //== the above NEVER succeeds.  It's called 4159*427 (1.78M)
>             //== times.
>         }
>     }
> Part of my investigation involved using gprof; I will try to attach my 
> conclusions from that as a separate attachment 
> ("xerces-c-gprof-analysis.txt"), and the gprof output (which is large, hence 
> zipped) as a 2nd attachment, "xerces-c-gprof-out.zip".
> Other notes:
> From casual observation, it appears that very little of the time is spent 
> doing I/O.  It appears that the schema (all of its files) is read in once.   
> I'm not sure, though, whether that happens very quickly at the beginning or 
> whether it's spread out over the 2-hour run.
> Also, the memory usage rises up to about 64MB reasonably early in the process 
> (matter of minutes), then stays flat... which also suggests to me that it has 
> finished parsing the schema files early on.  [As I stated earlier, Xerces-J 
> takes under a minute to do this.  It grows to ~256MB early on and stays flat 
> after that.] 
> If a sample data file is needed for investigation, let me know and I'll get 
> one.
> Larry West
> Intuit, Inc
> Consumer Tax Group
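
The hash-map cache proposed in the quoted report could look roughly like the
sketch below. It is only an illustration under stated assumptions: ToySchema,
ComponentKey and CachingResolver are hypothetical names, not Xerces-C classes,
and std::map is used for brevity where a hash map would serve the same purpose.
The idea is to remember the outcome of each (category, name) lookup, including
misses, so that a repeated miss no longer rescans every included document.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one schema document. This is NOT the Xerces-C
// SchemaInfo class; it only mimics a per-document component lookup.
typedef std::pair<unsigned short, std::string> ComponentKey; // (category, name)

struct ToySchema {
    std::map<ComponentKey, int> components;

    void add(unsigned short category, const std::string& name, int id) {
        components[ComponentKey(category, name)] = id;
    }

    // "Lower-level" lookup: searches this document only, null on a miss.
    const int* find(const ComponentKey& key) const {
        auto it = components.find(key);
        return it == components.end() ? nullptr : &it->second;
    }
};

// "Higher-level" lookup with a result cache (hypothetical design).
// The schemas passed in must outlive the resolver.
class CachingResolver {
public:
    CachingResolver(const ToySchema& top, const std::vector<ToySchema>& includes)
        : top_(top), includes_(includes), lowLevelCalls_(0) {}

    const int* find(unsigned short category, const std::string& name) {
        const ComponentKey key(category, name);
        auto hit = cache_.find(key);
        if (hit != cache_.end())
            return hit->second;       // cached result, possibly a cached miss

        // First request for this key: top-level document, then each include.
        const int* result = top_.find(key);
        ++lowLevelCalls_;
        for (std::size_t i = 0; result == nullptr && i < includes_.size(); ++i) {
            result = includes_[i].find(key);
            ++lowLevelCalls_;
        }
        cache_[key] = result;         // remember hits and misses alike
        return result;
    }

    // Number of per-document lookups performed so far.
    std::size_t lowLevelCalls() const { return lowLevelCalls_; }

private:
    const ToySchema& top_;
    const std::vector<ToySchema>& includes_;
    std::map<ComponentKey, const int*> cache_;
    std::size_t lowLevelCalls_;
};
```

With 427 empty include documents, the first lookup of a missing name costs 428
per-document lookups in this sketch, and every repeat of that name costs none.
Caching a miss is only safe if no document gains components after the miss was
recorded; whether that holds inside Xerces-C during schema traversal is an
assumption that would need checking.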
