Hi Michael,
Thanks for your reply.
Michael Glavassevich wrote:
Hi Neil,
There was a related discussion [1][2] about the SymbolTable on this list
back in March 2005.
Thanks - yes, I did come across that thread before posting. Although
closely related, I don't think it's the same issue: that thread is about
running out of memory while parsing a single document, whereas my issue
is specifically with reusing the same parser to parse many documents
(using a limited set of DTDs). I don't have a problem if I get a new
parser for each document.
Could the parser be keeping the symbol table from previous documents but
not reusing it when it comes across the same DTD in a new document?
Perhaps this behaviour could be affected by my use of
org.apache.xerces.util.XMLCatalogResolver?
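To make the two usage patterns concrete, here is a minimal sketch. It is
not my actual code: it uses plain JAXP, the class and method names are
made up for illustration, and it leaves out the DTDs and the
XMLCatalogResolver, so it only shows the difference in parser lifetime
and will not reproduce the memory growth by itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Xerces or of my application.
public class ParserReuseSketch {

    // Pattern A: one parser instance is created once and reused for every
    // document. This is the pattern where I see memory use grow.
    static int parseWithSharedParser(List<String> docs) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        int parsed = 0;
        for (String xml : docs) {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            parser.parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            parsed++;
        }
        return parsed;
    }

    // Pattern B: a new parser is created for each document, so any
    // per-parser state becomes unreachable after each parse.
    static int parseWithFreshParsers(List<String> docs) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        int parsed = 0;
        for (String xml : docs) {
            SAXParser parser = factory.newSAXParser();
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            parser.parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            parsed++;
        }
        return parsed;
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in documents; the real input is USPTO patent XML.
        List<String> docs = Arrays.asList("<doc><a/></doc>", "<doc><b/></doc>");
        System.out.println("shared parser: " + parseWithSharedParser(docs));
        System.out.println("fresh parsers: " + parseWithFreshParsers(docs));
    }
}
```

Both methods return the number of documents parsed; the only difference
between them is whether the parser object outlives a single document.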
Do these large documents contain similar names, or do
they contain many unique names? Specifically, do your documents look like
this?
Doc 1: <doc><elem1/> <elem2/> . . . <elem99999/> <elem100000/></doc>
...
Doc n: <doc><elem1-n/> <elem2-n/> . . . <elem99999-n/>
<elem100000-n/></doc>
No, the data is not like that. There are a decent number of element names
as well as some heavily reused elements. The DTDs contain more than
2000 entity declarations. I'm processing US patent application data
from the USPTO using their DTDs:
* us-patent-application-v41-2005-08-25.dtd
* us-patent-application-v40-2004-12-02.dtd
* us-sequence-listing-2004-03-09.dtd
* pap-v16-2002-01-01.dtd
* pap-v15-2001-01-31.dtd
Cheers,
Neil Bacon
Cambia