Hi Michael,
Thanks for your reply.

Michael Glavassevich wrote:
Hi Neil,

There was a related discussion [1][2] about the SymbolTable on this list back in March 2005.
Thanks - yes, I did come across that thread before posting. Although closely related, I don't think it's the same issue: that thread is about running out of memory while parsing a single document, whereas my issue is specifically with reusing the same parser to parse many documents (using a limited set of DTDs). I don't have a problem if I get a new parser for each document.

Could the parser be keeping the symbol table from previous documents but not reusing it when it comes across the same DTD in a new document? Perhaps this behaviour could be affected by my use of org.apache.xerces.util.XMLCatalogResolver?
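
For illustration, the two usage patterns being compared (one parser reused for every document versus a new parser per document) can be sketched as below. This is a minimal sketch using the standard JAXP API, not the actual code under discussion: the class name ParserReuseSketch, the ElementCounter handler and the tiny in-memory documents are all hypothetical, and the DTDs and the XMLCatalogResolver are left out, so it shows only the call patterns and will not reproduce the memory growth.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParserReuseSketch {

    /** Hypothetical handler: counts start tags so each parse has an observable result. */
    static final class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    /** Pattern A: a single parser instance is reused for every document. */
    public static int parseAllWithSharedParser(List<String> docs) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        int total = 0;
        for (String xml : docs) {
            ElementCounter handler = new ElementCounter();
            // Any state the implementation keeps between parses lives as long as this parser.
            parser.parse(new InputSource(new StringReader(xml)), handler);
            total += handler.count;
        }
        return total;
    }

    /** Pattern B: a new parser is created for each document and dropped afterwards. */
    public static int parseAllWithFreshParsers(List<String> docs) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        int total = 0;
        for (String xml : docs) {
            ElementCounter handler = new ElementCounter();
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            total += handler.count;
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in documents; the real input would be large files that reference DTDs.
        List<String> docs = List.of("<doc><a/><b/></doc>", "<doc><a/><c/><c/></doc>");
        System.out.println(parseAllWithSharedParser(docs));
        System.out.println(parseAllWithFreshParsers(docs));
    }
}
```

Both methods produce the same parse results; the only difference is the lifetime of the parser object, which is the variable in question here.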
Do these large documents contain similar names, or do they contain many unique names? Specifically, do your documents look like this?

Doc 1: <doc><elem1/> <elem2/> . . . <elem99999/> <elem100000/></doc>
...
Doc n: <doc><elem1-n/> <elem2-n/> . . . <elem99999-n/> <elem100000-n/></doc>

No, the data is not like that. There are a decent number of element names as well as some heavily reused elements. The DTDs contain more than 2000 entity declarations. I'm processing US patent application data from the USPTO using their DTDs:

   * us-patent-application-v41-2005-08-25.dtd
   * us-patent-application-v40-2004-12-02.dtd
   * us-sequence-listing-2004-03-09.dtd
   * pap-v16-2002-01-01.dtd
   * pap-v15-2001-01-31.dtd

Cheers,
   Neil Bacon
   Cambia
