Hi Neil,

Neil Bacon <[EMAIL PROTECTED]> wrote on 08/28/2006 07:37:56 PM:

> Hi Michael,
> Thanks for your reply.
>
> Michael Glavassevich wrote:
> > Hi Neil,
> >
> > There was a related discussion [1][2] about the SymbolTable on this list
> > back in March 2005.
> Thanks - yes I did come across that thread before posting. Although
> closely related, I don't think its the same issue because that is about
> running out of memory parsing a single document and my issue is
> specifically with reusing the same parser to parse many documents (using
> a limited set of DTDs). I don't have a problem if I get a new parser for
> each document.

Whether it's one large document with a million different names or a
thousand documents with those million names distributed across them it has
the same effect. The parser's SymbolTable will have all of the names in its
cache. If this is what's happening you can write an extension to the
SymbolTable which uses less memory (possibly one which doesn't cache at
all) and set it on the parser.

> Could the parser be keeping the symbol table from previous documents but
> not reusing it when it comes across the same DTD in a new document?

A parser instance only has one SymbolTable. The one it has will only be
replaced if you explicitly replace it by setting a different SymbolTable on
the parser.

> Perhaps this behaviour could be affected by my use of
> org.apache.xerces.util.XMLCatalogResolver?

How are you using it?

> > Do these large documents contain similar names or do
> > they contain many unique names. Specifically do your documents look like
> > this?
> >
> > Doc 1: <doc><elem1/> <elem2/> . . . <elem99999/> <elem100000/></doc>
> > ...
> > Doc n: <doc><elem1-n/> <elem2-n/> . . . <elem99999-n/>
> > <elem100000-n/></doc>
> >
> No the data is not like that. There are a decent number of element names
> as well as some heavily reused elements. The DTD's contain more than
> 2000 entity declarations.
> I'm processing US patent application data
> from the USPTO using their DTD's:
>
> * us-patent-application-v41-2005-08-25.dtd
> * us-patent-application-v40-2004-12-02.dtd
> * us-sequence-listing-2004-03-09.dtd
> * pap-v16-2002-01-01.dtd
> * pap-v15-2001-01-31.dtd

From a quick perusal these DTDs (including the external entities they
reference) look very large. It's not just the entity declarations. Just
about everything in these DTDs which matches the Name production from the
XML spec gets added to the SymbolTable. I assume each document you parse
only references one of them. Perhaps it's the sum of the unique names from
each of the DTDs which leads to your app running out of memory.

> Cheers,
> Neil Bacon
> Cambia
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]
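
To make the point about a reused parser's SymbolTable concrete, here is a
minimal, self-contained sketch. It does not use Xerces at all: NameTable,
parse and NameTableDemo are hypothetical names invented for illustration,
and the real SymbolTable is implemented differently. It only shows that a
table which is never replaced ends up holding the union of the unique names
from every document fed through it, while a fresh table per document holds
only that document's names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NameTableDemo {

    /** Hypothetical stand-in for a parser's symbol table: one canonical String per unique name. */
    static class NameTable {
        private final Map<String, String> symbols = new HashMap<>();

        /** Returns the canonical instance for a name, caching the name the first time it is seen. */
        String addSymbol(String name) {
            String existing = symbols.putIfAbsent(name, name);
            return existing != null ? existing : name;
        }

        /** Number of unique names currently cached. */
        int size() {
            return symbols.size();
        }
    }

    /** Simulates parsing one document by feeding each of its names to the table. */
    static void parse(NameTable table, List<String> names) {
        for (String name : names) {
            table.addSymbol(name);
        }
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("doc", "elem1", "elem2");
        List<String> docN = Arrays.asList("doc", "elem1-n", "elem2-n");

        // One table reused across documents, as with a reused parser instance.
        NameTable shared = new NameTable();
        parse(shared, doc1);
        parse(shared, docN);
        System.out.println("shared table: " + shared.size() + " names"); // prints "shared table: 5 names"

        // A fresh table per document, as with a new parser per document.
        NameTable fresh = new NameTable();
        parse(fresh, docN);
        System.out.println("fresh table: " + fresh.size() + " names"); // prints "fresh table: 3 names"
    }
}
```

With several large DTDs in play, the shared case is the one to watch: each
DTD contributes its own unique names and nothing is ever evicted.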
