Hi Neil,

Neil Bacon <[EMAIL PROTECTED]> wrote on 08/28/2006 07:37:56 PM:

> Hi Michael,
> Thanks for your reply.
> 
> Michael Glavassevich wrote:
> > Hi Neil,
> >
> > There was a related discussion [1][2] about the SymbolTable on this list 
> > back in March 2005.
> Thanks - yes I did come across that thread before posting. Although 
> closely related, I don't think it's the same issue because that is about 
> running out of memory parsing a single document and my issue is 
> specifically with reusing the same parser to parse many documents (using 
> a limited set of DTDs). I don't have a problem if I get a new parser for 
> each document.

Whether it's one large document with a million different names or a 
thousand documents with those million names distributed across them, it has 
the same effect. The parser's SymbolTable will have all of the names in 
its cache. If this is what's happening, you can write an extension to the 
SymbolTable which uses less memory (possibly one which doesn't cache at 
all) and set it on the parser.
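
To make the growth concrete, here is a rough sketch in plain Java. It is a toy model only: CachingNameTable, NonCachingNameTable and SymbolGrowthDemo are hypothetical names invented for illustration, not Xerces classes, and the real SymbolTable is implemented differently. The non-caching variant assumes the parser still expects one canonical String instance per name, so it hands that job to String.intern() rather than keeping a table of its own.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: this is NOT org.apache.xerces.util.SymbolTable.
class CachingNameTable {
    private final Map<String, String> cache = new HashMap<>();

    // Returns the canonical instance for a name and remembers it
    // for as long as the table itself is alive.
    String addSymbol(String name) {
        String existing = cache.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }

    int size() {
        return cache.size();
    }
}

// Hypothetical variant that keeps no table of its own.
// Assumption: callers only need a canonical instance per name,
// which String.intern() provides.
class NonCachingNameTable {
    String addSymbol(String name) {
        return name.intern();
    }
}

class SymbolGrowthDemo {
    // Feeds `docs` documents of `namesPerDoc` distinct names each
    // into ONE shared table, as a reused parser would, and returns
    // how many names the table holds afterwards.
    static int uniqueNamesAfter(int docs, int namesPerDoc) {
        CachingNameTable table = new CachingNameTable();
        for (int d = 0; d < docs; d++) {
            for (int i = 0; i < namesPerDoc; i++) {
                table.addSymbol("elem" + i + "-" + d);
            }
        }
        return table.size();
    }
}
```

Calling SymbolGrowthDemo.uniqueNamesAfter(1, 1000) and SymbolGrowthDemo.uniqueNamesAfter(10, 100) returns 1000 both times: the shared table ends up the same size whether the names arrive in one document or are spread over several. In Xerces itself the table is, as far as I know, supplied to the parser through the http://apache.org/xml/properties/internal/symbol-table property; check that against the version you are running.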

> Could the parser be keeping the symbol table from previous documents but 
> not reusing it when it comes across the same DTD in a new document?

A parser instance only has one SymbolTable. The one it has will only be 
replaced if you explicitly replace it by setting a different SymbolTable 
on the parser.
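
Since the table lives and dies with the parser instance, the workaround you describe (a new parser per document) can be sketched with the standard JAXP API as below. The class and method names (FreshParserPerDocument, countElements) are made up for this example, and whether a freshly created parser really starts with an empty table depends on the JAXP implementation in use, so treat that as an assumption to verify.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class FreshParserPerDocument {
    // Parses every document with its own SAXParser instance and
    // returns the total number of start tags seen across all of them.
    static int countElements(List<String> documents) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            for (String xml : documents) {
                // A new parser per document; the previous one (and
                // whatever it cached) becomes eligible for collection.
                SAXParser parser = factory.newSAXParser();
                parser.parse(new InputSource(new StringReader(xml)), handler);
            }
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("parse failed", e);
        }
        return count[0];
    }
}
```

The trade-off is that each new parser has to rebuild its internal state, including names from the DTDs, for every document, so this exchanges some speed for bounded memory.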

> Perhaps this behaviour could be affected by my use of 
> org.apache.xerces.util.XMLCatalogResolver?

How are you using it?

> > Do these large documents contain similar names or do 
> > they contain many unique names? Specifically, do your documents look like 
> > this? 
> >
> > Doc 1: <doc><elem1/> <elem2/> . . . <elem99999/> <elem100000/></doc> 
> > ... 
> > Doc n: <doc><elem1-n/> <elem2-n/> . . . <elem99999-n/> 
> > <elem100000-n/></doc>
> > 
> No, the data is not like that. There are a decent number of element names 
> as well as some heavily reused elements. The DTDs contain more than 
> 2000 entity declarations.  I'm processing US patent application data 
> from the USPTO using their DTDs:
> 
>     * us-patent-application-v41-2005-08-25.dtd
>     * us-patent-application-v40-2004-12-02.dtd
>     * us-sequence-listing-2004-03-09.dtd
>     * pap-v16-2002-01-01.dtd
>     * pap-v15-2001-01-31.dtd

From a quick perusal these DTDs (including the external entities they 
reference) look very large. It's not just the entity declarations. Just 
about everything in these DTDs which matches the Name production from the 
XML spec gets added to the SymbolTable. I assume each document you parse 
only references one of them. Perhaps it's the sum of the unique names from 
each of the DTDs which leads to your app running out of memory.

> Cheers,
>     Neil Bacon
>     Cambia
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]
