Michael Glavassevich wrote:
Perhaps this behaviour could be affected by my use of
org.apache.xerces.util.XMLCatalogResolver?
How are you using it?
XMLReader r = factory.newSAXParser().getXMLReader();
r.setEntityResolver(entityResolver);
with catalog.xml containing:
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<!-- US applications use "us-sequence-listing.dtd"
grants use "us-sequence-listing-2004-03-09.dtd"
I've only found the latter at the USPTO, so we make the former
refer to the latter.
-->
<system systemId="us-sequence-listing.dtd"
uri="dtd/us-sequence-listing-2004-03-09.dtd"/>
<!-- works with apache xerces XMLCatalogResolver -->
<rewriteSystem systemIdStartString="c:\pap\dtds\entities\"
rewritePrefix="dtd/entities/"/>
<rewriteSystem systemIdStartString="c:\pap\dtds\" rewritePrefix="dtd/"/>
<rewriteSystem systemIdStartString=".\entities\"
rewritePrefix="dtd/entities/"/>
<rewriteSystem systemIdStartString=".\" rewritePrefix="dtd/"/>
<rewriteSystem systemIdStartString="" rewritePrefix="dtd/"/>
</catalog>
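To make the intent of those entries concrete, here is a rough sketch of the mapping I expect from them. It is not how XMLCatalogResolver is implemented; the class SystemIdRewriter and its methods are made up for illustration. It assumes an exact `system` match wins first and that otherwise the longest matching `systemIdStartString` is used, and it does no normalization of the system identifier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: mimics the effect of the <system> and <rewriteSystem>
// entries in the catalog above. Illustration only, not the Xerces code.
class SystemIdRewriter {
    private final Map<String, String> exact = new LinkedHashMap<>();
    private final Map<String, String> prefixes = new LinkedHashMap<>();

    SystemIdRewriter system(String systemId, String uri) {
        exact.put(systemId, uri);
        return this;
    }

    SystemIdRewriter rewriteSystem(String startString, String rewritePrefix) {
        prefixes.put(startString, rewritePrefix);
        return this;
    }

    // Exact <system> entries first, then the longest matching prefix.
    String resolve(String systemId) {
        String uri = exact.get(systemId);
        if (uri != null) {
            return uri;
        }
        String best = null;
        for (String p : prefixes.keySet()) {
            if (systemId.startsWith(p) && (best == null || p.length() > best.length())) {
                best = p;
            }
        }
        if (best == null) {
            return systemId; // no rule applies, leave the identifier alone
        }
        return prefixes.get(best) + systemId.substring(best.length());
    }

    // The same rules as the catalog.xml quoted above.
    static SystemIdRewriter fromPostedCatalog() {
        return new SystemIdRewriter()
            .system("us-sequence-listing.dtd", "dtd/us-sequence-listing-2004-03-09.dtd")
            .rewriteSystem("c:\\pap\\dtds\\entities\\", "dtd/entities/")
            .rewriteSystem("c:\\pap\\dtds\\", "dtd/")
            .rewriteSystem(".\\entities\\", "dtd/entities/")
            .rewriteSystem(".\\", "dtd/")
            .rewriteSystem("", "dtd/");
    }
}
```

The empty start string acts as a catch-all, so a bare file name such as pap-v16-2002-01-01.dtd ends up under dtd/.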
I'm processing US patent application data
from the USPTO using their DTDs:
* us-patent-application-v41-2005-08-25.dtd
* us-patent-application-v40-2004-12-02.dtd
* us-sequence-listing-2004-03-09.dtd
* pap-v16-2002-01-01.dtd
* pap-v15-2001-01-31.dtd
From a quick perusal, these DTDs (including the external entities they
reference) look very large. It's not just the entity declarations. Just
about everything in these DTDs that matches the Name production from the
XML spec gets added to the SymbolTable. I assume each document you parse
only references one of them. Perhaps it's the sum of the unique names from
all of the DTDs that leads to your app running out of memory.
Yes, they are quite large; however, I still think there is a problem because:
1) even when using "java -Xmx7000M" (that's 7 salesman's gigabytes) it
falls over (whereas 300 MB is enough if I use a new parser for each doc);
2) profiling shows that symbol table entries exist with a continuously
growing number of different garbage collection generations (new entries
are continuously being added without the old ones being cleaned up). If
the cache were working, new entries would not be created once each DTD had
been read once.
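For reference, the "new parser for each doc" variant that stays within 300 MB looks roughly like the sketch below. The class and method names are made up for illustration, and the resolver argument stands in for whatever EntityResolver is in use (it may be null). The only point is that no XMLReader, and so none of its internal state, survives from one document to the next.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: parse each document with a freshly created parser.
class FreshParserPerDocument {
    static void parse(SAXParserFactory factory,
                      EntityResolver resolver,
                      InputSource doc,
                      DefaultHandler handler) throws Exception {
        // A new SAXParser (and XMLReader) per call; nothing is reused.
        XMLReader r = factory.newSAXParser().getXMLReader();
        if (resolver != null) {
            r.setEntityResolver(resolver);
        }
        r.setContentHandler(handler);
        r.setErrorHandler(handler);
        r.parse(doc);
        // r goes out of scope here, so it can be collected along with
        // whatever it built up while parsing this document.
    }
}
```

This gives up the benefit of reading each DTD only once, which is why I would rather have the reused parser working.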
Is it possible that I'm messing things up by having xercesImpl-2.8.0 in
the classpath without pointing to it with -Djava.endorsed.dirs?
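One way to check which implementation is actually being picked up is to print the class names and where they were loaded from. This is only a diagnostic sketch; ParserProbe is a made-up name. I would expect org.apache.xerces.* class names if the jar on the classpath is used and com.sun.org.apache.xerces.internal.* if the JDK's bundled copy is used, but that expectation is an assumption to be confirmed by running it.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical diagnostic: report which JAXP/SAX classes are in use.
class ParserProbe {
    // Class name plus the jar or directory it came from, if known.
    static String describe(Object o) {
        Class<?> c = o.getClass();
        CodeSource src = c.getProtectionDomain().getCodeSource();
        String where = (src == null || src.getLocation() == null)
            ? "(no code source, probably the JDK itself)"
            : src.getLocation().toString();
        return c.getName() + " from " + where;
    }

    static String describeSaxStack() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        return "factory: " + describe(factory) + "\n"
             + "reader:  " + describe(factory.newSAXParser().getXMLReader());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describeSaxStack());
    }
}
```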
Cheers,
Neil.