Hi Eliot,

I cannot offer a definitive answer to your question; my guess is that
Xerces itself triggers the OOM. You could add the following Java option
to your BaseX start script and open the resulting JFR (Java Flight
Recorder) file with Java Mission Control:

-XX:StartFlightRecording=filename=dump.jfr
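
If changing the start script is inconvenient, a recording can also be
started from Java code around the import itself. The following is only a
minimal sketch, assuming JDK 11 or later with the jdk.jfr module
available; the class and method names (FlightRecordingSketch, record)
are made up for illustration and are not part of BaseX:

```java
import java.nio.file.Path;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

/** Hypothetical helper, not part of BaseX. */
class FlightRecordingSketch {
  /** Runs the task while a Flight Recorder recording is active and writes it to target. */
  static Path record(Runnable task, Path target) throws Exception {
    // "profile" is one of the two configurations shipped with the JDK ("default", "profile")
    Configuration profile = Configuration.getConfiguration("profile");
    try (Recording recording = new Recording(profile)) {
      recording.start();
      try {
        task.run();
      } finally {
        recording.stop();
      }
      // Write the recorded events to the target file
      recording.dump(target);
    }
    return target;
  }
}
```

The task passed in would be whatever triggers the parsing; the resulting
file can be opened in Java Mission Control like one produced by the
command-line option.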

Feel free to keep us updated,
Christian



On Mon, Feb 6, 2023 at 3:34 AM Eliot Kimber <eliot.kim...@servicenow.com> wrote:
>
> I’ve worked out how to add a Xerces grammar cache to the XML parser. For 
> current code from GitHub, I did this in SAXParser:
>
>
>
> public void parse() throws IOException {
>   final InputSource is = inputSource();
>   final SAXSource saxs = new SAXSource(is);
>   XMLReader reader = null;
>   try {
>     reader = saxs.getXMLReader();
>     if(reader == null) {
>       reader = XmlParser.reader(options.get(MainOptions.DTD),
>           options.get(MainOptions.XINCLUDE));
>     }
>     final EntityResolver er = Resolver.entities(options);
>     if(er != null) reader.setEntityResolver(er);
>
>     saxh = new SAXHandler(builder, options.get(MainOptions.STRIPWS),
>         options.get(MainOptions.STRIPNS));
>     reader.setDTDHandler(saxh);
>     reader.setContentHandler(saxh);
>     reader.setProperty("http://xml.org/sax/properties/lexical-handler", saxh);
>
>     reader.setErrorHandler(saxh);
>     if (options.get(MainOptions.DTD)) {
>        XMLGrammarPool pool = getGrammarPool();
>        try {
>          reader.setProperty(FEATURE_GRAMMAR_POOL, pool);
>        } catch (final NoClassDefFoundError e) {
>          // ignored: parse without the grammar pool
>        } catch (final SAXNotRecognizedException | SAXNotSupportedException e) {
>          // ignored: parse without the grammar pool
>        }
>     }
>
>     reader.parse(is);
>
>     ...
>
>
>
> Where getGrammarPool() is simply:
>
> private static XMLGrammarPool getGrammarPool() {
>   XMLGrammarPool pool = grammarPool;
>
>   if (pool == null) {
>     pool = new XMLGrammarPoolImpl();
>     grammarPool = pool; //.set(pool);
>   }
>   return pool;
> }
>
>
>
> When I use the grammar cache to parse DITA docs (setting parse DTDs to true 
> and specifying an XML catalog from DITA Open Toolkit) I see the expected 
> speedup.
>
>
>
> For example, on a small set of about 2400 maps and topics, no-DTD parsing 
> takes 7 seconds, grammar-cache parsing takes 20 seconds, and no-grammar-cache 
> DTD parsing takes about 2.5 minutes. So roughly an 8x improvement (which is 
> what I’ve measured using the same grammar cache with Saxon, for example).
>
>
>
> However, using the grammar cache causes some kind of extreme memory leak and 
> I have no idea what it is.
>
>
>
> Without the grammar cache, parsing these topics requires only a few hundred
> megabytes of memory, with or without DTD parsing. With the grammar cache,
> memory usage starts at about 1GB and goes up from there. Parsing my full set
> of 40K maps and topics, memory grows by 1GB every 30 seconds or so until it
> eventually exceeds even the 14GB I allocated in my last test. The initial 1GB
> could be explained by the cache itself, which holds the parsed grammars.
>
>
>
> Using the debugger, I can see that the grammar cache itself is static once 
> it’s populated with grammars (for my set it ends up loading 10 parsed 
> grammars), so the grammar cache itself doesn’t seem to be the problem.
>
>
>
> I’m trying to use VisualVM to profile the memory but this is not something I 
> have done before and I’m not sure what classes I should be focusing on.
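
One way to find out which classes to look at is to write a heap dump
once memory has grown noticeably and open the .hprof file in VisualVM,
sorting the class view by retained size. A minimal sketch, assuming a
HotSpot-based JVM; HeapDumpSketch and dumpLiveHeap are hypothetical
names, not BaseX code:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of BaseX. */
class HeapDumpSketch {
  /** Writes a dump of all reachable (live) objects to target, which should end in ".hprof". */
  static Path dumpLiveHeap(Path target) throws IOException {
    // dumpHeap fails if the file already exists, so remove any earlier dump first
    Files.deleteIfExists(target);
    HotSpotDiagnosticMXBean bean =
        ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
    // true = dump only live objects, so garbage that is merely uncollected is left out
    bean.dumpHeap(target.toString(), true);
    return target;
  }
}
```

Calling this after a few thousand documents have been parsed, and again
somewhat later, makes it possible to compare which classes keep growing.
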
>
>
>
> So my questions:
>
>
>
> 1. Any idea why the simple addition of the grammar cache would cause this
>    kind of memory leak?
> 2. Any guidance on what classes I should focus on to find the culprit?
>
>
>
> My reason for using the grammar cache is to have all the default attributes 
> populated in the database without requiring two hours to load my content 
> (which is what I’ve measured in the past for DTD-aware parsing of my 40K 
> content set).
>
>
>
> Another solution, specific to DITA, would be to use a custom SAX parser that 
> injects the default attributes based on static configuration (for a given set 
> of DITA grammars we know what the defaults will be for every element type and 
> can easily generate the configuration a SAX parser would use). But the 
> current code doesn’t seem to provide an easy way to swap in a custom SAX 
> handler and I’m not really in a position to try to add that level of 
> sophistication to the code.
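
The static-configuration idea described above could look roughly like
the following SAX filter. This is only a sketch built on the JDK's
XMLFilterImpl, not something BaseX provides; DefaultAttributeFilter and
firstElementAttribute are hypothetical names, and the defaults map is
illustrative rather than generated from real DITA grammars:

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: adds configured default attributes that a document omits. */
class DefaultAttributeFilter extends XMLFilterImpl {
  /** Element name mapped to (attribute name mapped to default value). */
  private final Map<String, Map<String, String>> defaults;

  DefaultAttributeFilter(XMLReader parent, Map<String, Map<String, String>> defaults) {
    super(parent);
    this.defaults = defaults;
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts)
      throws SAXException {
    Map<String, String> elementDefaults = defaults.get(qName);
    if (elementDefaults == null) {
      super.startElement(uri, localName, qName, atts);
      return;
    }
    // Copy the parsed attributes and add only those defaults that are not already present
    AttributesImpl merged = new AttributesImpl(atts);
    for (Map.Entry<String, String> entry : elementDefaults.entrySet()) {
      if (merged.getIndex(entry.getKey()) < 0) {
        merged.addAttribute("", entry.getKey(), entry.getKey(), "CDATA", entry.getValue());
      }
    }
    super.startElement(uri, localName, qName, merged);
  }

  /** Demo helper: parses xml through the filter and returns an attribute of the first element. */
  static String firstElementAttribute(String xml, String attribute,
      Map<String, Map<String, String>> defaults) throws Exception {
    XMLReader base = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    DefaultAttributeFilter filter = new DefaultAttributeFilter(base, defaults);
    String[] value = new String[1];
    boolean[] seen = new boolean[1];
    filter.setContentHandler(new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!seen[0]) {
          seen[0] = true;
          value[0] = atts.getValue(attribute);
        }
      }
    });
    filter.parse(new InputSource(new StringReader(xml)));
    return value[0];
  }

  public static void main(String[] args) throws Exception {
    Map<String, Map<String, String>> defaults = Map.of("p", Map.of("class", "- topic/p "));
    System.out.println(firstElementAttribute("<p>text</p>", "class", defaults));
  }
}
```

Because no DTD is read at all, this would keep the speed of no-DTD
parsing, at the cost of having to regenerate the configuration whenever
the grammars change. Attributes that are present in the document are
left untouched.
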
>
>
>
> Modulo this memory issue, the grammar cache is a nice simple solution to the
> DTD parsing requirement that is general to any content set that has
> consistent DTDs or XSDs across the set of documents to be parsed.
>
>
>
> Thanks,
>
>
>
> Eliot
>
>
>
> _____________________________________________
>
> Eliot Kimber
>
> Sr Staff Content Engineer
>
> O: 512 554 9368
>
> M: 512 554 9368
>
> servicenow.com
>
