More experimentation indicates that the issue is the DTDs--if I load the same content without DTD parsing, it loads fine and uses the relatively small amount of memory I'd expect.
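For anyone wanting to reproduce the comparison from the command line rather than the GUI, this is a minimal BaseX command-script sketch (database name and path are hypothetical) that loads a directory with BaseX's internal parser and DTD parsing switched off, which corresponds to unchecking "Parse DTDs" in the GUI:

```
SET INTPARSE true
SET DTD false
CREATE DB dita-test /path/to/dita/topics
```

With `INTPARSE true` BaseX uses its own XML parser instead of Java's SAX parser, and `DTD false` tells it not to resolve or parse DTD references at all.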
I think the solution is to turn on Xerces' grammar caching. The only danger there is that different DTDs within the same content set can supply different expansions for the same external parameter entity reference (e.g., the MathML DTDs), which can then lead to validation issues. For this reason the DITA OT makes grammar caching switchable, but on by default.

Another option for DITA content in particular is to use the OT's preprocessing to parse all the docs, then use BaseX with the parsed docs, where all the defaulted attributes have been expanded into the source.

Cheers,
E.
--
Eliot Kimber
http://contrext.com

On 5/4/18, 9:52 AM, "Eliot Kimber" <basex-talk-boun...@mailman.uni-konstanz.de on behalf of ekim...@contrext.com> wrote:

Follow up--I tried giving BaseX the full 16GB of RAM and it still ultimately locked up, with the memory meter showing 13GB. I'm thinking this must be some kind of memory leak.

I tried importing the DITA Open Toolkit's documentation source and that worked fine, with max memory usage of about 2.5GB, but it's only about 250 topics.

Cheers,
E.
--
Eliot Kimber
http://contrext.com

On 5/3/18, 4:59 PM, "Eliot Kimber" <basex-talk-boun...@mailman.uni-konstanz.de on behalf of ekim...@contrext.com> wrote:

In the context of trying to do fun things with DITA docs in BaseX, I downloaded the latest BaseX (9.0.1) and tried creating a new database and loading docs into it using the BaseX GUI. This is on macOS 10.13.4 with 16GB of hardware RAM available.

My corpus is about 4000 DITA topics totaling about 30MB on disk. They are all in a single directory (not my decision), if that matters.
Using the "parse DTDs" option and default indexing options (no token or full-text indexes), I'm finding that even with 12GB of RAM allocated to the JVM, memory usage during load eventually climbs to 12GB, at which point processing appears to stop (that is, whatever I set the max memory to, things stall once it's reached; I only got out-of-memory errors with much lower settings, like the default 2GB). I'm currently running a test with 14GB allocated; it is continuing, but memory does occasionally reach 12GB (watching the memory display on the Add progress panel).

No individual file is that big--the biggest is 150K and typical is 30K or smaller.

I wouldn't expect BaseX to have this kind of memory problem, so I'm wondering if maybe there's an issue with memory on macOS, or with DITA documents in particular (the DITA DTDs are notoriously large)? Should I expect BaseX to be able to load this kind of corpus with 14GB of RAM?

Cheers,
E.
--
Eliot Kimber
http://contrext.com
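To illustrate the "without DTD parsing" workaround outside of BaseX: this is a minimal, self-contained Java sketch (not BaseX's actual loader code) using the JAXP SAX API. The feature URI is the standard Xerces `load-external-dtd` feature, which the JDK's bundled parser also recognizes; with it disabled, a document whose DOCTYPE points at a DTD that isn't even present still parses cleanly, since the parser never tries to fetch or expand the grammar:

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class NoDtdParse {
    public static void main(String[] args) throws Exception {
        // A document whose DOCTYPE references a DTD that does not exist on disk.
        String xml = "<!DOCTYPE topic SYSTEM \"missing.dtd\">"
                   + "<topic><title>t</title></topic>";

        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Xerces feature: do not load (or expand) external DTDs at all.
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd",
            false);
        SAXParser parser = factory.newSAXParser();

        // Count start-element events to show the parse succeeded.
        final int[] elements = {0};
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                elements[0]++;
            }
        });
        System.out.println("elements=" + elements[0]); // 2: topic, title
    }
}
```

The trade-off for DITA is real, though: skipping the DTDs this way also skips attribute defaulting (e.g., the @class attributes DITA processing depends on), which is why preprocessing with the OT first, so the defaults are baked into the serialized source, is the safer route.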