More experimentation indicates that the issue is the DTDs: if I load the same 
content without DTD parsing, it loads fine and uses the expected, relatively 
small amount of memory.
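For anyone trying to reproduce this, the command-line equivalent is something 
like the following (a sketch; the database name and path are placeholders, and 
DTD is the option behind the GUI's "Parse DTDs" checkbox):

```shell
# Load the corpus with DTD parsing switched off (false is the default,
# but setting it explicitly makes the comparison with the DTD run obvious)
basex -c "SET DTD false; CREATE DB ditadocs /path/to/topics"
```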

I think the solution is to turn on Xerces' grammar caching. The only danger 
there is that different DTDs within the same content set can have different 
expansions for the same external parameter entity reference (e.g., the MathML 
DTDs), which can then lead to validation issues. For this reason the DITA OT 
makes its use of the grammar cache switchable, but on by default.
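I haven't verified this against BaseX 9.0.1, but the usual way to get caching 
is to route parsing through Xerces' stock grammar-caching configuration via a 
system property, roughly like this (a sketch; paths, the database name, and the 
classpath layout are assumptions, and INTPARSE false is needed so BaseX hands 
parsing to the Java SAX parser rather than its internal one):

```shell
# Sketch: run BaseX with Xerces as the SAX parser and select Xerces'
# grammar-caching parser configuration. Requires xercesImpl.jar.
java -cp basex.jar:xercesImpl.jar \
  -Dorg.apache.xerces.xni.parser.XMLParserConfiguration=org.apache.xerces.parsers.XMLGrammarCachingConfiguration \
  org.basex.BaseX -c "SET INTPARSE false; SET DTD true; CREATE DB ditadocs /path/to/topics"
```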

Another option for DITA content in particular is to use the OT's preprocessing 
to parse all the docs, then load the preprocessed docs into BaseX. At that 
point all the DTD-declared default attributes have been expanded into the 
source, so the DTDs are no longer needed.
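That workflow would look roughly like this (a sketch; the map and output paths 
are placeholders, and it assumes an OT recent enough to provide the dita 
command and the "dita" (normalized DITA) transtype):

```shell
# 1. Normalize: the "dita" transtype resolves DTD-declared default
#    attributes into the topic instances
dita --input=/path/to/main.ditamap --format=dita --output=/path/to/normalized

# 2. Load the normalized docs into BaseX with DTD parsing off
basex -c "SET DTD false; CREATE DB ditadocs /path/to/normalized"
```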

Cheers,

E.
--
Eliot Kimber
http://contrext.com
 

On 5/4/18, 9:52 AM, "Eliot Kimber" <basex-talk-boun...@mailman.uni-konstanz.de 
on behalf of ekim...@contrext.com> wrote:

    Follow-up: I tried giving BaseX the full 16GB of RAM and it still 
ultimately locked up, with the memory meter showing 13GB.
    
    I'm thinking this must be some kind of memory leak. 
    
    I tried importing the DITA Open Toolkit's documentation source and that 
worked fine, with maximum memory use of about 2.5GB--but that's only about 250 
topics.
    
    Cheers,
    
    E.
    
    --
    Eliot Kimber
    http://contrext.com
     
    On 5/3/18, 4:59 PM, "Eliot Kimber" 
<basex-talk-boun...@mailman.uni-konstanz.de on behalf of ekim...@contrext.com> 
wrote:
    
        In the context of trying to do fun things with DITA docs in BaseX I 
downloaded the latest BaseX (9.0.1) and tried creating a new database and 
loading docs into it using the BaseX GUI. This is on macOS 10.13.4 with 16GB of 
hardware RAM available.
        
        My corpus is about 4000 DITA topics totaling about 30MB on disk. They 
are all in a single directory (not my decision) if that matters.
        
        Using the "Parse DTDs" option and default indexing options (no token or 
full-text indexes), I'm finding that even with 12GB of RAM allocated to the JVM, 
memory usage during load eventually climbs to 12GB, at which point processing 
appears to stop. That is, whatever I set the max memory to, things stop when it 
is reached; I only got out-of-memory errors when I had much lower settings, like 
the default 2GB.
        
        I'm currently running a test with 14GB allocated. It is still going, but 
memory does occasionally hit 12GB (watching the memory display on the Add 
progress panel).
        
        No individual file is that big--the biggest is 150K and a typical file 
is 30K or smaller.
        
        I wouldn't expect BaseX to have this kind of memory problem, so I'm 
wondering whether there's an issue with memory on macOS or with DITA documents 
in particular (the DITA DTDs are notoriously large).
        
        Should I expect BaseX to be able to load this kind of corpus with 14GB 
of RAM?
        
        Cheers,
        
        E.
        --
        Eliot Kimber
        http://contrext.com
         
        