I tried again, this time with SPLITSIZE = 12 in the .basex config file.
The batch (console) script I used is attached as mass-import.xq.
This time I didn't run the optimize or index creation after the import;
instead, I did it as part of the import, similar to what
is described in [4].
I got a different error this time: "org.basex.core.BaseXException:
Out of Main Memory."
So right now I'm a bit out of ideas. Would AUTOOPTIMIZE make any
difference here?
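
If nothing else works, the fallback would be the sharding idea from my earlier message (quoted below): split the documents into chunks and load each chunk into its own db with a common prefix and a varying suffix. A minimal sketch of how the chunks could be planned, assuming a simple greedy grouping by file size. The function name `plan_shards`, the `lq` prefix, the size budget and the sample file sizes are all made up for illustration; none of this is a BaseX API.

```python
from typing import Iterable, List, Tuple


def plan_shards(
    files: Iterable[Tuple[str, int]],
    max_bytes: int,
    prefix: str = "lq",  # hypothetical database name prefix
) -> List[Tuple[str, List[str]]]:
    """Greedily group (path, size_in_bytes) pairs into shards.

    Each shard holds at most max_bytes in total, except that a single
    file larger than max_bytes still gets a shard of its own.
    Returns (database_name, paths) pairs such as ("lq-000", [...]).
    """
    shards: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Close the current shard if adding this file would exceed the budget.
        if current and current_size + size > max_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        shards.append(current)
    return [(f"{prefix}-{i:03d}", paths) for i, paths in enumerate(shards)]


# Made-up sizes, only to show the grouping behaviour.
sample_files = [("a.xml", 40), ("b.xml", 50), ("c.xml", 30), ("d.xml", 120), ("e.xml", 10)]
plan = plan_shards(sample_files, max_bytes=100)
for name, paths in plan:
    print(name, paths)
```

Each planned shard would then be created as a separate database, small enough that the index build hopefully stays within the 2GB heap, and all of them queried together as the wiki passage quoted below suggests. Whether this actually avoids the OOM is something I would still have to test.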


[4] http://docs.basex.org/wiki/Indexes#Performance

On Wed, Oct 2, 2019 at 11:06 AM first name last name <randomcod...@gmail.com>
wrote:

> Hey Christian,
> Thank you for your answer :)
> I tried setting SPLITSIZE = 24000 in .basex, but I've seen the same OOM
> behavior. Memory consumption looks moderate until the db
> reaches about 30GB (its size before optimize), and
> then memory consumption spikes and the OOM occurs. Now I'm trying with
> SPLITSIZE = 1000 and will report back if I get OOM again.
> Regarding what you said, it might be that the merge step is where the OOM
> occurs (I wonder if there's any way to control how much memory is being
> used inside the merge step).
> To quote the statistics page from the wiki:
>     Databases <http://docs.basex.org/wiki/Databases> in BaseX are
> light-weight. If a database limit is reached, you can distribute your
> documents across multiple database instances and access all of them with a
> single XQuery expression.
> This sounds like sharding to me. I could probably split the
> documents into chunks and upload each chunk to a db with the same prefix
> but a varying suffix, much like shards. By doing this
> I think I can avoid the OOM, but if BaseX provides other, better, maybe
> native mechanisms for avoiding OOM, I would try them.
> Best regards,
> Stefan
> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün <christian.gr...@gmail.com>
> wrote:
>> Hi first name,
>> If you optimize your database, the indexes will be rebuilt. In this
>> step, the builder tries to guess how much free memory is still
>> available. If memory is exhausted, parts of the index will be split
>> (i. e., partially written to disk) and merged in a final step.
>> However, you can circumvent the heuristics by manually assigning a
>> static split value; see [1] for more information. If you use the DBA,
>> you’ll need to assign this value to your .basex or the web.xml file
>> [2]. In order to find the best value for your setup, it may be easier
>> to play around with the BaseX GUI.
>> As you have already seen in our statistics, an XML document has
>> various properties that may represent a limit for a single database.
>> Accordingly, these properties make it difficult to decide for the
>> system when the memory will be exhausted during an import or index
>> rebuild.
>> In general, you’ll get best performance (and your memory consumption
>> will be lower) if you create your database and specify the data to be
>> imported in a single run. This is currently not possible via the DBA;
>> use the GUI (Create Database) or console mode (CREATE DB command)
>> instead.
>> Hope this helps,
>> Christian
>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>> [2] http://docs.basex.org/wiki/Configuration
>> On Mon, Sep 30, 2019 at 7:09 AM first name last name
>> <randomcod...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Let's say there's a 30GB dataset [3] containing most threads/posts
>> > from [1].
>> > After importing all of it, when I try to run /dba/db-optimize/ on it
>> > (which must have some corresponding command) I get the OOM error in
>> > the stacktrace attached. I am using -Xmx2g, so BaseX is limited to
>> > 2GB of memory (the machine I'm running this on doesn't have a lot of
>> > memory).
>> > I was looking at [2] for some estimates of peak memory usage for this
>> > "db-optimize" operation, but couldn't find any.
>> > Actually it would be nice to know peak memory usage because, of
>> > course, for any database (including BaseX) a common operation is
>> > server sizing, to know what kind of server would be needed.
>> > In this case, it seems like 2GB of memory is enough to import 340k
>> > documents, weighing in at 30GB total, but it's not enough to run
>> > "db-optimize".
>> > Is there any info about peak memory usage on [2]? And are there
>> > guidelines for large-scale collection imports like the one I'm
>> > trying to do?
>> >
>> > Thanks,
>> > Stefan
>> >
>> > [1] https://www.linuxquestions.org/
>> > [2] http://docs.basex.org/wiki/Statistics
>> > [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
..;..;...;..;..;..;... 1.659934928E7 ms (657 MB)
Indexing Text...
.|....|....|....|....|....|.....|....|....|....|.. 160.09 M operations, 642947.17 ms (438 MB).
Indexing Attribute Values...
 585.55 M operations, 2768524.64 ms (430 MB).
Indexing Tokens...
 757.99 M operations, 4394525.69 ms (647 MB).
Indexing Full-Text...
yError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at org.basex.util.list.ByteList.add(ByteList.java:67)
        at org.basex.util.list.ByteList.add(ByteList.java:55)
        at org.basex.index.ft.FTBuilder.merge(FTBuilder.java:236)
        at org.basex.index.ft.FTBuilder.write(FTBuilder.java:147)
        at org.basex.index.ft.FTBuilder.build(FTBuilder.java:86)
        at org.basex.index.ft.FTBuilder.build(FTBuilder.java:1)
        at org.basex.data.DiskData.createIndex(DiskData.java:198)
        at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:100)
        at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:88)
        at org.basex.core.cmd.CreateDB$1.run(CreateDB.java:116)
        at org.basex.core.cmd.ACreate.update(ACreate.java:90)
        at org.basex.core.cmd.CreateDB.run(CreateDB.java:113)
        at org.basex.core.Command.run(Command.java:257)
        at org.basex.core.Command.execute(Command.java:93)
        at org.basex.server.ClientListener.run(ClientListener.java:140)
org.basex.core.BaseXException: Out of Main Memory.
You can increase Java's heap size with the flag -Xmx<size>.
        at org.basex.core.Command.execute(Command.java:94)
        at org.basex.server.ClientListener.run(ClientListener.java:140)

Attachment: mass-import.xq
Description: Binary data
