Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can allocate on that system for BaseX) and I got OOM.
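For reference, a heap limit like that can be handed to the BaseX server in a couple of ways. Whether the bundled start scripts read a BASEX_JVM environment variable depends on the distribution, so treat that part as an assumption and otherwise edit the -Xmx value in bin/basexserver directly:

    # via the start script, assuming it honors BASEX_JVM
    export BASEX_JVM="-Xmx3100m"
    bin/basexserver

    # or by calling the JVM directly (adjust the jar path to your installation)
    java -Xmx3100m -cp basex.jar org.basex.BaseXServer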
On Sat, Oct 5, 2019 at 2:19 AM Christian Grün <christian.gr...@gmail.com> wrote:

> About option 1: How much memory have you been able to assign to the Java VM?
>
> On Sat, Oct 5, 2019 at 1:11 AM first name last name <randomcod...@gmail.com> wrote:
>
>> I had another look at the script I wrote and realized that it's not working as it's supposed to. Apparently the order of operations should be this:
>> - turn on all the types of indexes required
>> - create the db
>> - apply the parser settings and the filter settings
>> - add all the files to the db
>> - run "OPTIMIZE"
>>
>> If I don't do them in this order (specifically, with "OPTIMIZE" at the end), the resulting db lacks all indexes.
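For reference, a minimal command-script sketch of that order might look like the following; the database name and the HTML parser/filter values are placeholders rather than the contents of the actual script:

    SET TEXTINDEX true
    SET ATTRINDEX true
    SET FTINDEX true
    CREATE DB linuxquestions
    SET PARSER html
    SET CREATEFILTER *.html
    ADD /share/Public/archive/tech-sites/linuxquestions.org/
    OPTIMIZE

After the ADD calls, the value and full-text indexes are only brought up to date by the final OPTIMIZE, which would explain why leaving it out left the database without usable indexes.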
>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <randomcod...@gmail.com> wrote:
>>
>>> Hi Christian,
>>>
>>> About option 4:
>>> I agree with the options you laid out, and I am currently diving deeper into option 4 from your list. Regarding the partitioning strategy, I agree. I did, however, manage to partition the files to be imported into separate sets, with a constraint on the maximum partition size (on disk) and the maximum partition file count (the number of XML documents in each partition). The tool fpart [5] made this possible (I can imagine more sophisticated bin-packing methods, involving pre-computed node counts and other variables, could be achieved via glpk [6], but that might be too much work).
>>> Currently I am experimenting with a maximum partition size of 2.4GB and a maximum file count of 85k files; fpart split the file list into 11 partitions of ~33k files and ~2.4GB each.
>>> I wrote a script for this, called sharded-import.sh and attached here. I'm also noticing that the /dba/ BaseX web interface is no longer blocked while this script runs, unlike the previous import, where I ran
>>>     CREATE DB db_name /directory/
>>> so I can watch the progress or run queries before the big import finishes. The downside is that it's more verbose and prints a ton of lines like
>>>     ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/viewtopic_9_356613.html
>>>     Resource(s) added in 47.76 ms.
>>> along the way, and maybe that's slower than before.
>>>
>>> About option 1:
>>> Re: increasing memory, I am running these experiments on a low-memory, old network-attached storage box, a QNAP TS-451+ [7] [8], which I had to take apart with a screwdriver to add 2GB of RAM (so it now has 4GB), and I can't find any additional memory sticks around the house to take it up to 8GB (which is also the maximum it supports). If I want to buy 2 x 4GB sticks, the memory frequency has to match what the box supports, and I'm having trouble finding the exact model; Corsair says it has compatible sticks, but I'd have to wait weeks for them to ship to Bucharest, where I live.
>>> Buying an Intel NUC that goes up to 64GB of memory seems a bit too expensive at $1639 [9], but people on reddit [10] were discussing, some years back, this Supermicro server [11], which costs only $668 and can also take up to 64GB of memory.
>>> Basically I would buy something cheap that I can pack with a lot of RAM, but a hands-off approach would be best here, so something that comes pre-equipped with all the memory would be nice (it would spare me the trouble of buying the memory separately and making sure it matches the motherboard specs, etc.).
>>>
>>> About option 2:
>>> In fact, that's a great idea, but it would require me to write something that figures out the XPath patterns where the actual content sits. I wanted to look for an algorithm designed to do that and try to implement it, but I had no time. It would have to detect the repetitive boilerplate nodes and build XPaths for the remaining nodes, where the actual content sits. I think this is equivalent to computing the "web template" of a website, given all its pages. It would definitely decrease the size of the content that has to be indexed.
>>> By the way, I'm describing a more general procedure here, because it's not just this dataset that I want to import; I want to import heavy, large amounts of data :)
>>>
>>> These are my thoughts for now.
>>>
>>> [5] https://github.com/martymac/fpart
>>> [6] https://www.gnu.org/software/glpk/
>>> [7] https://www.amazon.com/dp/B015VNLGF8
>>> [8] https://www.qnap.com/en/product/ts-451+
>>> [9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
>>> [10] https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
>>> [11] https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E
>>>
>>> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>>
>>>> Exactly, it seems to be the final MERGE step during index creation that blows up your system. If you are restricted to 2 GB of main memory, this is what you could try next:
>>>>
>>>> 1. Did you already try to tweak the JVM memory limit via -Xmx? What's the largest value that you can assign on your system?
>>>>
>>>> 2. If you will query only specific values of your data sets, you can restrict your indexes to specific elements or attributes; this will reduce memory consumption (see [1] for details). If you observe that no indexes are utilized in your queries anyway, you can simply disable the text and attribute indexes, and memory usage will shrink even more.
>>>>
>>>> 3. Create your database on a more powerful system [2] and move it to your target machine (this only makes sense if there's no need for further updates).
>>>>
>>>> 4. Distribute your data across multiple databases. In some way, this is comparable to sharding; it cannot be automated, though, as the partitioning strategy depends on the characteristics of your XML input data (some people have huge standalone documents, others have millions of small documents, …).
>>>>
>>>> [1] http://docs.basex.org/wiki/Indexes
>>>> [2] A single CREATE call may be sufficient: CREATE DB database sample-data-for-basex-mailing-list-linuxquestions.org.tar.gz
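The attached sharded-import.sh is not reproduced in this thread, so the following is only a rough sketch of the approach under stated assumptions: fpart has written its partition lists to part.0, part.1, ..., one database is created per partition, and the commands are piped to the basex console:

    #!/bin/sh
    # One BaseX database per fpart partition list (part.0, part.1, ...).
    # Paths are assumed to contain no spaces.
    i=0
    for list in part.*; do
      {
        echo "SET TEXTINDEX true"
        echo "SET ATTRINDEX true"
        echo "SET FTINDEX true"
        echo "CREATE DB linuxquestions_$i"
        echo "SET PARSER html"
        echo "SET CREATEFILTER *.html"
        while read -r f; do
          echo "ADD $f"
        done < "$list"
        echo "OPTIMIZE"
      } | basex
      i=$((i+1))
    done

Each partition then gets its own, much smaller OPTIMIZE run, so no single index merge has to fit the whole 30GB collection into the heap.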
>>>> On Thu, Oct 3, 2019 at 8:53 AM first name last name <randomcod...@gmail.com> wrote:
>>>>
>>>>> I tried again, using SPLITSIZE = 12 in the .basex config file. The batch (console) script I used is attached as mass-import.xq. This time I didn't do the optimize or index creation post-import; instead, I did it as part of the import, similar to what is described in [4]. This time I got a different error: "org.basex.core.BaseXException: Out of Main Memory."
>>>>> So right now I'm a bit out of ideas. Would AUTOOPTIMIZE make any difference here?
>>>>>
>>>>> Thanks
>>>>>
>>>>> [4] http://docs.basex.org/wiki/Indexes#Performance
>>>>>
>>>>> On Wed, Oct 2, 2019 at 11:06 AM first name last name <randomcod...@gmail.com> wrote:
>>>>>
>>>>>> Hey Christian,
>>>>>>
>>>>>> Thank you for your answer :)
>>>>>> I tried setting SPLITSIZE = 24000 in .basex, but I've seen the same OOM behavior. It looks like the memory consumption is moderate until it reaches about 30GB (the size of the db before optimize), and then memory consumption spikes and OOM occurs. Now I'm trying with SPLITSIZE = 1000 and will report back if I get OOM again. Regarding what you said, it might be that the merge step is where the OOM occurs (I wonder if there's any way to control how much memory is used inside the merge step).
>>>>>>
>>>>>> To quote the statistics page from the wiki: "Databases in BaseX are light-weight. If a database limit is reached, you can distribute your documents across multiple database instances and access all of them with a single XQuery expression."
>>>>>> This sounds a lot like sharding to me. I would probably be able to split the documents into chunks and upload them into databases with the same prefix but a varying suffix, which seems a lot like shards. By doing this I think I can avoid OOM, but if BaseX provides other, better, maybe native mechanisms for avoiding OOM, I would try them.
>>>>>>
>>>>>> Best regards,
>>>>>> Stefan
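As a sketch of that "single XQuery expression across multiple databases" idea, with the linuxquestions_ database prefix being an assumption carried over from the sharding sketch above:

    (: run one query across all shard databases; here, count the documents in each :)
    for $db in db:list()[starts-with(., 'linuxquestions_')]
    return $db || ': ' || count(db:open($db))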
>>>>>> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi first name,
>>>>>>>
>>>>>>> If you optimize your database, the indexes will be rebuilt. In this step, the builder tries to guess how much free memory is still available. If memory is exhausted, parts of the index will be split (i.e., partially written to disk) and merged in a final step. However, you can circumvent the heuristics by manually assigning a static split value; see [1] for more information. If you use the DBA, you'll need to assign this value in your .basex or web.xml file [2]. In order to find the best value for your setup, it may be easier to play around with the BaseX GUI.
>>>>>>>
>>>>>>> As you have already seen in our statistics, an XML document has various properties that may represent a limit for a single database. Accordingly, these properties make it difficult for the system to decide when memory will be exhausted during an import or index rebuild.
>>>>>>>
>>>>>>> In general, you'll get the best performance (and your memory consumption will be lower) if you create your database and specify the data to be imported in a single run. This is currently not possible via the DBA; use the GUI (Create Database) or console mode (CREATE DB command) instead.
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>> Christian
>>>>>>>
>>>>>>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>>>>>>> [2] http://docs.basex.org/wiki/Configuration
>>>>>>>
>>>>>>> On Mon, Sep 30, 2019 at 7:09 AM first name last name <randomcod...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Let's say there's a 30GB dataset [3] containing most threads/posts from [1]. After importing all of it, when I try to run /dba/db-optimize/ on it (which must have some corresponding command), I get the OOM error in the attached stack trace. I am using -Xmx2g, so BaseX is limited to 2GB of memory (the machine I'm running this on doesn't have a lot of memory).
>>>>>>>> I was looking at [2] for some estimates of peak memory usage for this "db-optimize" operation, but couldn't find any. It would actually be nice to know the peak memory usage because, for any database (including BaseX), a common task is server sizing: knowing what kind of server would be needed. In this case, it seems like 2GB of memory is enough to import 340k documents weighing in at 30GB total, but not enough to run "db-optimize".
>>>>>>>> Is there any info about peak memory usage on [2]? And are there guidelines for large-scale collection imports like the one I'm attempting?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> [1] https://www.linuxquestions.org/
>>>>>>>> [2] http://docs.basex.org/wiki/Statistics
>>>>>>>> [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
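Worth noting: the stack trace below comes from the full-text index builder (FTBuilder), so Christian's suggestion to restrict the indexes to specific elements, or to disable the ones that are never queried, applies directly. A sketch in command syntax, with placeholder element and attribute names that would have to match what the queries actually use:

    # restrict the value and full-text indexes to selected names (placeholders),
    # or disable the full-text index entirely with: SET FTINDEX false
    SET TEXTINCLUDE title
    SET ATTRINCLUDE id
    SET FTINDEX true
    SET FTINCLUDE post
    CREATE DB linuxquestions /share/Public/archive/tech-sites/linuxquestions.org/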
Error: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:3236)
  at org.basex.util.list.ByteList.add(ByteList.java:67)
  at org.basex.util.list.ByteList.add(ByteList.java:55)
  at org.basex.index.ft.FTBuilder.merge(FTBuilder.java:236)
  at org.basex.index.ft.FTBuilder.write(FTBuilder.java:147)
  at org.basex.index.ft.FTBuilder.build(FTBuilder.java:86)
  at org.basex.index.ft.FTBuilder.build(FTBuilder.java:1)
  at org.basex.data.DiskData.createIndex(DiskData.java:198)
  at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:100)
  at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:88)
  at org.basex.core.cmd.CreateDB$1.run(CreateDB.java:116)
  at org.basex.core.cmd.ACreate.update(ACreate.java:90)
  at org.basex.core.cmd.CreateDB.run(CreateDB.java:113)
  at org.basex.core.Command.run(Command.java:257)
  at org.basex.core.Command.execute(Command.java:93)
  at org.basex.server.ClientListener.run(ClientListener.java:140)
org.basex.core.BaseXException: Out of Main Memory. You can increase Java's heap size with the flag -Xmx<size>.
  at org.basex.core.Command.execute(Command.java:94)
  at org.basex.server.ClientListener.run(ClientListener.java:140)