Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can allocate on that system for BaseX) and I got OOM.
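For reference, a heap limit like that can be handed to the BaseX server in a couple of ways. Whether the bundled start scripts read a BASEX_JVM environment variable depends on the distribution, so treat that part as an assumption and otherwise edit the -Xmx value in bin/basexserver directly:

    # via the start script, assuming it honors BASEX_JVM
    export BASEX_JVM="-Xmx3100m"
    bin/basexserver

    # or by calling the JVM directly (adjust the jar path to your installation)
    java -Xmx3100m -cp basex.jar org.basex.BaseXServer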
On Sat, Oct 5, 2019 at 2:19 AM Christian Grün <christian.gr...@gmail.com> wrote:

> About option 1: How much memory have you been able to assign to the Java VM?
>
> On Sat, Oct 5, 2019 at 1:11 AM first name last name <randomcod...@gmail.com> wrote:
>
>> I had another look at the script I wrote and realized that it's not working as it's supposed to. Apparently the order of operations should be this:
>> - turn on all the types of indexes required
>> - create the db
>> - apply the parser settings and the filter settings
>> - add all the files to the db
>> - run "OPTIMIZE"
>>
>> If I don't do them in this order (specifically, with "OPTIMIZE" at the end), the resulting db lacks all indexes.
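For reference, a minimal command-script sketch of that order might look like the following; the database name and the HTML parser/filter values are placeholders rather than the contents of the actual script:

    SET TEXTINDEX true
    SET ATTRINDEX true
    SET FTINDEX true
    CREATE DB linuxquestions
    SET PARSER html
    SET CREATEFILTER *.html
    ADD /share/Public/archive/tech-sites/linuxquestions.org/
    OPTIMIZE

After the ADD calls, the value and full-text indexes are only brought up to date by the final OPTIMIZE, which would explain why leaving it out left the database without usable indexes.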
>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <randomcod...@gmail.com> wrote:
>>
>>> Hi Christian,
>>>
>>> About option 4:
>>> I agree with the options you laid out, and I am currently diving deeper into option 4 from your list. Regarding the partitioning strategy, I agree. I did, however, manage to partition the files to be imported into separate sets, with a constraint on the maximum partition size (on disk) and the maximum partition file count (the number of XML documents in each partition). The tool fpart [5] made this possible (I can imagine more sophisticated bin-packing methods, involving pre-computed node counts and other variables, could be achieved via glpk [6], but that might be too much work).
>>> Currently I am experimenting with a maximum partition size of 2.4GB and a maximum file count of 85k files; fpart split the file list into 11 partitions of ~33k files and ~2.4GB each.
>>> I wrote a script for this, called sharded-import.sh and attached here. I'm also noticing that the /dba/ BaseX web interface is no longer blocked while this script runs, unlike the previous import, where I ran
>>>     CREATE DB db_name /directory/
>>> so I can watch the progress or run queries before the big import finishes. The downside is that it's more verbose and prints a ton of lines like
>>>     ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/viewtopic_9_356613.html
>>>     Resource(s) added in 47.76 ms.
>>> along the way, and maybe that's slower than before.
>>>
>>> About option 1:
>>> Re: increasing memory, I am running these experiments on a low-memory, old network-attached storage box, a QNAP TS-451+ [7] [8], which I had to take apart with a screwdriver to add 2GB of RAM (so it now has 4GB), and I can't find any additional memory sticks around the house to take it up to 8GB (which is also the maximum it supports). If I want to buy 2 x 4GB sticks, the memory frequency has to match what the box supports, and I'm having trouble finding the exact model; Corsair says it has compatible sticks, but I'd have to wait weeks for them to ship to Bucharest, where I live.
>>> Buying an Intel NUC that goes up to 64GB of memory seems a bit too expensive at $1639 [9], but people on reddit [10] were discussing, some years back, this Supermicro server [11], which costs only $668 and can also take up to 64GB of memory.
>>> Basically I would buy something cheap that I can pack with a lot of RAM, but a hands-off approach would be best here, so something that comes pre-equipped with all the memory would be nice (it would spare me the trouble of buying the memory separately and making sure it matches the motherboard specs, etc.).
>>>
>>> About option 2:
>>> In fact, that's a great idea, but it would require me to write something that figures out the XPath patterns where the actual content sits. I wanted to look for an algorithm designed to do that and try to implement it, but I had no time. It would have to detect the repetitive boilerplate nodes and build XPaths for the remaining nodes, where the actual content sits. I think this is equivalent to computing the "web template" of a website, given all its pages. It would definitely decrease the size of the content that has to be indexed.
>>> By the way, I'm describing a more general procedure here, because it's not just this dataset that I want to import; I want to import heavy, large amounts of data :)
>>>
>>> These are my thoughts for now.
>>>
>>> [5] https://github.com/martymac/fpart
>>> [6] https://www.gnu.org/software/glpk/
>>> [7] https://www.amazon.com/dp/B015VNLGF8
>>> [8] https://www.qnap.com/en/product/ts-451+
>>> [9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
>>> [10] https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
>>> [11] https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E
>>>
>>> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>>
>>>> Exactly, it seems to be the final MERGE step during index creation that blows up your system. If you are restricted to 2 GB of main memory, this is what you could try next:
>>>>
>>>> 1. Did you already try to tweak the JVM memory limit via -Xmx? What's the largest value that you can assign on your system?
>>>>
>>>> 2. If you will query only specific values of your data sets, you can restrict your indexes to specific elements or attributes; this will reduce memory consumption (see [1] for details). If you observe that no indexes are utilized in your queries anyway, you can simply disable the text and attribute indexes, and memory usage will shrink even more.
>>>>
>>>> 3. Create your database on a more powerful system [2] and move it to your target machine (this only makes sense if there's no need for further updates).
>>>>
>>>> 4. Distribute your data across multiple databases. In some way, this is comparable to sharding; it cannot be automated, though, as the partitioning strategy depends on the characteristics of your XML input data (some people have huge standalone documents, others have millions of small documents, …).
>>>>
>>>> [1] http://docs.basex.org/wiki/Indexes
>>>> [2] A single CREATE call may be sufficient: CREATE DB database sample-data-for-basex-mailing-list-linuxquestions.org.tar.gz
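The attached sharded-import.sh is not reproduced in this thread, so the following is only a rough sketch of the approach under stated assumptions: fpart has written its partition lists to part.0, part.1, ..., one database is created per partition, and the commands are piped to the basex console:

    #!/bin/sh
    # One BaseX database per fpart partition list (part.0, part.1, ...).
    # Paths are assumed to contain no spaces.
    i=0
    for list in part.*; do
      {
        echo "SET TEXTINDEX true"
        echo "SET ATTRINDEX true"
        echo "SET FTINDEX true"
        echo "CREATE DB linuxquestions_$i"
        echo "SET PARSER html"
        echo "SET CREATEFILTER *.html"
        while read -r f; do
          echo "ADD $f"
        done < "$list"
        echo "OPTIMIZE"
      } | basex
      i=$((i+1))
    done

Each partition then gets its own, much smaller OPTIMIZE run, so no single index merge has to fit the whole 30GB collection into the heap.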
>>>> On Thu, Oct 3, 2019 at 8:53 AM first name last name <randomcod...@gmail.com> wrote:
>>>>
>>>>> I tried again, using SPLITSIZE = 12 in the .basex config file. The batch (console) script I used is attached as mass-import.xq. This time I didn't do the optimize or index creation post-import; instead, I did it as part of the import, similar to what is described in [4]. This time I got a different error: "org.basex.core.BaseXException: Out of Main Memory."
>>>>> So right now I'm a bit out of ideas. Would AUTOOPTIMIZE make any difference here?
>>>>>
>>>>> Thanks
>>>>>
>>>>> [4] http://docs.basex.org/wiki/Indexes#Performance
>>>>>
>>>>> On Wed, Oct 2, 2019 at 11:06 AM first name last name <randomcod...@gmail.com> wrote:
>>>>>
>>>>>> Hey Christian,
>>>>>>
>>>>>> Thank you for your answer :)
>>>>>> I tried setting SPLITSIZE = 24000 in .basex, but I've seen the same OOM behavior. It looks like the memory consumption is moderate until it reaches about 30GB (the size of the db before optimize), and then memory consumption spikes and OOM occurs. Now I'm trying with SPLITSIZE = 1000 and will report back if I get OOM again. Regarding what you said, it might be that the merge step is where the OOM occurs (I wonder if there's any way to control how much memory is used inside the merge step).
>>>>>>
>>>>>> To quote the statistics page from the wiki: "Databases in BaseX are light-weight. If a database limit is reached, you can distribute your documents across multiple database instances and access all of them with a single XQuery expression."
>>>>>> This sounds a lot like sharding to me. I would probably be able to split the documents into chunks and upload them into databases with the same prefix but a varying suffix, which seems a lot like shards. By doing this I think I can avoid OOM, but if BaseX provides other, better, maybe native mechanisms for avoiding OOM, I would try them.
>>>>>>
>>>>>> Best regards,
>>>>>> Stefan
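As a sketch of that "single XQuery expression across multiple databases" idea, with the linuxquestions_ database prefix being an assumption carried over from the sharding sketch above:

    (: run one query across all shard databases; here, count the documents in each :)
    for $db in db:list()[starts-with(., 'linuxquestions_')]
    return $db || ': ' || count(db:open($db))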
>>>>>> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi first name,
>>>>>>>
>>>>>>> If you optimize your database, the indexes will be rebuilt. In this step, the builder tries to guess how much free memory is still available. If memory is exhausted, parts of the index will be split (i.e., partially written to disk) and merged in a final step. However, you can circumvent the heuristics by manually assigning a static split value; see [1] for more information. If you use the DBA, you'll need to assign this value in your .basex or web.xml file [2]. In order to find the best value for your setup, it may be easier to play around with the BaseX GUI.
>>>>>>>
>>>>>>> As you have already seen in our statistics, an XML document has various properties that may represent a limit for a single database. Accordingly, these properties make it difficult for the system to decide when memory will be exhausted during an import or index rebuild.
>>>>>>>
>>>>>>> In general, you'll get the best performance (and your memory consumption will be lower) if you create your database and specify the data to be imported in a single run. This is currently not possible via the DBA; use the GUI (Create Database) or console mode (CREATE DB command) instead.
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>> Christian
>>>>>>>
>>>>>>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>>>>>>> [2] http://docs.basex.org/wiki/Configuration
>>>>>>>
>>>>>>> On Mon, Sep 30, 2019 at 7:09 AM first name last name <randomcod...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Let's say there's a 30GB dataset [3] containing most threads/posts from [1]. After importing all of it, when I try to run /dba/db-optimize/ on it (which must have some corresponding command), I get the OOM error in the attached stack trace. I am using -Xmx2g, so BaseX is limited to 2GB of memory (the machine I'm running this on doesn't have a lot of memory).
>>>>>>>> I was looking at [2] for some estimates of peak memory usage for this "db-optimize" operation, but couldn't find any. It would actually be nice to know the peak memory usage because, for any database (including BaseX), a common task is server sizing: knowing what kind of server would be needed. In this case, it seems like 2GB of memory is enough to import 340k documents weighing in at 30GB total, but not enough to run "db-optimize".
>>>>>>>> Is there any info about peak memory usage on [2]? And are there guidelines for large-scale collection imports like the one I'm attempting?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> [1] https://www.linuxquestions.org/
>>>>>>>> [2] http://docs.basex.org/wiki/Statistics
>>>>>>>> [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
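Worth noting: the stack trace below comes from the full-text index builder (FTBuilder), so Christian's suggestion to restrict the indexes to specific elements, or to disable the ones that are never queried, applies directly. A sketch in command syntax, with placeholder element and attribute names that would have to match what the queries actually use:

    # restrict the value and full-text indexes to selected names (placeholders),
    # or disable the full-text index entirely with: SET FTINDEX false
    SET TEXTINCLUDE title
    SET ATTRINCLUDE id
    SET FTINDEX true
    SET FTINCLUDE post
    CREATE DB linuxquestions /share/Public/archive/tech-sites/linuxquestions.org/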
Error: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:3236)
  at org.basex.util.list.ByteList.add(ByteList.java:67)
  at org.basex.util.list.ByteList.add(ByteList.java:55)
  at org.basex.index.ft.FTBuilder.merge(FTBuilder.java:236)
  at org.basex.index.ft.FTBuilder.write(FTBuilder.java:147)
  at org.basex.index.ft.FTBuilder.build(FTBuilder.java:86)
  at org.basex.index.ft.FTBuilder.build(FTBuilder.java:1)
  at org.basex.data.DiskData.createIndex(DiskData.java:198)
  at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:100)
  at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:88)
  at org.basex.core.cmd.CreateDB$1.run(CreateDB.java:116)
  at org.basex.core.cmd.ACreate.update(ACreate.java:90)
  at org.basex.core.cmd.CreateDB.run(CreateDB.java:113)
  at org.basex.core.Command.run(Command.java:257)
  at org.basex.core.Command.execute(Command.java:93)
  at org.basex.server.ClientListener.run(ClientListener.java:140)
org.basex.core.BaseXException: Out of Main Memory. You can increase Java's heap size with the flag -Xmx<size>.
  at org.basex.core.Command.execute(Command.java:94)
  at org.basex.server.ClientListener.run(ClientListener.java:140)