Regarding selective full-text indexing, I just tried

    XQUERY db:optimize("linuxquestions.org-selective", true(), map { 'ftindex': true(), 'ftinclude': 'div table td a' })

and got an OOM on that as well; the exact stack trace is attached below.

I will open a separate thread about migrating the data from the BaseX shards to PostgreSQL (for the purpose of full-text indexing).
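For comparison, the single-run approach that Christian recommends further down in the thread would build the restricted full-text index while the database is created, rather than in a separate optimize step afterwards. A minimal command-script sketch, reusing the element list from the query above; the database name and source directory are the ones mentioned in this thread, and whether the build fits into the available heap still depends on the data:

    # enable the full-text index, restricted to the elements that will be queried,
    # before the database is created
    SET FTINDEX true
    SET FTINCLUDE div table td a
    # build the database and its indexes in a single run
    CREATE DB linuxquestions.org-selective /share/Public/archive/tech-sites/linuxquestions.org/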
On Sun, Oct 6, 2019 at 10:19 AM Christian Grün <christian.gr...@gmail.com> wrote:

> The current full-text index builder provides a similar outsourcing mechanism to that of the index builder for the default index structures, but the metadata structures are kept in main memory, and they are more bulky. There are definitely ways to tackle this technically; it hasn't been of high priority so far, but this may change.

> Please note that you won't create an index over your whole data set in an RDBMS. Instead, you'll usually create it for the specific fields that you will query later on. It's a convenience feature in BaseX that you can build an index for all of your data. For large full-text corpora, however, it's advisable in most cases to restrict indexing to the relevant XML elements.

> On Sat, Oct 5, 2019 at 11:28 PM first name last name <randomcod...@gmail.com> wrote:

>> Attached is a more complete output of ./bin/basexhttp. Judging from this output, it seems that everything was OK except for the full-text index.

>> I now realize that I have another question about full-text indexes. It seems like the full-text index here depends on the amount of memory available (in other words, the more data to be indexed, the more RAM is required).

>> I was using a certain popular RDBMS for full-text indexing, and I never ran into problems with it running out of memory when building such indexes. I think their model uses a fixed in-memory buffer, keeps multiple files on disk where it stores intermediate data, and then assembles the results in memory, always within the constraint of using only as much memory as it was allowed to use. The relevant topics are probably "external memory algorithms" or "full-text search using secondary storage". I'm not an expert in this field, but my question is whether this kind of thing is something BaseX is looking to handle in the future?

>> Thanks,
>> Stefan

>> On Sat, Oct 5, 2019 at 11:08 PM Christian Grün <christian.gr...@gmail.com> wrote:

>>> The stack trace indicates that you enabled the full-text index as well. For this index, you definitely need more memory than is available on your system.

>>> So I assume you didn't encounter trouble with the default index structures?

>>> On Sat, Oct 5, 2019 at 8:52 PM first name last name <randomcod...@gmail.com> wrote:

>>>> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can allocate to BaseX on that system), and I got an OOM.

>>>> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün <christian.gr...@gmail.com> wrote:

>>>>> About option 1: How much memory have you been able to assign to the Java VM?

>>>>> On Sat, Oct 5, 2019 at 1:11 AM first name last name <randomcod...@gmail.com> wrote:

>>>>>> I had another look at the script I wrote and realized that it's not working as it's supposed to. Apparently the order of operations should be this:
>>>>>> - turn on all the required index types
>>>>>> - create the db
>>>>>> - apply the parser settings and the filter settings
>>>>>> - add all the files to the db
>>>>>> - run "OPTIMIZE"

>>>>>> If I don't do them in this order (specifically, with "OPTIMIZE" at the end), the resulting db lacks all indexes.
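As a command script, that order might look roughly like this (a minimal sketch; the database name, parser and filter values are illustrative and would need to match the actual data):

    # 1. turn on the required index types before the database is created
    SET TEXTINDEX true
    SET ATTRINDEX true
    SET FTINDEX true
    # 2. create the (initially empty) database
    CREATE DB linuxquestions.org
    # 3. parser and filter settings used by the subsequent ADD calls
    SET PARSER html
    SET CREATEFILTER *.html
    # 4. add all the files
    ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/
    # 5. finish with OPTIMIZE so that the index structures are actually built
    OPTIMIZE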
>>>>>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <randomcod...@gmail.com> wrote:

>>>>>>> Hi Christian,

>>>>>>> About option 4: I agree with the options you laid out, and I am currently diving deeper into option 4 from your list. Regarding the partitioning strategy, I agree. I did manage to partition the files to be imported into separate sets, with a constraint on maximum partition size (on disk) and maximum partition file count (the number of XML documents in each partition). The tool fpart [5] made this possible (I can imagine that more sophisticated bin-packing methods, involving pre-computed node counts and other variables, could be achieved via glpk [6], but that might be too much work).

>>>>>>> Currently I am experimenting with a maximum partition size of 2.4 GB and a maximum file count of 85k files; fpart seems to have split the file list into 11 partitions of about 33k files each, with each partition weighing in at roughly 2.4 GB.

>>>>>>> I wrote a script for this, called sharded-import.sh and attached here. I'm also noticing that the /dba/ BaseX web interface is no longer blocked while this script runs, as opposed to the previous import where I ran
>>>>>>> CREATE DB db_name /directory/
>>>>>>> so I can now see the progress and run queries before the big import finishes. The downside is that it's more verbose and prints a ton of lines like
>>>>>>> > ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/viewtopic_9_356613.html
>>>>>>> Resource(s) added in 47.76 ms.
>>>>>>> along the way, and maybe it's slower than before.
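One way such a per-partition import could be expressed as a BaseX command script (a sketch only, not the attached sharded-import.sh; the shard naming and the index/OPTIMIZE steps are assumptions, and there would be one ADD per file in the partition's fpart list):

    # one database per fpart partition; repeat for shard02 ... shard11
    SET FTINDEX true
    CREATE DB linuxquestions.org-shard01
    ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/viewtopic_9_356613.html
    # ... one ADD per remaining file in this partition ...
    OPTIMIZE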
>>>>>>> About option 1: Regarding increasing the memory, I am running these experiments on a low-memory, old network-attached storage device, a QNAP TS-451+ [7] [8], which I had to take apart with a screwdriver to add 2 GB of RAM (it now has 4 GB), and I can't find any additional memory sticks around the house to take it up to 8 GB (which is also the maximum it supports). If I want to buy 2 x 4 GB sticks, the frequency has to match what the device supports, and I'm having trouble finding the exact ones; Corsair says it has memory sticks that would work, but I'd have to wait weeks for them to ship to Bucharest, which is where I live. Buying an Intel NUC that goes up to 64 GB of memory seems a bit too expensive at $1639 [9], but people on reddit [10] were discussing, some years back, this Supermicro server [11], which is only $668 and can take up to 64 GB of memory. Basically I would buy something cheap that I can pack with a lot of RAM, but a hands-off approach would be best here, so something that comes pre-equipped with all the memory would be nice (it would spare me the trouble of buying the memory separately and making sure it matches the motherboard specs, etc.).

>>>>>>> About option 2: That's a great idea, but it would require me to write something that figures out the XPath patterns where the actual content sits. I wanted to look for an algorithm designed to do that and try to implement it, but I had no time. It would have to detect the repetitive, bloated nodes and build XPaths for the remaining nodes, where the actual content sits. I think this is equivalent to computing the "web template" of a website, given all its pages. It would definitely decrease the amount of content that has to be indexed. By the way, I'm describing a more general procedure here, because it's not just this dataset that I want to import; I want to import heavy, large amounts of data :)

>>>>>>> These are my thoughts for now.

>>>>>>> [5] https://github.com/martymac/fpart
>>>>>>> [6] https://www.gnu.org/software/glpk/
>>>>>>> [7] https://www.amazon.com/dp/B015VNLGF8
>>>>>>> [8] https://www.qnap.com/en/product/ts-451+
>>>>>>> [9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
>>>>>>> [10] https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
>>>>>>> [11] https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E

>>>>>>> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün <christian.gr...@gmail.com> wrote:

>>>>>>>> Exactly, it seems to be the final MERGE step during index creation that blows up your system. If you are restricted to 2 GB of main memory, this is what you could try next:

>>>>>>>> 1. Did you already try to tweak the JVM memory limit via -Xmx? What's the largest value that you can assign on your system?

>>>>>>>> 2. If you will query only specific values of your data sets, you can restrict your indexes to specific elements or attributes; this will reduce memory consumption (see [1] for details). If you observe that no indexes are utilized in your queries anyway, you can simply disable the text and attribute indexes, and memory usage will shrink even more.

>>>>>>>> 3. Create your database on a more powerful system [2] and move it to your target machine (this only makes sense if there's no need for further updates).

>>>>>>>> 4. Distribute your data across multiple databases. In some way, this is comparable to sharding; it cannot be automated, though, as the partitioning strategy depends on the characteristics of your XML input data (some people have huge standalone documents, others have millions of small documents, …).
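Option 4 is what the sharded-import.sh approach above heads towards; once the data is spread over several databases, a single XQuery expression can still address all of them (as the statistics page quoted further down also points out). A small sketch, assuming the illustrative shard naming scheme from above and an arbitrary search term:

    (: query every shard database with a single expression;
       the shard name prefix and the search term are only examples :)
    for $db in db:list()[starts-with(., 'linuxquestions.org-shard')]
    for $hit in db:open($db)//td[. contains text 'kernel panic']
    return base-uri($hit)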
>>>>>>>> [1] http://docs.basex.org/wiki/Indexes
>>>>>>>> [2] A single CREATE call may be sufficient: CREATE DB database sample-data-for-basex-mailing-list-linuxquestions.org.tar.gz

>>>>>>>> On Thu, Oct 3, 2019 at 8:53 AM first name last name <randomcod...@gmail.com> wrote:

>>>>>>>>> I tried again, using SPLITSIZE = 12 in the .basex config file. The batch (console) script I used is attached as mass-import.xq. This time I didn't do the optimize or index creation post-import; instead, I did it as part of the import, similar to what is described in [4]. This time I got a different error: "org.basex.core.BaseXException: Out of Main Memory." So right now I'm a bit out of ideas. Would AUTOOPTIMIZE make any difference here?

>>>>>>>>> Thanks

>>>>>>>>> [4] http://docs.basex.org/wiki/Indexes#Performance

>>>>>>>>> On Wed, Oct 2, 2019 at 11:06 AM first name last name <randomcod...@gmail.com> wrote:

>>>>>>>>>> Hey Christian,

>>>>>>>>>> Thank you for your answer :) I tried setting SPLITSIZE = 24000 in .basex, but I've seen the same OOM behavior. It looks like memory consumption is moderate until the database reaches about 30 GB (its size before the optimize), and then memory consumption spikes and the OOM occurs. Now I'm trying with SPLITSIZE = 1000 and will report back if I get an OOM again. Regarding what you said, it might be that the merge step is where the OOM occurs (I wonder if there's any way to control how much memory is used inside the merge step).

>>>>>>>>>> To quote the statistics page from the wiki: "Databases in BaseX are light-weight. If a database limit is reached, you can distribute your documents across multiple database instances and access all of them with a single XQuery expression." This sounds like sharding to me. I could probably split the documents into chunks and upload them into databases with the same prefix but varying suffixes, which looks a lot like shards. By doing this I think I can avoid the OOM, but if BaseX provides other, better, maybe native mechanisms for avoiding OOM, I would try them.

>>>>>>>>>> Best regards,
>>>>>>>>>> Stefan

>>>>>>>>>> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün <christian.gr...@gmail.com> wrote:

>>>>>>>>>>> Hi first name,

>>>>>>>>>>> If you optimize your database, the indexes will be rebuilt. In this step, the builder tries to guess how much free memory is still available. If memory is exhausted, parts of the index will be split (i.e., partially written to disk) and merged in a final step. However, you can circumvent the heuristics by manually assigning a static split value; see [1] for more information. If you use the DBA, you'll need to assign this value in your .basex or web.xml file [2]. In order to find the best value for your setup, it may be easier to play around with the BaseX GUI.
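In practice, the static split value can either go into .basex (as the "SPLITSIZE = …" lines tried above) or be set in a command script before the optimize run. A minimal sketch; the value and the database name are just the ones being experimented with in this thread, not a recommendation:

    # a static split value circumvents the memory heuristic during index rebuilds;
    # the same value can be set globally via "SPLITSIZE = 1000" in .basex (or web.xml for the DBA)
    SET SPLITSIZE 1000
    OPEN linuxquestions.org
    OPTIMIZE ALL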
>>>>>>>>>>> As you have already seen in our statistics, an XML document has various properties that may represent a limit for a single database. Accordingly, these properties make it difficult for the system to predict when memory will be exhausted during an import or index rebuild.

>>>>>>>>>>> In general, you'll get the best performance (and your memory consumption will be lower) if you create your database and specify the data to be imported in a single run. This is currently not possible via the DBA; use the GUI (Create Database) or console mode (the CREATE DB command) instead.

>>>>>>>>>>> Hope this helps,
>>>>>>>>>>> Christian

>>>>>>>>>>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>>>>>>>>>>> [2] http://docs.basex.org/wiki/Configuration

>>>>>>>>>>> On Mon, Sep 30, 2019 at 7:09 AM first name last name <randomcod...@gmail.com> wrote:

>>>>>>>>>>>> Hi,

>>>>>>>>>>>> Let's say there's a 30 GB dataset [3] containing most threads/posts from [1]. After importing all of it, when I try to run /dba/db-optimize/ on it (which must have some corresponding command), I get the OOM error in the attached stack trace. I am using -Xmx2g, so BaseX is limited to 2 GB of memory (the machine I'm running this on doesn't have a lot of memory).

>>>>>>>>>>>> I was looking at [2] for some estimates of peak memory usage for this "db-optimize" operation, but couldn't find any. It would actually be nice to know the peak memory usage because, of course, for any database (including BaseX) a common task is server sizing, i.e., knowing what kind of server would be needed.

>>>>>>>>>>>> In this case, it seems like 2 GB of memory is enough to import 340k documents weighing in at 30 GB total, but it's not enough to run "db-optimize". Is there any info about peak memory usage on [2]? And are there guidelines for large-scale collection imports like the one I'm trying to do?

>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Stefan

>>>>>>>>>>>> [1] https://www.linuxquestions.org/
>>>>>>>>>>>> [2] http://docs.basex.org/wiki/Statistics
>>>>>>>>>>>> [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
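The statistics page doesn't list peak memory figures, but the properties of a database that has already been created (size on disk, node and document counts, which index structures exist) can be inspected directly, which at least helps with sizing the next attempt. A minimal sketch, with an illustrative database name:

    # show the properties of an existing database: size, node/document counts, index structures
    OPEN linuxquestions.org
    INFO DB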
java.io.FileNotFoundException: /share/CACHEDEV1_DATA/Public/builds/basex/data/linuxquestions.org-selective_1337525745/inf.basex (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at org.basex.io.IOFile.outputStream(IOFile.java:158)
    at org.basex.io.out.DataOutput.<init>(DataOutput.java:47)
    at org.basex.io.out.DataOutput.<init>(DataOutput.java:36)
    at org.basex.data.DiskData.write(DiskData.java:137)
    at org.basex.data.DiskData.close(DiskData.java:160)
    at org.basex.core.cmd.OptimizeAll.optimizeAll(OptimizeAll.java:145)
    at org.basex.query.up.primitives.db.DBOptimize.apply(DBOptimize.java:124)
    at org.basex.query.up.DataUpdates.apply(DataUpdates.java:175)
    at org.basex.query.up.ContextModifier.apply(ContextModifier.java:120)
    at org.basex.query.up.Updates.apply(Updates.java:178)
    at org.basex.query.QueryContext.update(QueryContext.java:701)
    at org.basex.query.QueryContext.iter(QueryContext.java:332)
    at org.basex.query.QueryProcessor.iter(QueryProcessor.java:90)
    at org.basex.core.cmd.AQuery.query(AQuery.java:107)
    at org.basex.core.cmd.XQuery.run(XQuery.java:22)
    at org.basex.core.Command.run(Command.java:257)
    at org.basex.core.Command.execute(Command.java:93)
    at org.basex.server.ClientListener.run(ClientListener.java:140)