Thanks, Christian. Distributing documents across many databases sounds fine, as 
long as XPath expressions and full-text searching remain reasonably efficient. 
In the documentation, the example of addressing multiple databases uses a loop:

for $i in 1 to 100
return db:get('books' || $i)//book/title

Is that the preferred technique?

Also, is it possible to perform searches in the same manner without interfering 
with relevance scores?
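For instance, I’d like to be able to write something like the following and
still get meaningful relevance ranking across all of the databases (the search
term and element names are just placeholders):

for $i in 1 to 100
for $page score $s in db:get('books' || $i)//page[. contains text 'grace']
order by $s descending
return $page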

Thanks,
Greg

From: Christian Grün <christian.gr...@gmail.com>
Date: Friday, March 15, 2024 at 11:51 AM
To: Murray, Gregory <gregory.mur...@ptsem.edu>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Out of Main Memory
Hi Greg,

I would have guessed that 12 GB is enough for 4.7 GB of input, but it sometimes 
depends on the data. If you like, you can share a single typical document with 
us, and we can have a look at it. 61 GB will be too large for a complete 
full-text index, though. However, it’s always possible to distribute documents 
across multiple databases and access them with a single query [1].
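For example, the input could be split into one database per batch of source
files, roughly like this (the names, paths, and batch count are only an
illustration):

(: create one full-text indexed database per directory of input documents :)
for $batch in 1 to 16
return db:create(
  'books' || $batch,
  '/data/xml/batch' || $batch,
  'batch' || $batch,
  map { 'ftindex': true() }
)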

The full-text index is not incremental (in contrast to the other index 
structures), which means it must be re-created after updates. However, it’s 
possible to re-index updated database instances and query fully indexed 
databases at the same time.
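For instance, a single updated database can be re-indexed on its own while the
other databases remain available for querying, roughly (the database name is
again just an example):

(: rebuild all index structures, including the full-text index, for one database :)
db:optimize('books3', true(), map { 'ftindex': true() })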

Hope this helps,
Christian

[1] https://docs.basex.org/wiki/Databases


On Thu, Mar 14, 2024 at 10:58 PM Murray, Gregory 
<gregory.mur...@ptsem.edu> wrote:
Thanks, Christian. I don’t think selective indexing is applicable in my use 
case, because I need to perform full-text searches on the entirety of each 
document. Each XML document represents a physical book that was digitized, and 
the structure of each document is essentially a header with metadata and a body 
with the OCR text of the book. The OCR text is split into pages, where one 
<page> element contains all the words from one corresponding printed page from 
the physical book. Obviously the number of words in each <page> varies widely 
based on the physical dimensions of the book and the typeface.
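Schematically, each document looks something like this (apart from <page>, the
element names are simplified placeholders):

<book>
  <header>
    <title>...</title>
    <author>...</author>
  </header>
  <body>
    <page n="1">OCR text of the first printed page ...</page>
    <page n="2">OCR text of the second printed page ...</page>
  </body>
</book>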

So far, I have loaded 12,331 documents, containing a total of 2,196,771 pages. 
The total size of those XML documents on disk is 4.7GB. But that is only a 
fraction of the total number of documents I want to load into BaseX. The total 
number is more like 160,000 documents. Assuming that the documents I’ve loaded 
so far are a representative sample, and I believe that’s true, then the total 
size of the XML documents on disk, prior to loading them into BaseX, would be 
about 4.7GB * 13 = 61.1GB.

Normally the OCR text, once loaded, almost never changes. But the metadata 
fields do change as corrections are made; a typical edit is sketched below. 
Also, we routinely add more XML documents as we digitize more books over time. 
Therefore updates and additions are commonplace, and keeping the indexes up to 
date is important so that full-text searches stay performant. I’m wondering if 
there are techniques for optimizing such quantities of text.
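The kind of metadata correction I mean is a small, targeted update along these
lines (the element names, database name, and values are purely illustrative):

(: fix a single metadata field in one document :)
replace value of node
  db:get('books1')/book[header/identifier = 'b12345']/header/title
with 'Corrected Title'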

Thanks,
Greg

From: Christian Grün <christian.gr...@gmail.com>
Date: Thursday, March 14, 2024 at 8:48 AM
To: Murray, Gregory <gregory.mur...@ptsem.edu>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Out of Main Memory
Hi Greg,

A quick reply: If only parts of your documents are relevant for full-text 
queries, you can restrict the selection with the FTINCLUDE option (see [1] for 
more information).
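For example, if only certain elements mattered for full-text search, the
restriction could be passed when the database is created, roughly like this
(element name, database name, and path are just placeholders):

(: hypothetical sketch: restrict the full-text index to title elements :)
db:create(
  'mydb',
  '/path/to/documents',
  'documents',
  map { 'ftindex': true(), 'ftinclude': 'title' }
)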

How large is the total size of your input documents?

Best,
Christian

[1] https://docs.basex.org/wiki/Indexes#Selective_Indexing



On Tue, Mar 12, 2024 at 8:34 PM Murray, Gregory 
<gregory.mur...@ptsem.edu> wrote:
Hello,

I’m working with a database that has a full-text index. I have found that if I 
iteratively add XML documents, then optimize, add more documents, optimize 
again, and so on, eventually the “optimize” command will fail with “Out of Main 
Memory.” I edited the basex startup script to change the memory allocation from 
-Xmx2g to -Xmx12g. My computer has 16 GB of memory, but of course the OS uses 
up some of it. I have found that if I exit memory-hungry programs (web browser, 
Oxygen), start basex, and then run the “optimize” command, I still get “Out of 
Main Memory.” I’m wondering if there are any known workarounds or strategies 
for this situation. If I understand the documentation about indexes correctly, 
index data is periodically written to disk during optimization. Does this mean 
that running optimize again will pick up where the previous attempt left off, 
such that running optimize repeatedly will eventually succeed?
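For reference, each load cycle is essentially these two steps, in XQuery terms
(the paths are illustrative, and in practice I run the equivalent commands):

(: step 1, as its own query: add the next batch of documents :)
db:add('books', '/data/xml/next-batch/')

(: step 2, as a separate query: rebuild the indexes; this is where memory runs out :)
db:optimize('books')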

Thanks,
Greg


Gregory Murray
Director of Digital Initiatives
Wright Library
Princeton Theological Seminary

