Hi Christian,

Thanks very much for your quick reply and all the information and examples. I 
have found that the first scoring example you provided takes about 10 seconds 
against my database, whereas the second one is nearly instantaneous (33 ms). I 
have no idea why that is, but that second query is blowing my mind! Amazing! I 
will experiment further.

Many thanks,
Greg

From: Christian Grün <christian.gr...@gmail.com>
Date: Sunday, February 18, 2024 at 1:51 PM
To: Murray, Gregory <gregory.mur...@ptsem.edu>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Slow full-text querying
Dear Greg,

In BaseX, it’s the text nodes that are indexed. Here are some queries that take 
advantage of the full-text index:

db:get("theocom")[.//text() contains text 'apple']
db:get("theocom")//page[text() contains text 'apple']
db:get("theocom")//text()[. contains text 'apple']
...

If you check out the output of the GUI’s info panel, you’ll see whether the 
full-text index is applied.
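
If you prefer checking from a query whether the index exists at all (a separate 
question from whether a particular query actually uses it), db:info returns the 
metadata of a database. A minimal sketch, assuming the output lists the index 
flags as I remember it:

(: database metadata for 'theocom'; look for the full-text index flag
   in the returned element to confirm the index has been created :)
db:info("theocom")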

There are several ways to compute scores for documents; here are two variants:

let $db := 'theocom'
let $keywords := 'apple'
let $search := function($doc) { $doc//text() contains text { $keywords } }
for $doc in db:get($db)[$search(.)]
order by ft:score($search($doc)) descending
return db:path($doc)

let $keywords := "apple"
for $texts score $score in db:get("theocom")//text()[. contains text { $keywords }]
group by $uri := db:path($texts)
let $scores := sum($score)
order by $scores descending
return ($scores, $uri, $texts)

By evaluating specific score values for text nodes, you have more freedom to 
decide how to interpret the scores. For example, you can rank scores of title 
elements higher than those of paragraphs.
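
A rough sketch of that idea, building on the second variant above (the <title> 
element name and the factor 3 are placeholders; adjust them to your data):

let $keywords := "apple"
for $texts score $score in db:get("theocom")//text()[. contains text { $keywords }]
(: hypothetical weighting: hits inside title elements count three times as much :)
let $weighted := if ($texts/parent::title) then 3 * $score else $score
group by $uri := db:path($texts)
order by sum($weighted) descending
return $uri

Any test on $texts and any factor would work here; since the full-text predicate 
itself is unchanged, the index should still be applied.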

I invite you to have a look at our documentation for more information and 
examples [1,2].

Hope this helps,
Christian

[1] https://docs.basex.org/wiki/Full-Text
[2] https://docs.basex.org/wiki/Indexes#Full-Text_Index



On Sat, Feb 17, 2024 at 1:23 PM Murray, Gregory 
<gregory.mur...@ptsem.edu> wrote:
Hello,

I have a database with several thousand XML documents, although I have tens of 
thousands I’d like to add. Each XML document contains a book — both the 
bibliographic metadata such as title, author, etc. (each in its own element) 
and the complete OCR text of all pages of the book. Each page of text from each 
book is in a <page> element with a single text node containing all words from 
that page in the book, resulting in large blocks of text.

I’ve added a full-text index and optimized it. I am finding that full-text 
searching is very slow. The query shown below consistently takes about 20 
seconds to run, even though there are only about 7400 documents. Obviously 
that’s far too slow to use the query in a web application, where the user 
expects a quick response.

My first thought is whether the query is actually using the full-text index. Is 
there a way for me to determine that?

I’m also wondering if my query is crude or is missing something. I don’t need 
the text nodes containing the search words; I only need to know which documents 
contain the words.

let $keywords := "apple"
for $doc in collection("theocom")
let score $score := $doc contains text {$keywords}
order by $score descending
where $score > 0
return concat($score, " ", base-uri($doc))

As you can see, I’m searching all text in the entirety of each book. Is there a 
way to rewrite such a query for faster performance?

Also, I’m wondering if the structure of the XML documents is such that the 
documents themselves need to have smaller blocks of text. For example, if the 
OCR text were contained in <line> elements, each containing only a single line 
of text, as printed in the original physical book, would full-text searching be 
noticeably faster, since each text node is much smaller?

Thanks,
Greg
