Hi Christian,
Thanks very much for your quick reply and all the information and examples. I
have found that with my database, the first scoring example you provided takes
about 10 seconds, whereas the second one is nearly instantaneous (33 ms). I
have no idea why that is, but that second query is blowing my mind! Amazing! I
will experiment further.
Many thanks,
Greg
From: Christian Grün
Date: Sunday, February 18, 2024 at 1:51 PM
To: Murray, Gregory
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Slow full-text querying
Dear Greg,
In BaseX, it’s the text nodes that are indexed. Here are some queries that take
advantage of the full-text index:
db:get("theocom")[.//text() contains text 'apple']
db:get("theocom")//page[text() contains text 'apple']
db:get("theocom")//text()[. contains text 'apple']
...
If you check out the output of the GUI’s info panel, you’ll see whether the
full-text index is applied.
There are several ways to compute scores for documents; here are two variants:
let $db := 'theocom'
let $keywords := 'apple'
let $search := function($doc) { $doc//text() contains text { $keywords } }
for $doc in db:get($db)[$search(.)]
order by ft:score($search($doc)) descending
return db:path($doc)
let $keywords := "apple"
for $texts score $score in db:get("theocom")//text()[. contains text { $keywords }]
group by $uri := db:path($texts)
let $scores := sum($score)
order by $scores descending
return ($scores, $uri, $texts)
By evaluating specific score values for text nodes, you have more freedom to
decide how to interpret the scores. For example, you can rank scores of title
elements higher than those of paragraphs.
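To sketch that idea (assuming your documents have title elements alongside the page elements, and picking an arbitrary weight of 2 for title matches — both are assumptions to adapt to your own schema and preferences):

let $db := 'theocom'
let $keywords := 'apple'
for $doc in db:get($db)
let $title-score := sum(for $t score $s in $doc//title[. contains text { $keywords }] return $s)
let $page-score := sum(for $p score $s in $doc//page[. contains text { $keywords }] return $s)
let $total := 2 * $title-score + $page-score
where $total > 0
order by $total descending
return (db:path($doc), $total)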
I invite you to have a look at our documentation for more information and
examples [1,2].
Hope this helps,
Christian
[1] https://docs.basex.org/wiki/Full-Text
[2] https://docs.basex.org/wiki/Indexes#Full-Text_Index
On Sat, Feb 17, 2024 at 1:23 PM Murray, Gregory <gregory.mur...@ptsem.edu> wrote:
Hello,
I have a database with several thousand XML documents, although I have tens of
thousands I’d like to add. Each XML document contains a book — both the
bibliographic metadata such as title, author, etc. (each in its own element)
and the complete OCR text of all pages of the book. Each page of text from each
book is in a <page> element with a single text node containing all the words
from that page of the book, resulting in large blocks of text.
I’ve added a full-text index and optimized it. I am finding that full-text
searching is very slow. The query shown below consistently takes about 20
seconds to run, even though there are only about 7400 documents. Obviously
that’s far too slow to use the query in a web application, where the user
expects a quick response.
My first thought is whether the query is actually using the full-text index. Is
there a way for me to determine that?
I’m also wondering if my query is crude or is missing something. I don’t need
the text nodes containing the search words; I only need to know which documents
contain the words.
let $keywords := "apple"
for $doc in collection("theocom")
let score $score := $doc contains text {$keywords}
order by $score descending
where $score > 0
return concat($score, " ", base-uri($doc))
As you can see, I’m searching all text in the entirety of each book. Is there a
way to rewrite such a query for faster performance?
Also, I’m wondering whether the structure of the XML documents is such that the
documents themselves need to have smaller blocks of text. For example, if the
OCR text were contained in smaller elements, each holding only a single line of
text as printed in the original physical book, would full-text searching be
noticeably faster, since each text node would be much smaller?
Thanks,
Greg