Hi Christian,

Thanks very much for your quick reply and all the information and examples. I have found that the first scoring example you provided takes about 10 seconds with my database, whereas the second one is nearly instantaneous (33 ms). I have no idea why that is the case, but that second query is blowing my mind! Amazing! I will experiment further.
Many thanks,
Greg

From: Christian Grün <christian.gr...@gmail.com>
Date: Sunday, February 18, 2024 at 1:51 PM
To: Murray, Gregory <gregory.mur...@ptsem.edu>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Slow full-text querying

Dear Greg,

In BaseX, it's the text nodes that are indexed. Here are some queries that take advantage of the full-text index:

  db:get("theocom")[.//text() contains text 'apple']
  db:get("theocom")//page[text() contains text 'apple']
  db:get("theocom")//text()[. contains text 'apple']
  ...

If you check the output of the GUI's Info panel, you'll see whether the full-text index is applied.

There are several ways to compute scores for documents; here are two variants:

  let $db := 'theocom'
  let $keywords := 'apple'
  let $search := function($doc) { $doc//text() contains text { $keywords } }
  for $doc in db:get($db)[$search(.)]
  order by ft:score($search($doc)) descending
  return db:path($doc)

  let $keywords := "apple"
  for $texts score $score in db:get("theocom")//text()[. contains text { $keywords }]
  group by $uri := db:path($texts)
  let $scores := sum($score)
  order by $scores descending
  return ($scores, $uri, $texts)

By evaluating specific score values for text nodes, you have more freedom to decide how to interpret the scores. For example, you can rank scores of title elements higher than those of paragraphs.

I invite you to have a look at our documentation for more information and examples [1,2].

Hope this helps,
Christian

[1] https://docs.basex.org/wiki/Full-Text
[2] https://docs.basex.org/wiki/Indexes#Full-Text_Index

On Sat, Feb 17, 2024 at 1:23 PM Murray, Gregory <gregory.mur...@ptsem.edu> wrote:

Hello,

I have a database with several thousand XML documents, although I have tens of thousands I'd like to add. Each XML document contains a book: both the bibliographic metadata such as title, author, etc.
(each in its own element) and the complete OCR text of all pages of the book. Each page of text is in a <page> element with a single text node containing all the words from that page, resulting in large blocks of text. I've added a full-text index and optimized the database.

I am finding that full-text searching is very slow. The query shown below consistently takes about 20 seconds to run, even though there are only about 7400 documents. Obviously that's far too slow to use in a web application, where the user expects a quick response.

My first thought is to wonder whether the query is actually using the full-text index. Is there a way for me to determine that? I'm also wondering if my query is crude or missing something. I don't need the text nodes containing the search words; I only need to know which documents contain them.

  let $keywords := "apple"
  for $doc in collection("theocom")
  let score $score := $doc contains text { $keywords }
  order by $score descending
  where $score > 0
  return concat($score, " ", base-uri($doc))

As you can see, I'm searching all the text in the entirety of each book. Is there a way to rewrite such a query for faster performance? I'm also wondering whether the structure of the XML documents is such that they need smaller blocks of text. For example, if the OCR text were contained in <line> elements, each holding only a single line of text as printed in the original physical book, would full-text searching be noticeably faster, since each text node would be much smaller?

Thanks,
Greg
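For readers following the thread: Greg's stated need ("I only need to know which documents contain the words") can also be met by querying the index directly with BaseX's ft:search function. This is a minimal sketch, not from the thread itself; it assumes the database name 'theocom' and keyword from the discussion, and behavior should be verified against the documentation for your BaseX version:

  (: Sketch: answer "which documents contain the keywords?" without
     returning the matching text nodes themselves. ft:search reads
     hits from the full-text index; db:path maps each matching text
     node back to its containing document. :)
  let $keywords := 'apple'
  return distinct-values(
    ft:search('theocom', $keywords) ! db:path(.)
  )

Because only distinct document paths are returned, the large OCR text nodes never appear in the result, which fits a web-application use case where only the list of matching books is needed.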