Re: [basex-talk] Slow full-text querying

2024-02-18 Thread Christian Grün
Dear Greg,

In BaseX, it’s the text nodes that are indexed. Here are some queries that
take advantage of the full-text index:

db:get("theocom")[.//text() contains text 'apple']
db:get("theocom")//page[text() contains text 'apple']
db:get("theocom")//text()[. contains text 'apple']
...
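
Since you mentioned you only need to know which documents contain the words
(not the matching text nodes), a variant of the first query can return just
the database paths — a sketch assuming your database name and a BaseX
version that provides db:get and db:path:

for $doc in db:get("theocom")[.//text() contains text 'apple']
return db:path($doc)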

If you check out the output of the GUI’s info panel, you’ll see whether the
full-text index is applied.
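
Outside the GUI, the same query info can be printed on the command line.
This is a sketch assuming the standalone basex CLI is on your path; the
exact wording of the info output may differ between versions:

# -V prints detailed query info; look for a line indicating
# that the full-text index was applied
basex -V -q "db:get('theocom')//text()[. contains text 'apple']"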

There are several ways to compute scores for documents; here are two
variants:

let $db := 'theocom'
let $keywords := 'apple'
let $search := function($doc) { $doc//text() contains text { $keywords } }
for $doc in db:get($db)[$search(.)]
order by ft:score($search($doc)) descending
return db:path($doc)

let $keywords := "apple"
for $texts score $score in db:get("theocom")//text()[. contains text { $keywords }]
group by $uri := db:path($texts)
let $scores := sum($score)
order by $scores descending
return ($scores, $uri, $texts)

By evaluating specific score values for text nodes, you have more freedom
to decide how to interpret the scores. For example, you can rank scores of
title elements higher than those of paragraphs.
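
For example, the weighting could be sketched like this — assuming your
documents contain title and page elements (adjust the names and the
weighting factor to your data):

let $keywords := 'apple'
for $doc in db:get('theocom')[.//text() contains text { $keywords }]
let $title-score := sum(
  for $t score $s in $doc//title/text()[. contains text { $keywords }]
  return $s
)
let $page-score := sum(
  for $t score $s in $doc//page/text()[. contains text { $keywords }]
  return $s
)
(: count title hits twice as heavily as page hits; the factor is arbitrary :)
order by 2 * $title-score + $page-score descending
return db:path($doc)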

I invite you to have a look at our documentation for more information and
examples [1,2].

Hope this helps,
Christian

[1] https://docs.basex.org/wiki/Full-Text
[2] https://docs.basex.org/wiki/Indexes#Full-Text_Index



On Sat, Feb 17, 2024 at 1:23 PM Murray, Gregory wrote:

> Hello,
>
>
>
> I have a database with several thousand XML documents, although I have
> tens of thousands I’d like to add. Each XML document contains a book — both
> the bibliographic metadata such as title, author, etc. (each in its own
> element) and the complete OCR text of all pages of the book. Each page of
> text from each book is in a <page> element with a single text node
> containing all words from that page in the book, resulting in large blocks
> of text.
>
>
>
> I’ve added a full-text index and optimized it. I am finding that full-text
> searching is very slow. The query shown below consistently takes about 20
> seconds to run, even though there are only about 7400 documents. Obviously
> that’s far too slow to use the query in a web application, where the user
> expects a quick response.
>
>
>
> My first thought is whether the query is actually using the full-text
> index. Is there a way for me to determine that?
>
>
>
> I’m also wondering if my query is crude or is missing something. I don’t
> need the text nodes containing the search words; I only need to know which
> documents contain the words.
>
>
>
> let $keywords := "apple"
>
> for $doc in collection("theocom")
>
> let score $score := $doc contains text {$keywords}
>
> order by $score descending
>
> where $score > 0
>
> return concat($score, " ", base-uri($doc))
>
>
>
> As you can see, I’m searching all text in the entirety of each book. Is
> there a way to rewrite such a query for faster performance?
>
>
>
> Also, I’m wondering if the structure of the XML documents is such that the
> documents themselves need to have smaller blocks of text. For example, if
> the OCR text were contained in <line> elements, each containing only a
> single line of text, as printed in the original physical book, would
> full-text searching be noticeably faster, since each text node is much
> smaller?
>
>
>
> Thanks,
>
> Greg
>
>
>


Re: [basex-talk] Slow full-text querying

2024-02-18 Thread Murray, Gregory
Hi Christian,

Thanks very much for your quick reply and all the information and examples.
I have found that the first scoring example you provided takes about 10
seconds with my database, whereas the second one is nearly instantaneous
(33 ms). I have no idea why that is, but that second query is blowing my
mind! Amazing! I will experiment further.

Many thanks,
Greg
