Re: [basex-talk] Fulltext Distancing

Christian Grün Wed, 16 Sep 2020 05:01:52 -0700

Hi Adam,

Thanks for writing to the list.

After having given you a quick reply in private, I have double-checked
your new use cases, and once again I realized that it’s the complex
specification rules that lead to behavior that’s difficult to grasp.
I’ll try to make it short:

This query returns true:

'A x B' contains text { 'A', 'B' }
all distance at most 1 words

The following query returns false because 'A B' is treated as a single
search term:

'A x B' contains text 'A B'
all distance at most 1 words

The following query returns true. It’s actually equivalent to the
first query: Due to “all words”, the single string will be tokenized
into independent search terms.

'A x B' contains text 'A B'
all words distance at most 1 words

Things get freaky with the next use case:

'A x B x x B' contains text { 'A', 'B' }
all distance at most 1 words

The query creates three string matches: one for “A” and two for “B”. The
specification states: “When a distance selection applies a distance
condition to more than two matches, the distance condition is required
to hold on each successive pair of matches.” [1]. In our case, the
rule does not hold on the last pair (the distance between “B” and “B”
is too large), so the query returns false.

There is one way out, and it’s the usage of 'ftand':

'A x B x x B' contains text 'A' ftand 'B'
all words distance at most 1 words

This query will return two “full-text matches”, each containing two
“string matches”, and the check will be successful if at least one
full-text match is successful. In this case, it’s the first full-text
match, which contains string-matches for “A” and “B”, which are at
most one word distant from each other.
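
For a side-by-side comparison, the queries discussed above can be
collected into one expression. This is only a sketch: it assumes a
processor with XQuery Full Text support such as BaseX, and the values in
the comments are the results described in this message, not
independently verified output.

```xquery
(: Each item pairs a label with the boolean result of one query
   from this message. The values in the trailing comments are the
   ones given in the explanation above. :)
(
  (: two independent search terms :)
  'two terms, all: ' ||
    ('A x B' contains text { 'A', 'B' }
       all distance at most 1 words),          (: described as true :)

  (: 'A B' is treated as a single search term :)
  'single term: ' ||
    ('A x B' contains text 'A B'
       all distance at most 1 words),          (: described as false :)

  (: three string matches; the last pair is too far apart :)
  'three matches: ' ||
    ('A x B x x B' contains text { 'A', 'B' }
       all distance at most 1 words),          (: described as failing :)

  (: ftand yields two full-text matches; one of them suffices :)
  'ftand: ' ||
    ('A x B x x B' contains text 'A' ftand 'B'
       all words distance at most 1 words)     (: described as true :)
)
```

Running the four cases together makes it easier to see that only the
way the search terms are combined changes between them.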

Confused? I guess so ;) The current implementation of ft:search does
not allow you to explicitly combine search strings with
ftand/ftor/ftnot. I’ll have some more thoughts on what we could do
here. Until then, feel free to try your luck with the static full-text
syntax.
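
As a rough illustration of the static syntax applied to the use case
quoted below, a distance search over the text nodes of a database might
look like the following. The database name 'xvue_textIndex', the search
terms and the limit of 5 words come from the quoted message; db:open is
the BaseX 9 function for opening a database. The query is an untested
sketch, not a confirmed solution.

```xquery
(: Assumption: 'xvue_textIndex' is the database from the quoted
   message and contains the text to be searched. :)
for $text in db:open('xvue_textIndex')//text()[
  (: both terms must occur, at most 5 words apart :)
  . contains text 'queer' ftand 'enterprises'
    distance at most 5 words
]
(: return the element that holds the matching text node :)
return $text/parent::*
```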

Cheers,
Christian

[1] https://www.w3.org/TR/xpath-full-text-10/#ftdistance

On Wed, Sep 16, 2020 at 12:13 PM Adam Law <adamjames...@gmail.com> wrote:
>
> (:
> Hello - I saw some discussions about full text indexing.  Dumbo here cannot 
> work out how distance matching works.
> Also is it somehow possible to traverse up and down a full-index word list 
> from a hit position rather than having to spend time say reversing strings.  
> Is this not possible due to how the word indexing works?  If so can I 
> preprocess 
> https://github.com/pierrec/lz4/blob/master/fuzz/corpus/Mark.Twain-Tom.Sawyer_long.txt
>  to break it into words.
> I was considering making something like DTSearch (but more flexible) before I 
> realised how difficult this is.
> Many thanks
> Dumbo
> :)
>
> import module namespace functx = 'http://www.functx.com';
> (:fulltext <x><content><![CDATA[ 
> https://github.com/pierrec/lz4/blob/master/fuzz/corpus/Mark.Twain-Tom.Sawyer_long.txt
>  ]]></content></x>:)
>
> (:$f - return all nodes containing queer and enterprises:)
> let $f := <x>{ft:search('xvue_textIndex',("enterprises", "queer"), map { 
> 'mode': 'all' })/parent::*  }</x>
>
> let $options1 := map { 'mode': 'all'}
> let $options2 := map { 'mode': 'all', "distance": map { "max": "5","unit": 
> "words"  }}
> let $options3 := map { 'mode': 'all words', "distance": map { "max": 
> "5","unit": "words" }}
> let $options4 := map { 'mode': 'all words', "distance": map { "max": 5, 
> "unit": "words" }}
> let $options := $options1 (:Why won't the others work:)
>
> (:$g - mark words queers and enterprises.  Can't get options:2,3,4 to work:)
> let $g := <y>{ft:mark( $f//*[ft:contains(text(), ('queer','enterprises'), 
> $options)], 'mark')}</y>
>
> (:Hopeful - with distancing will this result in <mark>queer 
> enterprises</mark>.  Otherwise I have to postprocess more:)
>
> (:Unsure about how to return words before and words after using fulltext.  
> Have to limit to characters:)
> (:Ideally, I would like to be able to specify words after and before:)
> let $charbefore := 30
> let $charafter := 30
>
> (:This takes a while because I am string joining large 
> preceding-sibling:nodes() (sometimes text() and sometimes marked/text()) to 
> return words in context:)(:Three seconds:)
> (:Is there a fulltext way of doing this that is faster eg traverse a word 
> list by match position:)
> let $h := for $w in $g//mark
>             return <a><preceedingWords>{
>               functx:reverse-string(substring(
>                 functx:reverse-string(string-join($w/preceding-sibling::node())),
>                 0, $charbefore))
>             }</preceedingWords><match>{$w/text()}</match><followingWords>{
>               substring(string-join($w/following-sibling::node()), 0, $charafter)
>             }</followingWords></a>
>
> return $h
>
> (:
> $h :=
> <a>
>   <preceedingWords>thought and talked,
> and what </preceedingWords>
>   <match>queer</match>
>   <followingWords> enterprises they sometimes e</followingWords>
> </a>
> <a>
>   <preceedingWords>t and talked,
> and what queer </preceedingWords>
>   <match>enterprises</match>
>   <followingWords> they sometimes engaged in.
>
> </followingWords>
> </a>
>
> Sorry about the length of this.
> :)
