Thanks, Christian. You are right about the tokenization of ampersands. However,
I still see unexpected behavior with the built-in stop words.
1. This works (using your clever stop word workaround, slightly modified with
string-join):
let $sw := map:merge(
  for $word in file:read-text-lines('stopwords.txt')
  return map { $word: true() }
)
let $t1 := 'Frontier Science & Technology Research Foundation, Inc.'
let $t2 := 'Frontier Science and Technology Research Foundation, Inc.'
let $q1 := string-join(ft:tokenize($t1)[not($sw(.))], ' ')
let $q2 := string-join(ft:tokenize($t2)[not($sw(.))], ' ')
where $q1 contains text { $q2 }
return <r> { <q1> { $q1 } </q1>, <q2> { $q2 } </q2> } </r>
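(For anyone following along, the logic of the workaround above can be sketched in Python. The whitespace/regex tokenizer here is only a crude stand-in for BaseX's ft:tokenize, and the inline stop-word set stands in for stopwords.txt; both are assumptions for illustration, not BaseX behavior.)
```python
import re

def tokenize(text):
    # Rough stand-in for ft:tokenize: lowercase, split on non-alphanumeric
    # runs. Note the '&' disappears at this stage, just as in BaseX.
    return [t for t in re.split(r'[^0-9A-Za-z]+', text.lower()) if t]

def strip_stop_words(text, stop_words):
    # Mirrors: string-join(ft:tokenize($t)[not($sw(.))], ' ')
    return ' '.join(t for t in tokenize(text) if t not in stop_words)

stop_words = {'and', 'inc'}  # would normally come from stopwords.txt

q1 = strip_stop_words(
    'Frontier Science & Technology Research Foundation, Inc.', stop_words)
q2 = strip_stop_words(
    'Frontier Science and Technology Research Foundation, Inc.', stop_words)

# After filtering, both strings reduce to the same token sequence,
# so the contains-text comparison succeeds.
print(q1 == q2)  # True
```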
2. This fails:
let $t1 := 'Frontier Science & Technology Research Foundation, Inc.'
let $t2 := 'Frontier Science and Technology Research Foundation, Inc.'
where $t1 contains text { $t2 } using stop words at 'stopwords.txt' or
$t2 contains text { $t1 } using stop words at 'stopwords.txt'
return <r> { <q1> { $t1 } </q1>, <q2> { $t2 } </q2> } </r>
Any idea why?
Thanks,
Ron
On February 2, 2016 at 12:13:14 PM, Christian Grün ([email protected])
wrote:
Hi Ron,
I’m pretty sure that the default tokenizer discards the ampersand and
doesn’t pass it on as a token at all.
Hope this helps (…at least for understanding the query result),
Christian
On Tue, Feb 2, 2016 at 6:10 PM, Ron Katriel <[email protected]> wrote:
> Hi,
>
> Given this thesaurus entry
>
> <thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
> <entry>
> <term>&amp;</term>
> <synonym>
> <term>and</term>
> <relationship>USE</relationship>
> </synonym>
> </entry>
> </thesaurus>
>
> I was expecting the following query to return true (file path omitted for
> clarity):
>
> 'Frontier Science and Technology Research Foundation, Inc.' contains text
> 'Frontier Science & Technology Research Foundation, Inc.' using
> thesaurus at "thesaurus.xml"
>
> but it returns false. Switching the order of the term and synonym makes no
> difference.
>
> I tried getting around this using a stop word file (which includes 'and',
> '&', and '&amp;', just in case) but it does not work either.
>
> Am I missing something?
>
> Thanks,
> Ron
>