Jeni,

That's a pretty interesting requirement. I played with it and found that you 
can at least approximate that sort of thing by checking for every pair of 
adjacent words in the phrase. 

See the query below. I have both the "fast phrase" and "fast element phrase" 
search indexes turned on, and extra words do lower the score. My intuition is 
that there will be a performance penalty, however, and I'm not sure what flavor 
of edit distance you're looking for. You might also look at cts:near-query() 
with short distances of 2 or 3, particularly if you drop an intervening stop 
word or want to use 4.2's distance-weight as well.
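
A minimal sketch of the near-query idea, assuming the two /test documents from 
the query further down are already loaded. The word pair, the distance of 3 and 
the distance-weight of 2.0 are illustrative choices, not tuned values, and I 
have not measured how much the weight changes the ranking.

```xquery
xquery version "1.0-ml";

(: Sketch only: match "brown" followed by "fox" within 3 words of each other.
   The fourth argument is the distance-weight mentioned above (4.2 and later). :)
let $q :=
  cts:element-query(xs:QName("testprox"),
    cts:near-query(
      (cts:word-query("brown"), cts:word-query("fox")),
      3,          (: maximum distance between the two words :)
      "ordered",  (: require "brown" to come before "fox" :)
      2.0         (: distance-weight; an assumed value :)
    )
  )
return
  for $doc in cts:search(doc(), $q)
  return concat(base-uri($doc), ": ", cts:score($doc))
```

Both test documents should match here, since "brown or yellow fox" is still 
within the distance; the scores are what would tell them apart.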

Damon


let $d1 := <testprox>The quick brown or yellow fox jumped high over the dog</testprox>
let $d2 := <testprox>The quick brown fox jumped over the yellow dog</testprox>
return (
  xdmp:document-insert("/test/test1.xml", $d1),
  xdmp:document-insert("/test/test2.xml", $d2)
)

; (: transaction separator :)

let $q :=
  cts:element-query(xs:QName("testprox"),
    cts:or-query((
      "the quick",
      "quick brown",
      "brown fox",
      "fox jumped",
      "jumped over",
      "over the",
      "the yellow"
    ))
  )

return
  cts:search(doc(), $q, "score-simple")/concat(base-uri(.), ": ", cts:score(.))
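
For a real title you would not want to type the pairs by hand. The following 
is a rough sketch of generating them from the search string, assuming 
cts:tokenize() is used to split it into words; local:adjacent-pairs is a 
hypothetical helper name, not a built-in, and the title is just the second 
test document's text.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: returns every pair of adjacent words in $phrase as a
   two-word string, skipping punctuation and whitespace tokens. :)
declare function local:adjacent-pairs($phrase as xs:string) as xs:string*
{
  let $words := cts:tokenize($phrase)[. instance of cts:word]
  for $i in 1 to count($words) - 1
  return concat($words[$i], " ", $words[$i + 1])
};

let $title := "The quick brown fox jumped over the yellow dog"
let $q :=
  cts:element-query(xs:QName("testprox"),
    cts:or-query(
      (: one two-word phrase query per adjacent pair :)
      for $pair in local:adjacent-pairs($title)
      return cts:word-query($pair)
    )
  )
return
  for $doc in cts:search(doc(), $q, "score-simple")
  return concat(base-uri($doc), ": ", cts:score($doc))
```

Unlike the hand-written list above, this also produces the final pair 
("yellow dog"). A one-word search string produces no pairs at all, so that 
case would need a fallback to a plain word query.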
 
-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Jeni Tennison
Sent: Thursday, April 07, 2011 4:34 PM
To: General MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] Getting closest matches

Hi,

My question is about trying to get back search results that favour the lowest 
edit-distance between a search phrase and the content of an element.

I'm dealing with a large set of legislation, and many items within this set 
have very similar titles. For example, there are three items named:

  * National Health Service (Optical Charges and Payments) and (General 
Ophthalmic Services) (Amendment) (Wales) Regulations 2001
  * National Health Service (Optical Charges and Payments) and (General 
Ophthalmic Services) (Amendment) (No.2) (Wales) Regulations 2001
  * The National Health Service (Optical Charges and Payments) and (General 
Ophthalmic Services) (Amendment) (No.3) (Wales) Regulations 2001

In general, I want to do a keyword search on these titles, so that a search for 
"National Health Service" will bring back all three of the above; in this case 
I don't particularly care about the order as they're all likely to be of 
relevance.

However, if I search for a full title, I want to make sure that the first 
result is the one that matches that title best. That's easy if the title 
exactly matches (or exactly matches with stemming variants): I have:

  cts:or-query((
    cts:element-value-query(xs:QName('dc:title'), $title, (), 10),
    ... more complex keyword-based search with lower weight ...
  ))

but I'm running into problems in the case where the match isn't a precise one. 
A search for:

  "National Health Service (Optical Charges & Payments) and (General Ophthalmic 
Services) (Amendment) (Wales) Regulations 2001"

doesn't match any of the titles exactly because it has an '&' rather than an 
'and', but it should still match (I exclude stop-words from the search) and 
bring back the first in the above as the highest priority, because it's the 
closest match to the string -- it doesn't contain an additional "(No. 2)" or 
"The".

So my question is how can I achieve this? Is there any way of ordering based on 
edit distance? Or of including a negative-weighted query that would mean a 
lower score to elements that contain additional terms?

Any ideas appreciated,

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general