I doubt you will get what you are after using standard relevance
calculations based on term frequency, since the frequencies are
calculated for the whole fragment (I think that's right - Danny, can you
confirm?) rather than for the particular element being searched.

You might consider implementing an edit-distance measure in XQuery (or
in some enclosing application layer if you have one) and using it as a
secondary sort criterion, so that results with equal relevance scores
are ordered by edit distance:
(: highest score first; ties broken by smallest edit distance :)
for $result in search(...)
order by cts:score($result) descending, edit-distance($query, $result)
...
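
There is no built-in edit-distance function that I know of, so here is a
minimal sketch of one in plain XQuery 1.0, in case it helps. It computes
the character-level Levenshtein distance one row at a time. The function
names (local:edit-distance, local:rows, local:next-row) are my own
inventions, not part of any library, and a word-level distance might well
suit long titles better than a character-level one.

```xquery
xquery version "1.0";

(: Hypothetical helper: builds one row of the Levenshtein table.
   $prev is the previous row, $a the codepoints of the first string,
   $ch the current codepoint of the second string, $j the column being
   filled (1..count($a)), and $acc the part of the row built so far. :)
declare function local:next-row(
  $prev as xs:integer*,
  $a as xs:integer*,
  $ch as xs:integer,
  $j as xs:integer,
  $acc as xs:integer*
) as xs:integer* {
  if ($j > count($a)) then $acc
  else
    let $cost := if ($a[$j] = $ch) then 0 else 1
    let $cell := min((
      $prev[$j + 1] + 1,    (: deletion :)
      $acc[last()] + 1,     (: insertion :)
      $prev[$j] + $cost     (: substitution, or match when $cost is 0 :)
    ))
    return local:next-row($prev, $a, $ch, $j + 1, ($acc, $cell))
};

(: Hypothetical helper: walks the second string one codepoint at a time,
   replacing the previous row with the newly computed one. :)
declare function local:rows(
  $prev as xs:integer*,
  $a as xs:integer*,
  $b as xs:integer*,
  $i as xs:integer
) as xs:integer {
  if ($i > count($b)) then $prev[last()]
  else local:rows(local:next-row($prev, $a, $b[$i], 1, $i), $a, $b, $i + 1)
};

(: Character-level Levenshtein distance between two strings. :)
declare function local:edit-distance(
  $s as xs:string,
  $t as xs:string
) as xs:integer {
  local:rows(
    0 to string-length($s),
    string-to-codepoints($s),
    string-to-codepoints($t),
    1
  )
};

local:edit-distance("kitten", "sitting")
```

For the tie-break, the second sort key could then be something like
local:edit-distance($query, string($result//dc:title)), assuming each
result contains exactly one dc:title element. The cost grows with the
product of the two string lengths, so it is probably best applied only to
the first page of results rather than to every match.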
-Mike
On 4/7/2011 4:33 PM, Jeni Tennison wrote:
> Hi,
>
> My question is about trying to get back search results that favour the lowest
> edit-distance between a search phrase and the content of an element.
>
> I'm dealing with a large set of legislation, and many items within this set
> have very similar titles. For example, there are three items named:
>
> * National Health Service (Optical Charges and Payments) and (General
> Ophthalmic Services) (Amendment) (Wales) Regulations 2001
> * National Health Service (Optical Charges and Payments) and (General
> Ophthalmic Services) (Amendment) (No.2) (Wales) Regulations 2001
> * The National Health Service (Optical Charges and Payments) and (General
> Ophthalmic Services) (Amendment) (No.3) (Wales) Regulations 2001
>
> In general, I want to do a keyword search on these titles, so that a search
> for "National Health Service" will bring back all three of the above; in this
> case I don't particularly care about the order as they're all likely to be of
> relevance.
>
> However, if I search for a full title, I want to make sure that the first
> result is the one that matches that title best. That's easy if the title
> exactly matches (or exactly matches with stemming variants): I have:
>
> cts:or-query((
> cts:element-value-query(xs:QName('dc:title'), $title, (), 10),
> ... more complex keyword-based search with lower weight ...
> ))
>
> but I'm running into problems in the case where the match isn't a precise
> one. A search for:
>
> "National Health Service (Optical Charges & Payments) and (General
> Ophthalmic Services) (Amendment) (Wales) Regulations 2001"
>
> doesn't match any of the titles exactly because it's got a '&' rather than a
> 'and', but it should still match (I exclude stop-words from the search) and
> bring back the first in the above as the highest priority, because it's the
> closest match to the string -- it doesn't contain an additional "(No. 2)" or
> "The".
>
> So my question is how can I achieve this? Is there any way of ordering based
> on edit distance? Or of including a negative-weighted query that would mean a
> lower score to elements that contain additional terms?
>
> Any ideas appreciated,
>
> Jeni