I doubt you will get what you are after using standard relevance 
calculations based on term frequency, since the frequencies are 
calculated for the whole fragment (I think that's right - Danny, can you 
confirm?) rather than for the particular element being searched.

You might consider implementing an edit-distance measure in XQuery (or 
in some enclosing application layer if you have one) and using it as a 
secondary sort criterion to break ties between results with equal 
relevance scores:

for $result in search(...)
order by cts:score($result) descending,
         edit-distance($query, $result) ascending
...
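
If the re-ranking ends up in an enclosing application layer rather than in XQuery, a minimal sketch of the same idea might look like the following. It assumes the search results have already been fetched as (score, title) pairs; the names levenshtein and rerank are illustrative only and are not part of any MarkLogic API.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    # Keep the shorter string on the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def rerank(query: str, results):
    """Sort (score, title) pairs by score descending, then edit distance ascending.

    Assumption: scores are numeric and comparable, as returned by the search.
    """
    q = query.lower()
    return sorted(results, key=lambda r: (-r[0], levenshtein(q, r[1].lower())))
```

Because the distance is only used to break ties between equal scores, it has to be computed for every result that is sorted, so it is probably worth limiting this to the first page or so of results.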

-Mike

On 4/7/2011 4:33 PM, Jeni Tennison wrote:
> Hi,
>
> My question is about trying to get back search results that favour the lowest 
> edit-distance between a search phrase and the content of an element.
>
> I'm dealing with a large set of legislation, and many items within this set 
> have very similar titles. For example, there are three items named:
>
>    * National Health Service (Optical Charges and Payments) and (General 
> Ophthalmic Services) (Amendment) (Wales) Regulations 2001
>    * National Health Service (Optical Charges and Payments) and (General 
> Ophthalmic Services) (Amendment) (No.2) (Wales) Regulations 2001
>    * The National Health Service (Optical Charges and Payments) and (General 
> Ophthalmic Services) (Amendment) (No.3) (Wales) Regulations 2001
>
> In general, I want to do a keyword search on these titles, so that a search 
> for "National Health Service" will bring back all three of the above; in this 
> case I don't particularly care about the order as they're all likely to be of 
> relevance.
>
> However, if I search for a full title, I want to make sure that the first 
> result is the one that matches that title best. That's easy if the title 
> exactly matches (or exactly matches with stemming variants): I have:
>
>    cts:or-query((
>      cts:element-value-query(xs:QName('dc:title'), $title, (), 10),
>      ... more complex keyword-based search with lower weight ...
>    ))
>
> but I'm running into problems in the case where the match isn't a precise 
> one. A search for:
>
>    "National Health Service (Optical Charges & Payments) and (General 
> Ophthalmic Services) (Amendment) (Wales) Regulations 2001"
>
> doesn't match any of the titles exactly because it's got a '&' rather than a 
> 'and', but it should still match (I exclude stop-words from the search) and 
> bring back the first in the above as the highest priority, because it's the 
> closest match to the string -- it doesn't contain an additional "(No. 2)" or 
> "The".
>
> So my question is how can I achieve this? Is there any way of ordering based 
> on edit distance? Or of including a negative-weighted query that would mean a 
> lower score to elements that contain additional terms?
>
> Any ideas appreciated,
>
> Jeni
