Mike, it is true that relevance is calculated based on fragments, but I do 
not think that will be a problem here.

The distance-weight feature will also take that into account.

The solutions I pointed to try to give higher relevance to content that 
contains the phrases in the title, the words in the title, and other words 
close to them.  I think the ordering by edit distance that Mike suggests might 
be problematic, because you would have to retrieve all of the results first in 
order to compute it.

But as I said earlier, I am not sure I understand exactly what Jeni is trying 
to accomplish.

-Danny

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Michael Sokolov
Sent: Thursday, April 07, 2011 6:32 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Getting closest matches

I doubt you will get what you are after using standard relevance 
calculations based on term frequency since the frequencies are 
calculated for the whole fragment (I think that's right - Danny can you 
confirm?) rather than the particular element being searched.

You might consider implementing an edit-distance measure in xquery (or 
in some enclosing application layer if you have one) and using it as a 
secondary sort criterion for results with equal relevance weight:

for $result in search(...)
   (: highest relevance first; smallest edit distance breaks ties :)
   order by cts:score($result) descending,
            edit-distance($query, $result) ascending
  ...
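
If the sort is done in an enclosing application layer instead, the same idea 
might look roughly like the sketch below. It is only an illustration: the 
names edit_distance and rank, the (title, score) tuples and the equal scores 
are all assumptions, and character-level Levenshtein distance is just one 
possible edit-distance measure.

```python
# Sketch only: assumes the search layer has already returned (title, score)
# pairs. The scores below are invented for illustration.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: inserts, deletes and substitutions cost 1 each."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def rank(query, results):
    """Sort by score (highest first), then by edit distance to the query."""
    return sorted(results, key=lambda r: (-r[1], edit_distance(query, r[0])))


titles = [
    "National Health Service (Optical Charges and Payments) and (General "
    "Ophthalmic Services) (Amendment) (Wales) Regulations 2001",
    "National Health Service (Optical Charges and Payments) and (General "
    "Ophthalmic Services) (Amendment) (No.2) (Wales) Regulations 2001",
    "The National Health Service (Optical Charges and Payments) and (General "
    "Ophthalmic Services) (Amendment) (No.3) (Wales) Regulations 2001",
]
query = ("National Health Service (Optical Charges & Payments) and (General "
         "Ophthalmic Services) (Amendment) (Wales) Regulations 2001")

# Hypothetical results with identical scores, deliberately out of order.
results = [(titles[2], 1.0), (titles[0], 1.0), (titles[1], 1.0)]
ranked = rank(query, results)
```

With equal scores the title without the extra "(No.2)" or "The" sorts first, 
since it differs from the query only in "&" versus "and". Note that every 
result still has to be fetched before it can be sorted this way.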

-Mike

On 4/7/2011 4:33 PM, Jeni Tennison wrote:
> Hi,
>
> My question is about trying to get back search results that favour the lowest 
> edit-distance between a search phrase and the content of an element.
>
> I'm dealing with a large set of legislation, and many items within this set 
> have very similar titles. For example, there are three items named:
>
>    * National Health Service (Optical Charges and Payments) and (General 
> Ophthalmic Services) (Amendment) (Wales) Regulations 2001
>    * National Health Service (Optical Charges and Payments) and (General 
> Ophthalmic Services) (Amendment) (No.2) (Wales) Regulations 2001
>    * The National Health Service (Optical Charges and Payments) and (General 
> Ophthalmic Services) (Amendment) (No.3) (Wales) Regulations 2001
>
> In general, I want to do a keyword search on these titles, so that a search 
> for "National Health Service" will bring back all three of the above; in this 
> case I don't particularly care about the order as they're all likely to be of 
> relevance.
>
> However, if I search for a full title, I want to make sure that the first 
> result is the one that matches that title best. That's easy if the title 
> exactly matches (or exactly matches with stemming variants): I have:
>
>    cts:or-query((
>      cts:element-value-query(xs:QName('dc:title'), $title, (), 10),
>      ... more complex keyword-based search with lower weight ...
>    ))
>
> but I'm running into problems in the case where the match isn't a precise 
> one. A search for:
>
>    "National Health Service (Optical Charges & Payments) and (General 
> Ophthalmic Services) (Amendment) (Wales) Regulations 2001"
>
> doesn't match any of the titles exactly because it's got a '&' rather than a 
> 'and', but it should still match (I exclude stop-words from the search) and 
> bring back the first in the above as the highest priority, because it's the 
> closest match to the string -- it doesn't contain an additional "(No. 2)" or 
> "The".
>
> So my question is how can I achieve this? Is there any way of ordering based 
> on edit distance? Or of including a negative-weighted query that would mean a 
> lower score to elements that contain additional terms?
>
> Any ideas appreciated,
>
> Jeni

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general