Re: [MarkLogic Dev General] Getting closest matches

Michael Sokolov Fri, 08 Apr 2011 04:08:26 -0700

I agree that the edit-distance sorting could be problematic, say, if 
there are a large number of very similar results (100s rather than the 2 or 
3 in the example). I suggested it, though, because it provides a 
means to exercise complete control and achieve very specific results in 
a case that sounds like it has unusual requirements. I do agree it's 
only worth resorting to this sort of thing after exhausting all other 
options.
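
For illustration, here is a minimal sketch of the secondary sort done in an 
enclosing application layer, written in Python. Everything in it is an 
assumption rather than anything MarkLogic provides: the stop-word list, the 
word-level (rather than character-level) distance, the helper names 
tokens/edit_distance/rank, and the equal relevance scores attached to the 
three titles from Jeni's example. It only re-orders results that have already 
been fetched, so it is subject to the same caveat about needing the results 
in hand first.

```python
import re

# Assumption: a tiny illustrative stop-word list, not MarkLogic's own.
STOP_WORDS = {"the", "and", "of", "a", "an"}


def tokens(title):
    """Lowercase, treat '&' as 'and', split into words, drop stop words."""
    words = re.findall(r"[a-z0-9]+", title.lower().replace("&", " and "))
    return [w for w in words if w not in STOP_WORDS]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank(query, results):
    """Sort (score, title) pairs: score descending, then distance ascending."""
    q = tokens(query)
    return sorted(results,
                  key=lambda r: (-r[0], edit_distance(q, tokens(r[1]))))


COMMON = ("National Health Service (Optical Charges and Payments) and "
          "(General Ophthalmic Services) (Amendment) ")
PLAIN = COMMON + "(Wales) Regulations 2001"
NO2 = COMMON + "(No.2) (Wales) Regulations 2001"
NO3 = "The " + COMMON + "(No.3) (Wales) Regulations 2001"

QUERY = ("National Health Service (Optical Charges & Payments) and "
         "(General Ophthalmic Services) (Amendment) (Wales) Regulations 2001")

# Hypothetical: all three titles came back with the same relevance score.
ranked = rank(QUERY, [(10, NO3), (10, NO2), (10, PLAIN)])
best_title = ranked[0][1]
```

With these assumptions the title without "(No.2)", "(No.3)" or "The" sorts 
first, because after normalisation its word list is identical to the query's. 
Comparing words rather than characters keeps the cost down for long titles, 
but either granularity works with the same function.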


-Mike

On 4/7/2011 9:45 PM, Danny Sokolsky wrote:
> Mike, it is true that relevance is calculated based on fragments.  But I do 
> not think that will be a problem here.
>
> The distance-weight feature will also take that into account.
>
> The solutions I pointed to try to give higher relevance to content that has 
> the phrase in the title, the words in the title, and the other words close 
> to them.  I think the order by edit-distance that Mike suggests might be 
> problematic, because you would have to get all of the results first in 
> order to compute it.
>
> But as I said earlier, I am not sure I understand exactly what Jeni is 
> trying to accomplish.
>
> -Danny
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Michael Sokolov
> Sent: Thursday, April 07, 2011 6:32 PM
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Getting closest matches
>
> I doubt you will get what you are after using standard relevance
> calculations based on term frequency since the frequencies are
> calculated for the whole fragment (I think that's right - Danny can you
> confirm?) rather than the particular element being searched.
>
> You might consider implementing an edit-distance measure in xquery (or
> in some enclosing application layer if you have one) and sorting results
> with equal relevance weight as a secondary sort criterion:
>
> for $result in search(...)
>     (: highest relevance first; edit distance only breaks ties :)
>     order by cts:score($result) descending, edit-distance($query, $result)
>    ...
>
> -Mike
>
> On 4/7/2011 4:33 PM, Jeni Tennison wrote:
>> Hi,
>>
>> My question is about trying to get back search results that favour the 
>> lowest edit-distance between a search phrase and the content of an element.
>>
>> I'm dealing with a large set of legislation, and many items within this set 
>> have very similar titles. For example, there are three items named:
>>
>>     * National Health Service (Optical Charges and Payments) and (General 
>> Ophthalmic Services) (Amendment) (Wales) Regulations 2001
>>     * National Health Service (Optical Charges and Payments) and (General 
>> Ophthalmic Services) (Amendment) (No.2) (Wales) Regulations 2001
>>     * The National Health Service (Optical Charges and Payments) and 
>> (General Ophthalmic Services) (Amendment) (No.3) (Wales) Regulations 2001
>>
>> In general, I want to do a keyword search on these titles, so that a search 
>> for "National Health Service" will bring back all three of the above; in 
>> this case I don't particularly care about the order as they're all likely to 
>> be of relevance.
>>
>> However, if I search for a full title, I want to make sure that the first 
>> result is the one that matches that title best. That's easy if the title 
>> exactly matches (or exactly matches with stemming variants): I have:
>>
>>     cts:or-query((
>>       cts:element-value-query(xs:QName('dc:title'), $title, (), 10),
>>       ... more complex keyword-based search with lower weight ...
>>     ))
>>
>> but I'm running into problems in the case where the match isn't a precise 
>> one. A search for:
>>
>>     "National Health Service (Optical Charges & Payments) and (General 
>> Ophthalmic Services) (Amendment) (Wales) Regulations 2001"
>>
>> doesn't match any of the titles exactly because it's got a '&' rather than a 
>> 'and', but it should still match (I exclude stop-words from the search) and 
>> bring back the first in the above as the highest priority, because it's the 
>> closest match to the string -- it doesn't contain an additional "(No. 2)" or 
>> "The".
>>
>> So my question is how can I achieve this? Is there any way of ordering based 
>> on edit distance? Or of including a negative-weighted query that would mean 
>> a lower score to elements that contain additional terms?
>>
>> Any ideas appreciated,
>>
>> Jeni
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general

