I agree that edit-distance sorting could be problematic, say if there are a large number of very similar results (hundreds rather than the 2 or 3 in the example). I suggested it because it provides a means to exercise complete control and achieve very specific results in a case that sounds like it has unusual requirements. I do agree it's only worth resorting to this sort of thing after exhausting all other options, though.
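
For illustration, here is a minimal sketch of the secondary sort done in an application layer, written in Python rather than XQuery. It assumes the search has already returned (score, title) pairs; the stop-word list, the tokenizer, and the names `tokens`, `edit_distance` and `rerank` are all hypothetical and not part of any MarkLogic API. Distance is counted in whole words, so an extra "(No.2)" costs 2 and a dropped stop word costs nothing.

```python
import re

# Assumed stop-word list; a real one would mirror the search configuration.
STOP_WORDS = {"a", "an", "and", "of", "the"}

T1 = ("National Health Service (Optical Charges and Payments) and (General "
      "Ophthalmic Services) (Amendment) (Wales) Regulations 2001")
T2 = ("National Health Service (Optical Charges and Payments) and (General "
      "Ophthalmic Services) (Amendment) (No.2) (Wales) Regulations 2001")
T3 = ("The National Health Service (Optical Charges and Payments) and (General "
      "Ophthalmic Services) (Amendment) (No.3) (Wales) Regulations 2001")
QUERY = ("National Health Service (Optical Charges & Payments) and (General "
         "Ophthalmic Services) (Amendment) (Wales) Regulations 2001")


def tokens(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def edit_distance(a, b):
    """Levenshtein distance between two token lists (each edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def rerank(query, results):
    """Sort (score, title) pairs by score descending, then by distance to query."""
    q = tokens(query)
    return sorted(results,
                  key=lambda r: (-r[0], edit_distance(q, tokens(r[1]))))


# Three hits with equal relevance scores, deliberately out of order.
ranked = rerank(QUERY, [(10, T2), (10, T3), (10, T1)])
```

With equal scores, the title without "(No.2)" or "(No.3)" sorts first. To limit the cost of having to fetch results before computing distances, one option would be to re-rank only the first page or the top N hits instead of the full result set.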
-Mike

On 4/7/2011 9:45 PM, Danny Sokolsky wrote:
> Mike, that is true that relevance is calculated based on fragments. But I do
> not think that will be a problem here.
>
> The distance-weight feature will also take that into account.
>
> The solutions I pointed to try to make the content with the phrases in the
> titles, the words in the title, and the other words being closer to them,
> have higher relevancy. The order by edit-distance Mike suggests I think
> might be problematic because you would have to get all of the results first
> in order to compute it.
>
> But as I said earlier, I am not clear that I am understanding exactly what
> Jeni is trying to accomplish.
>
> -Danny
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Michael Sokolov
> Sent: Thursday, April 07, 2011 6:32 PM
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Getting closest matches
>
> I doubt you will get what you are after using standard relevance
> calculations based on term frequency since the frequencies are
> calculated for the whole fragment (I think that's right - Danny can you
> confirm?) rather than the particular element being searched.
>
> You might consider implementing an edit-distance measure in xquery (or
> in some enclosing application layer if you have one) and sorting results
> with equal relevance weight as a secondary sort criterion:
>
> for $result in search(...)
> order by cts:score($result), edit-distance($query, $result)
> ...
>
> -Mike
>
> On 4/7/2011 4:33 PM, Jeni Tennison wrote:
>> Hi,
>>
>> My question is about trying to get back search results that favour the
>> lowest edit-distance between a search phrase and the content of an element.
>>
>> I'm dealing with a large set of legislation, and many items within this set
>> have very similar titles.
>> For example, there are three items named:
>>
>> * National Health Service (Optical Charges and Payments) and (General
>> Ophthalmic Services) (Amendment) (Wales) Regulations 2001
>> * National Health Service (Optical Charges and Payments) and (General
>> Ophthalmic Services) (Amendment) (No.2) (Wales) Regulations 2001
>> * The National Health Service (Optical Charges and Payments) and
>> (General Ophthalmic Services) (Amendment) (No.3) (Wales) Regulations 2001
>>
>> In general, I want to do a keyword search on these titles, so that a search
>> for "National Health Service" will bring back all three of the above; in
>> this case I don't particularly care about the order as they're all likely to
>> be of relevance.
>>
>> However, if I search for a full title, I want to make sure that the first
>> result is the one that matches that title best. That's easy if the title
>> exactly matches (or exactly matches with stemming variants): I have:
>>
>> cts:or-query((
>>   cts:element-value-query(xs:QName('dc:title'), $title, (), 10),
>>   ... more complex keyword-based search with lower weight ...
>> ))
>>
>> but I'm running into problems in the case where the match isn't a precise
>> one. A search for:
>>
>> "National Health Service (Optical Charges & Payments) and (General
>> Ophthalmic Services) (Amendment) (Wales) Regulations 2001"
>>
>> doesn't match any of the titles exactly because it's got a '&' rather than a
>> 'and', but it should still match (I exclude stop-words from the search) and
>> bring back the first in the above as the highest priority, because it's the
>> closest match to the string -- it doesn't contain an additional "(No. 2)" or
>> "The".
>>
>> So my question is how can I achieve this? Is there any way of ordering based
>> on edit distance? Or of including a negative-weighted query that would mean
>> a lower score to elements that contain additional terms?
>>
>> Any ideas appreciated,
>>
>> Jeni
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
