Hi Jeni,
I may not be understanding what you are asking here, but I'll give it a try.
Here are 3 ideas:
1) Have you thought about boosting the title element at index time? You can
assign a weight to a specific element in the Word Query configuration of a
database. Then, when a search matches in that element, the match is
automatically boosted. This applies to all word-query searches against the
database.
2) To favor matches on the full title, you might try an or-query that combines
your normal word-query with a phrase query for the whole search string,
possibly boosting the weight of the phrase a bit. So a search for "national
health services" can be parsed into a query like the following (you will need
word positions enabled for this to be effective):
  cts:or-query((
    cts:and-query((
      cts:word-query("national"),
      cts:word-query("health"),
      cts:word-query("services") )),
    (: the phrase match, with no options and a boosted weight of 2 :)
    cts:word-query("national health services", (), 2) ))
3) If you are running 4.2, you can use the distance-weight option to boost
words that are close together.
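
To make ideas 2 and 3 a bit more concrete, here is a rough sketch of how an arbitrary search string might be turned into those queries. The local:* function names are made up for illustration, and the weight and distance values are arbitrary. I am assuming that cts:tokenize is available to split the string into words and that the distance weight is passed as the fourth argument of cts:near-query, so check both against the documentation for your release. As with the example above, word positions need to be enabled for the phrase and proximity parts to be effective.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: return only the word tokens of a search string,
   dropping punctuation and whitespace tokens. :)
declare function local:words($search as xs:string) as xs:string*
{
  for $token in cts:tokenize($search)
  where $token instance of cts:word
  return fn:string($token)
};

(: Idea 2: all words must match, and a match on the whole phrase
   contributes extra score through $phrase-weight. :)
declare function local:phrase-boosted-query(
  $search as xs:string,
  $phrase-weight as xs:double
) as cts:query
{
  cts:or-query((
    cts:and-query(
      for $word in local:words($search)
      return cts:word-query($word)
    ),
    cts:word-query($search, (), $phrase-weight)
  ))
};

(: Idea 3: words must occur within $distance of each other.
   Assumption: the fourth argument of cts:near-query is the distance weight. :)
declare function local:proximity-query(
  $search as xs:string,
  $distance as xs:double,
  $distance-weight as xs:double
) as cts:query
{
  cts:near-query(
    for $word in local:words($search)
    return cts:word-query($word),
    $distance,
    (),
    $distance-weight
  )
};

(: Return both queries so they can be inspected before use in cts:search. :)
(
  local:phrase-boosted-query("national health services", 2),
  local:proximity-query("national health services", 10, 2)
)
```

Either result can be passed as the query argument of cts:search. Note that a search string with no word tokens would produce an empty and-query, which matches everything, so you would probably want to guard against that case.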
Some of this is touched on in the Search Developer's Guide:
http://docs.marklogic.com/4.2doc/docapp.xqy#display.xqy?fname=http://pubs/4.2doc/xml/search-dev-guide/relevance.xml%2334743
Hope that helps.
-Danny
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Jeni Tennison
Sent: Thursday, April 07, 2011 1:34 PM
To: General MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] Getting closest matches
Hi,
My question is about trying to get back search results that favour the lowest
edit-distance between a search phrase and the content of an element.
I'm dealing with a large set of legislation, and many items within this set
have very similar titles. For example, there are three items named:
* National Health Service (Optical Charges and Payments) and (General
Ophthalmic Services) (Amendment) (Wales) Regulations 2001
* National Health Service (Optical Charges and Payments) and (General
Ophthalmic Services) (Amendment) (No.2) (Wales) Regulations 2001
* The National Health Service (Optical Charges and Payments) and (General
Ophthalmic Services) (Amendment) (No.3) (Wales) Regulations 2001
In general, I want to do a keyword search on these titles, so that a search for
"National Health Service" will bring back all three of the above; in this case
I don't particularly care about the order as they're all likely to be of
relevance.
However, if I search for a full title, I want to make sure that the first
result is the one that matches that title best. That's easy if the title
exactly matches (or exactly matches with stemming variants): I have:
  cts:or-query((
    cts:element-value-query(xs:QName('dc:title'), $title, (), 10),
    ... more complex keyword-based search with lower weight ...
  ))
but I'm running into problems in the case where the match isn't a precise one.
A search for:
"National Health Service (Optical Charges & Payments) and (General Ophthalmic
Services) (Amendment) (Wales) Regulations 2001"
doesn't match any of the titles exactly because it's got an '&' rather than an
'and', but it should still match (I exclude stop-words from the search) and
bring back the first in the above as the highest priority, because it's the
closest match to the string -- it doesn't contain an additional "(No. 2)" or
"The".
So my question is how can I achieve this? Is there any way of ordering based on
edit distance? Or of including a negative-weighted query that would mean a
lower score to elements that contain additional terms?
Any ideas appreciated,
Jeni
--
Jeni Tennison
http://www.jenitennison.com
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general