I think stop words have traditionally also been removed from queries as an efficiency measure, to cut down the number of irrelevant results. A system that has trouble handling a query that could in theory return millions of results would want to prevent that by disallowing queries for "the" and the like.

But I think there are better approaches to solving that problem available to us now, and I am coming around to the view that stop-word removal is not really useful if relevance calculations are functioning properly.

-Mike

Danny Sokolsky wrote:

Hi Tim,

There is nothing wrong with using stop words, if that makes sense for your application. I was just trying to suggest that you ask the question and run some tests to see whether removing stop words really makes a difference in your application. I think it is highly application-specific.

As for the relevancy question: relevance is calculated from the term frequency and the total number of documents. So if you add terms that appear in almost every document to a search against a sufficiently large database (a large total number of documents), you will tend to get the same documents back in approximately the same order, with or without stop words. Again, your mileage may vary, and this can be very content-specific.
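
To make the point above concrete, here is a minimal sketch in Python. It uses a textbook TF-IDF score, which is an assumption for illustration and not MarkLogic's actual relevance formula; the documents and the `rank` function are hypothetical. In this toy corpus "the" and "of" occur in every document, so their inverse document frequency is log(1) = 0 and they contribute nothing to any score. When a term occurs in nearly, but not quite, every document, its weight is small rather than zero, which is why the ordering is only approximately the same in practice.

```python
import math
from collections import Counter

# Hypothetical toy corpus: every document contains "the" and "of".
DOCS = [
    "the journal of the history of medicine",
    "the history of the art of printing",
    "the journal of the book trade",
    "the law of the sea",
    "the practice of medicine and the law",
]


def rank(query_terms, docs=DOCS):
    """Return document indexes ordered by descending TF-IDF score."""
    counts_per_doc = [Counter(doc.lower().split()) for doc in docs]
    n = len(docs)

    def idf(term):
        # Inverse document frequency: log(N / number of docs containing term).
        df = sum(1 for counts in counts_per_doc if term in counts)
        return math.log(n / df) if df else 0.0

    scores = [
        sum(counts[term] * idf(term) for term in query_terms)
        for counts in counts_per_doc
    ]
    # Ties are broken by document position so the ordering is deterministic.
    return sorted(range(n), key=lambda i: (-scores[i], i))


with_stop_words = rank(["the", "journal", "of", "medicine"])
without_stop_words = rank(["journal", "medicine"])
print(with_stop_words == without_stop_words)
```

Both queries rank the first document highest, because it is the only one that contains both "journal" and "medicine".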

Partly, it comes down to this: is it better to answer the exact question (query) that was asked or to infer what the user means by the question they asked? So it seems to me it is an application issue.

-Danny

*From:* [email protected] [mailto:[email protected]] *On Behalf Of *Tim Meagher
*Sent:* Tuesday, September 01, 2009 11:17 AM
*To:* 'General Mark Logic Developer Discussion'
*Subject:* RE: [MarkLogic Dev General] "Stop words" using Marklogic

Hi Danny,

I have a similar need for using stopwords. I can't just weight some elements in my search higher than others because I'm dealing primarily with variations of a critical search field, i.e., a serial publication title. It seems to me that removing stopwords from the search value in conjunction with using cts:element-word-query() is the most fruitful way to improve match results. It could be that I don't fully understand the MarkLogic options that are provided to use relevancy in such a case.

Thanks,

Tim Meagher

AAOM Consulting

------------------------------------------------------------------------

*From:* [email protected] [mailto:[email protected]] *On Behalf Of *Danny Sokolsky
*Sent:* Tuesday, September 01, 2009 12:16 PM
*To:* General Mark Logic Developer Discussion
*Subject:* RE: [MarkLogic Dev General] "Stop words" using Marklogic

Hi Mano,

MarkLogic Server does not really have a concept of stop words, per se. A term is a term, and all the terms in a query are used to calculate relevance. The relevance is calculated based on the term frequency and the number of fragments in the database, so words that are typically thought of as “stop words” will not add much to the score of a search result.

That being said, it is quite easy to have your application parse the query text before generating a cts:query. For example, if your application gets its text from users via a text box in a browser, you can grab the text from the request and do an appropriate fn:replace on the string, removing some list of stop words. I suspect for many stop word lists, the performance of this would be fine, assuming the list is not that large. Depending on how your application is written, another approach might be to parse the query after you construct the cts:query, removing unwanted terms. Each approach has advantages and disadvantages.
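
The first approach, cleaning the query text before any cts:query is built, might look roughly like the sketch below. It is written in Python only to show the logic; in a MarkLogic application the same step would be done in XQuery. The stop-word list, the function name, and the fallback for queries made up entirely of stop words are all assumptions for illustration, not anything MarkLogic provides.

```python
# Hypothetical stop-word list; a real application would choose its own.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Return the query with stop words removed (case-insensitive match)."""
    terms = query.split()  # whitespace tokenization only; punctuation is kept
    kept = [term for term in terms if term.lower() not in stop_words]
    # Assumed design choice: if every term was a stop word, keep the original
    # terms so the search does not end up with an empty query.
    return " ".join(kept if kept else terms)


print(strip_stop_words("The Journal of the American Medical Association"))
```

The cleaned string would then be handed to whatever code builds the query. Because the split is on whitespace only, a term with attached punctuation such as "the," is not recognized as a stop word here; a real application would need to decide how to tokenize.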

Another question to ask yourself is this: do you really need to remove the stop words? The main reason to remove them (it seems to me) is to give more relevant answers, and I don’t think it will end up making much difference for that. You might find better ways of improving your relevance such as weighting some elements higher than others.

-Danny

*From:* [email protected] [mailto:[email protected]] *On Behalf Of *mano m
*Sent:* Tuesday, September 01, 2009 6:59 AM
*To:* [email protected]
*Subject:* [MarkLogic Dev General] "Stop words" using Marklogic

Hi,

We need to implement "stop words" in a search application using MarkLogic. Does MarkLogic support this through an API, or do we need to implement our own logic to achieve it?

Please share your ideas.

*Thanks,*

Mano

------------------------------------------------------------------------


_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
