I think stop words have traditionally also been removed from queries as
an efficiency measure, to reduce the number of irrelevant results. A
system that has trouble handling a query that could in theory return
millions of results would try to prevent this by disallowing queries
for "the" and the like.
But I think there are better approaches to solving that problem
available to us now, and I am coming around to the view that stop-word
removal is not really useful if relevance calculations are functioning
properly.
-Mike
Danny Sokolsky wrote:
Hi Tim,
There is nothing wrong with using stop words, if that makes sense for
your application. I was trying to just suggest that you ask the
question and run some tests to see if removing stop words really makes
a difference in your application. I think it is highly
application-specific.
As for the relevance question: relevance is calculated from the term
frequency and the total number of documents. So if you add terms that
appear in almost every document to a search, and you have a
sufficiently large database (a large total number of documents), you
will tend to get the same documents back from that search in
approximately the same order, with or without stop words. Again, your
mileage may vary, and this can be very content-specific.
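To make that point concrete, here is a minimal sketch using the textbook
tf-idf formula. This is an assumption for illustration only: it is not
MarkLogic's actual scoring code, and the four-document corpus and the
function names are made up. It shows that a term found in every document
gets an inverse document frequency of zero, so adding it to the query
leaves the ranking unchanged.

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_tokens, doc_freq, n_docs):
    """Sum of tf * idf over the query terms for one document."""
    counts = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term occurs nowhere in the corpus
        idf = math.log(n_docs / df)  # exactly 0 when the term is in every document
        score += counts[term] * idf
    return score

def rank(query, docs):
    """Return document ids ordered by descending score (ties broken by id)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    terms = query.lower().split()
    scores = {d: tfidf_score(terms, t, doc_freq, len(docs))
              for d, t in tokenized.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical corpus: "the" appears in every document.
docs = {
    "a": "the journal of the american chemical society",
    "b": "the chemical record",
    "c": "the annals of the history of computing",
    "d": "the journal of computing",
}

with_stop = rank("the journal computing", docs)
without_stop = rank("journal computing", docs)
print(with_stop == without_stop)
```

A word that appears in most but not all documents would get a small
non-zero weight rather than zero, which is why the order is only
approximately the same in general.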
Partly, it comes down to this: is it better to answer the exact
question (query) that was asked or to infer what the user means by the
question they asked? So it seems to me it is an application issue.
-Danny
*From:* [email protected]
[mailto:[email protected]] *On Behalf Of *Tim
Meagher
*Sent:* Tuesday, September 01, 2009 11:17 AM
*To:* 'General Mark Logic Developer Discussion'
*Subject:* RE: [MarkLogic Dev General] "Stop words" using Marklogic
Hi Danny,
I have a similar need for using stopwords. I can't just weight some
elements in my search higher than others because I'm dealing primarily
with variations of a critical search field, i.e., a serial publication
title. It seems to me that removing stopwords from the search value in
conjunction with using cts:element-word-query() is the most fruitful
way to improve match results. It could be that I don't fully
understand the MarkLogic options that are provided to use relevancy in
such a case.
Thanks,
Tim Meagher
AAOM Consulting
------------------------------------------------------------------------
*From:* [email protected]
[mailto:[email protected]] *On Behalf Of *Danny
Sokolsky
*Sent:* Tuesday, September 01, 2009 12:16 PM
*To:* General Mark Logic Developer Discussion
*Subject:* RE: [MarkLogic Dev General] "Stop words" using Marklogic
Hi Mano,
MarkLogic Server does not really have a concept of stop words, per se.
A term is a term, and all the terms in a query are used to calculate
relevance. The relevance is calculated based on the term frequency and
the number of fragments in the database, so words that are typically
thought of as “stop words” will not add much to the scores of your
search results.
That being said, it is quite easy to have your application parse the
query text before generating a cts:query. For example, if your
application gets its text from users via a text box in a browser, you
can grab the text from the request and do an appropriate fn:replace on
the string, removing some list of stop words. I suspect for many stop
word lists, the performance of this would be fine, assuming the list
is not that large. Depending on how your application is written,
another approach might be to parse the query after you construct the
cts:query, removing unwanted terms. Each approach has advantages and
disadvantages.
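As a rough sketch of the first approach (cleaning the query text before
building the query), something like the following would do. It is written
in Python purely for illustration; in MarkLogic the same idea would be
expressed in XQuery. The stop-word list, the function name, and the
fallback for queries made up entirely of stop words are all assumptions,
not anything the server provides.

```python
import re

# Hypothetical stop-word list; a real application would choose its own.
STOP_WORDS = {"a", "an", "and", "of", "the"}

def strip_stop_words(query_text, stop_words=STOP_WORDS):
    """Remove stop words from raw query text, keeping other terms in order."""
    tokens = re.findall(r"\w+", query_text.lower())
    kept = [t for t in tokens if t not in stop_words]
    # If the query consisted only of stop words, keep the original tokens
    # rather than sending an empty search.
    return " ".join(kept if kept else tokens)

cleaned = strip_stop_words("The Journal of the American Chemical Society")
print(cleaned)
```

Matching on whole tokens rather than doing a plain substring replace
avoids clipping words that merely contain a stop word, such as "the"
inside "theory".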
Another question to ask yourself is this: do you really need to remove
the stop words? The main reason to remove them (it seems to me) is to
give more relevant answers, and I don’t think it will end up making
much difference for that. You might find better ways of improving your
relevance such as weighting some elements higher than others.
-Danny
*From:* [email protected]
[mailto:[email protected]] *On Behalf Of *mano m
*Sent:* Tuesday, September 01, 2009 6:59 AM
*To:* [email protected]
*Subject:* [MarkLogic Dev General] "Stop words" using Marklogic
Hi,
We need to implement "stop words" in a search application using
MarkLogic. Does MarkLogic support this through an API, or do we need
to implement our own logic to achieve this?
Please share your ideas.
*Thanks,*
Mano
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general