Sorry, I was at Lucene Revolution, and out of circulation for a while. I agree with your point that "it depends" (what correct behavior is). Seems like a lot of Solr/Lucene has that answer...
But as to stop words, the trend lately is to NOT remove them anyway. The whole stop word issue comes from a long time ago when disk space was much more dear. As to how much it increases the index size, I don't have a clue off the top of my head... Best Erick 2011/5/23 Mindaugas Žakšauskas <min...@gmail.com>: > Hi Erick, > > I think answer to this question depends which hat you put on. > > If you put search engine hat (or do similar things in, i.e. Google), > the results will be the same as what Lucene does at the moment. And > that's fair enough - getting more results in search engine world is > almost always better than getting less. Even if a bunch of slightly > irrelevant results is returned, nobody cares. > > But if you put a database hat, the world view suddenly changes. I am > sure there are plenty of people who use Lucene in situations where > they need exact matches and any excess results are not desirable. > > The root of the evil here is coming from the fact that stopwords are > not indexed and reasonable defaults have to be assumed in different > situations. Thinking of this, to return all data for stopword-only > query would probably be least expected and I don't disagree on your > argument about the mixed case, too. > > This probably leaves me with a single option which is not to use > stopwords at all, allowing me to get the best of the both worlds. Does > anyone have any experience on how much of increased index size > (roughly) can I expect? > > Regards, > Mindaugas > > On Mon, May 23, 2011 at 3:13 PM, Erick Erickson <erickerick...@gmail.com> > wrote: >> Hmmm, somehow I missed this days ago.... >> >> Anyway, the Lucene query parsing process isn't quite Boolean logic. >> I encourage you to think in terms of "required", "optional", and >> "prohibited". >> >> Both queries are equivalent, to see this try attaching &debugQuery=on >> to your URL and look at the "parsed query" in the debug info.... >> >> Anyway, to your qestion. >> +foo:bar +baz:"there is" >> >> reads that "bar" must appear in the field "foo". So far so good. >> But it's also required that baz contain the empty clause, which >> is different than saying baz must be empty. One can argue that >> any field contains, by definition, nothing. >> >> But imagine the impact of what you're requesting. If all stop words >> get removed, then no query would ever match yours. Which >> would be very counter-intuitive IMO. Your users have no clue >> that you've removed stopwords, so they'll sit there saying "Look, I >> KNOW that "bar" was in foo and I KNOW that "there is" was in >> baz, why the heck didn't this cursed system find my doc? >> >> Anyway, I don't think you really want this behavior in the >> stopword removal case. If you can post some use-cases where this >> would be desirable, maybe we can noodle about a solution.... >> >> Best >> Erick >> >> >> 2011/5/23 Mindaugas Žakšauskas <min...@gmail.com>: >>> Not much luck so far :( >>> >>> Just in case if anyone wants to earn some virtual dosh, I have added >>> some 50 bonus points to this question on StackOverflow: >>> >>> http://stackoverflow.com/questions/6H044061/lucene-query-parsing-behaviour-joining-query-parts-with-and >>> >>> I also promise to post a solution here if anything satisfactory turns up. >>> >>> m. >>> >>> 2011/5/17 Mindaugas Žakšauskas <min...@gmail.com>: >>>> Hi, >>>> >>>> Let's say we have an index having few documents indexed using >>>> StopAnalyzer.ENGLISH_STOP_WORDS_SET. The user issues two queries: >>>> 1) foo:bar >>>> 2) baz:"there is" >>>> >>>> Let's assume that the first query yields some results because there >>>> are documents matching that query. >>>> >>>> The second query contains two stopwords ("there" and "is") and yields >>>> 0 results. The reason for this is because when baz:"there is" is >>>> parsed, it ends up as a void query as both "there" and "is" are >>>> stopwords (technically speaking, this is converted to an empty >>>> BooleanQuery having no clauses). So far so good. >>>> >>>> However, any of the following combined queries >>>> >>>> +foo:bar +baz:"there is" >>>> foo:bar AND baz:"there is" >>>> >>>> behave exactly the same way as query +foo:bar, that is, brings back >>>> some results. The second AND part which is supposed to yield no >>>> results is completely ignored. >>>> >>>> One might argue that when ANDing both conditions have to be met, that >>>> is, documents having foo=bar and baz being empty have to be retrieved, >>>> as when issued seperately, baz:"there is" yields 0 results. >>>> >>>> It seem contradictory as an atomic query component has different >>>> impact on the overall query depending on the context. Is there any >>>> logical explanation for this? Can this be addressed in any way, >>>> preferably without writing own QueryAnalyzer? >>>> >>>> If this makes any difference, observed behaviour happens under Lucene >>>> v3.0.2. >>>> >>>> Regards, >>>> Mindaugas >>>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org