On Fri, Mar 7, 2014 at 9:01 PM, Da Huang <[email protected]> wrote:
> Hello, everyone,
>
> My name is Da Huang. I'm studying for my master's degree in Computer Science
> at Peking University. I have been using Lucene for about half a year. It's
> so elegant that I hope to have a chance to contribute some code to it.

Welcome!

> Therefore, I have been scanning the JIRA GSoC 2014 Ideas page about Lucene for
> several days. I find "LUCENE-3333: Specialize DisjunctionScorer if all
> clauses are TermQueries" more suitable for me to do. I have spent some time
> scanning the relevant code, and the issue "LUCENE-3328", from which
> "LUCENE-3333" was spun off. The following questions confuse me.
>
> 1) I have checked out the code from
> "http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk", but I
> couldn't find the relevant code of the fixed issue "LUCENE-3328". It seems
> that the patch attached on the page is not on the trunk. Why?

Well, some time after LUCENE-3328, we made further changes and
discovered that this specialized scorer was not in fact [that
much?] faster. I forget which issue removed it... but you could
probably find it with some svn archaeology.

Net/net the trend in Lucene has been against adding source code
specialization, since this is really code duplication to try to make
hotspot's life easier. Unfortunately, it does sometimes work; e.g. see
http://blog.mikemccandless.com/2013/06/screaming-fast-lucene-searches-using-c.html
though that's not a fair comparison since it was also a different
programming language!

So, while it's a nice tradeoff for performance, it's a poor tradeoff
for ongoing code management. See all the specialized collectors we
have in TopFieldCollector!

So I'm not sure at this point if we should even pursue LUCENE-3333.

There are however tons of other things to fix on the search side;
maybe we could craft a good GSoC project from something else; e.g.:

  - we should pass a needsScorers boolean up-front to Weight.scorer

  - disjunctions now score during matching

  - BooleanScorer should sometimes be used for MUST clauses

  - We sort of duplicate code across BooleanQuery, FilteredQuery,
    BooleanFilter, TermsFilter

  - Somehow, Filter and Query should be more "combined"; e.g. you
    should be able to add a Filter as a clause onto a BooleanQuery

  - "Post filtering" is too hard to use today

  - ...

> 2) My intuitive idea for solving this issue is to make a class
> "DisjunctionTermScorer" to handle the case where all clauses are
> TermQueries, and then to decide whether to use DisjunctionTermScorer in
> the method 'scorer' in class BooleanQuery. Is this idea right?

Yes this would be the right idea.

> Above are my questions about "LUCENE-3333". Besides, I would like to propose
> the following issue, which is about the QueryParser.
>
> When we use QueryParser to parse a query string like "science AND
> (engineering AND technology)", the generated query is "+science
> (+engineering +technology)". I think it would be more efficient for
> searching if the final query were "+science +engineering +technology". My idea
> is to flatten cascaded ANDs and cascaded ORs. Do you agree? I hope I
> have made my idea clear.

I think this would make tons of sense; the only "challenge" is that
this will change how scores are computed, when coord is enabled.

I'm not sure how much that'd matter in practice; if it is important to
preserve that, then maybe we could still make a single 3-clause
BooleanQuery, but somehow remember the original structure for the sake
of coord scoring ... not sure.

Mike McCandless

http://blog.mikemccandless.com
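
To make the flattening idea from the last question concrete, here is a minimal sketch of how cascaded AND (and cascaded OR) clauses could be collapsed. Everything in it is an assumption for illustration: Node, Term, Clause, Bool, Occur, isPure and flatten are hypothetical stand-ins and not Lucene's BooleanQuery API, and the nested clause is modelled as required (MUST), so the unflattened form prints with a "+" in front of the parentheses. It only considers which documents match; MUST_NOT, boosts, minimum-should-match and the coord effect mentioned above are all ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types; this is NOT Lucene's BooleanQuery API.
class FlattenSketch {
    enum Occur { MUST, SHOULD }

    static abstract class Node {}

    static final class Term extends Node {
        final String text;
        Term(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    static final class Clause {
        final Occur occur;
        final Node node;
        Clause(Occur occur, Node node) { this.occur = occur; this.node = node; }
    }

    static final class Bool extends Node {
        final List<Clause> clauses = new ArrayList<>();

        Bool add(Occur occur, Node node) {
            clauses.add(new Clause(occur, node));
            return this;
        }

        // Renders MUST clauses with a leading '+' and nested Bools in parentheses.
        @Override public String toString() {
            StringBuilder sb = new StringBuilder();
            for (Clause c : clauses) {
                if (sb.length() > 0) sb.append(' ');
                if (c.occur == Occur.MUST) sb.append('+');
                if (c.node instanceof Bool) sb.append('(').append(c.node).append(')');
                else sb.append(c.node);
            }
            return sb.toString();
        }
    }

    // True if b is non-empty and every clause in b has the given occur.
    static boolean isPure(Bool b, Occur occur) {
        if (b.clauses.isEmpty()) return false;
        for (Clause c : b.clauses) {
            if (c.occur != occur) return false;
        }
        return true;
    }

    // Returns a new tree in which a nested Bool is spliced into its parent
    // when all of its clauses share the occur of the clause that holds it.
    // Children are flattened first, so deeper cascades collapse as well.
    static Node flatten(Node n) {
        if (!(n instanceof Bool)) return n;
        Bool out = new Bool();
        for (Clause c : ((Bool) n).clauses) {
            Node child = flatten(c.node);
            if (child instanceof Bool && isPure((Bool) child, c.occur)) {
                out.clauses.addAll(((Bool) child).clauses);
            } else {
                out.add(c.occur, child);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Node q = new Bool()
            .add(Occur.MUST, new Term("science"))
            .add(Occur.MUST, new Bool()
                .add(Occur.MUST, new Term("engineering"))
                .add(Occur.MUST, new Term("technology")));
        System.out.println(q);          // +science +(+engineering +technology)
        System.out.println(flatten(q)); // +science +engineering +technology
    }
}
```

The sketch deliberately leaves a nested group alone when its clauses do not all share the occur of the enclosing clause (for example a required group of optional terms), since splicing those would change which documents match.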
