On Fri, Mar 7, 2014 at 9:01 PM, Da Huang <[email protected]> wrote:
> Hello, everyone,
>
> My name is Da Huang. I'm studying for my master's degree in Computer Science
> at Peking University. I have been using Lucene for about half a year. It's
> so elegant that I hope to have a chance to contribute some code to it.

Welcome!

> Therefore, I have been scanning the JIRA GSoC 2014 Ideas page about Lucene for
> several days. I find "LUCENE-3333: Specialize DisjunctionScorer if all
> clauses are TermQueries" the most suitable for me. I have spent some time
> scanning the relevant code and the issue "LUCENE-3328", from which
> "LUCENE-3333" was spun off. The following questions confuse me.
>
> 1) I have checked out the code from
> "http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk", but I
> couldn't find the relevant code of the fixed issue "LUCENE-3328". It seems
> that the patch attached on the page is not on the trunk. Why?

Well, some time after LUCENE-3328, we made further changes and
discovered that this specialized scorer was not in fact [that
much?] faster.  I forget which issue removed it... but you could
probably find it with some svn archaeology.

Net/net the trend in Lucene has been against adding source code
specialization, since this is really code duplication to try to make
hotspot's life easier.  Unfortunately, it does sometimes work; e.g. see
http://blog.mikemccandless.com/2013/06/screaming-fast-lucene-searches-using-c.html
though that's not a fair comparison since it was also a different
programming language!  So, while it's a nice tradeoff for performance,
it's a poor tradeoff for ongoing code management.  See all the
specialized collectors we have in TopFieldCollector!

So I'm not sure at this point if we should even pursue LUCENE-3333.
There are however tons of other things to fix on the search side;
maybe we could craft a good GSoC project from something else; e.g.:

  - we should pass a needsScorers boolean up-front to Weight.scorer
  - disjunctions now score during matching

  - BooleanScorer should sometimes be used for MUST clauses

  - We sort of duplicate code across BooleanQuery, FilteredQuery,
BooleanFilter, TermsFilter

  - Somehow, Filter and Query should be more "combined"; e.g. you
should be able to add a Filter as a clause onto a BooleanQuery

  - "Post filtering" is too hard to use today

  - ...

> 2) My intuitive idea for solving this issue is to make a class
> "DisjunctionTermScorer" to handle the case where all clauses are
> TermQueries, and then decide whether to use DisjunctionTermScorer in the
> 'scorer' method of class BooleanQuery. Is this idea right?

Yes, this would be the right idea.
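
As a rough illustration of that dispatch, here is a minimal sketch. It
uses toy stand-in types, not Lucene's real API: Occur, Clause,
ScorerChooser and the returned scorer names are all hypothetical and
exist only to show the "are all clauses optional TermQueries?" check.

```java
import java.util.List;

// Toy stand-ins for illustration only; these are NOT Lucene's classes.
enum Occur { MUST, SHOULD }

final class Clause {
    final boolean isTermQuery; // true if the clause wraps a plain term query
    final Occur occur;

    Clause(boolean isTermQuery, Occur occur) {
        this.isTermQuery = isTermQuery;
        this.occur = occur;
    }
}

final class ScorerChooser {
    // The specialized path applies only when every clause is an optional
    // (SHOULD) term query, i.e. a pure disjunction of terms.
    static boolean canSpecialize(List<Clause> clauses) {
        if (clauses.isEmpty()) {
            return false;
        }
        for (Clause c : clauses) {
            if (!c.isTermQuery || c.occur != Occur.SHOULD) {
                return false;
            }
        }
        return true;
    }

    // Returns the name of the scorer that would be picked (hypothetical names).
    static String choose(List<Clause> clauses) {
        return canSpecialize(clauses) ? "DisjunctionTermScorer" : "GenericScorer";
    }
}
```

Any clause that is not a term query, or that is required, falls back to
the generic path in this sketch.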

> Above are my questions about "LUCENE-3333". Besides, I would like to propose
> the following issue, which is about the QueryParser.
>
> When we use QueryParser to parse a query string like "science AND
> (engineering AND technology)", the generated query is "+science
> (+engineering +technology)". I think it would be more efficient for
> searching if the final query were "+science +engineering +technology". My
> idea is to flatten cascaded ANDs and cascaded ORs. Do you agree? I hope I
> have made my idea clear.

I think this would make tons of sense; the only "challenge" is that
this will change how scores are computed, when coord is enabled.  I'm
not sure how much that'd matter in practice; if it is important to
preserve that, then maybe we could still make a single 3-clause
BooleanQuery, but somehow remember the original structure for the sake
of coord scoring ... not sure.
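
To make the flattening idea concrete, here is a minimal sketch on a toy
query model. Node, Term, AndQuery and Flattener are hypothetical names
for illustration, not Lucene classes, and the sketch only handles
all-required (AND) clauses; it ignores scoring entirely, so the coord
question above is not addressed here.

```java
import java.util.ArrayList;
import java.util.List;

// Toy query model for illustration only; these are NOT Lucene's classes.
interface Node {}

final class Term implements Node {
    final String text;

    Term(String text) { this.text = text; }

    @Override public String toString() { return text; }
}

// A boolean query in which every clause is required (AND semantics).
final class AndQuery implements Node {
    final List<Node> clauses = new ArrayList<>();

    AndQuery add(Node n) {
        clauses.add(n);
        return this;
    }

    @Override public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Node n : clauses) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('+');
            if (n instanceof AndQuery) {
                sb.append('(').append(n).append(')');
            } else {
                sb.append(n);
            }
        }
        return sb.toString();
    }
}

final class Flattener {
    // Recursively pulls the clauses of nested AND queries up into the parent.
    static AndQuery flatten(AndQuery q) {
        AndQuery out = new AndQuery();
        for (Node n : q.clauses) {
            if (n instanceof AndQuery) {
                out.clauses.addAll(flatten((AndQuery) n).clauses);
            } else {
                out.add(n);
            }
        }
        return out;
    }
}
```

Flattening cascaded ORs would follow the same pattern; mixing required
and optional clauses would need extra care that this sketch leaves out.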

Mike McCandless

http://blog.mikemccandless.com
