[ https://issues.apache.org/jira/browse/LUCENE-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555890#comment-16555890 ]
Adrien Grand commented on LUCENE-8060:
--------------------------------------

Our current hit count estimations are terrible. I don't think any user would want to rely on them or even display them in a UI.

The problem is that hit counts are useful from a UI perspective: for pagination, for instance, or to improve the user experience by giving users a sense of how many matches there are, which builds confidence in the search engine by showing that a lot of content matches their query.

I think an ok trade-off that would address both of these use-cases would be to only count up to a certain hit count. For instance, if you allow users to paginate up to page 10 and have 20 hits per page, you only need to count up to 200 hits to know how many pages to display. Similarly, if your end goal is only to show users that you have lots of content, you could count up to e.g. 10,000 matches and show something like "more than 10,000 hits" in the UI if that number is reached. In both cases, this should help keep the counting overhead contained so that it doesn't end up being the bottleneck of query processing.

I believe both TopScoreDocCollector and TopFieldCollector could easily be changed to replace `boolean trackTotalHits` with something like `int maxTotalHits`, and we would stop counting after visiting maxTotalHits documents.

Regarding integration in IndexSearcher, I am thinking of 3 ideas:
 - hardcode a value for this parameter, maybe 10,000, and rename TopDocs.totalHits to make sure users get a compile error
 - add a parameter to the search() methods to require users to pass a maxTotalHits
 - add a required constructor argument to IndexSearcher that would affect all search() methods

We could also make the top docs collectors just compute a ScoreDoc[] (i.e. no total hits) and require users to compute the hit count separately, but I'm concerned that it would make simple usage of Lucene harder.

Opinions?
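
To make the capped counting idea concrete, here is a minimal sketch in plain Java. It deliberately does not use any Lucene classes: `BoundedHitCount`, `countUpTo`, `pagesToDisplay` and `describe` are hypothetical names chosen for illustration, and the iterator of doc IDs is only a stand-in for whatever documents a collector would visit. It is a sketch of the trade-off under those assumptions, not how the collectors are or would be implemented.

```java
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

public class BoundedHitCount {

    /** Outcome of a capped count: how many hits were counted and whether the cap was reached. */
    public static final class Result {
        public final int count;
        public final boolean capReached;

        Result(int count, boolean capReached) {
            this.count = count;
            this.capReached = capReached;
        }
    }

    /** Counts matching docs, but stops visiting them once maxTotalHits have been counted. */
    public static Result countUpTo(PrimitiveIterator.OfInt matchingDocs, int maxTotalHits) {
        int count = 0;
        while (count < maxTotalHits && matchingDocs.hasNext()) {
            matchingDocs.nextInt(); // "visit" the document
            count++;
        }
        // Once the cap is reached, the count is only a lower bound on the real total.
        return new Result(count, count >= maxTotalHits);
    }

    /** Number of result pages needed for the counted hits (rounded up). */
    public static int pagesToDisplay(int countedHits, int hitsPerPage) {
        return (countedHits + hitsPerPage - 1) / hitsPerPage;
    }

    /** UI label in the spirit of "more than 10,000 hits" (strictly, "at least" that many). */
    public static String describe(Result r) {
        return r.capReached ? "more than " + r.count + " hits" : r.count + " hits";
    }

    public static void main(String[] args) {
        int hitsPerPage = 20;
        int maxPages = 10;

        // Pagination: counting up to 10 * 20 = 200 hits is enough to know the page count.
        Result paged = countUpTo(IntStream.range(0, 1_000_000).iterator(), hitsPerPage * maxPages);
        System.out.println(pagesToDisplay(paged.count, hitsPerPage) + " pages");

        // "Lots of content": stop at 10,000 and report the count as a lower bound.
        Result capped = countUpTo(IntStream.range(0, 50_000).iterator(), 10_000);
        System.out.println(describe(capped));
    }
}
```

The point of the sketch is that the work done is bounded by `maxTotalHits` rather than by the number of matches, which is what would keep counting from becoming the bottleneck.
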
> Require users to tell us whether they need total hit counts
> -----------------------------------------------------------
>
>                 Key: LUCENE-8060
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8060
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>             Fix For: master (8.0)
>
>
> We are getting optimizations when hit counts are not required (sorted
> indexes, MAXSCORE, short-circuiting of phrase queries) but our users won't
> benefit from them unless we disable exact hit counts by default or we require
> them to tell us whether hit counts are required.
> I think making hit counts approximate by default is going to be a bit trappy,
> so I'm rather leaning towards requiring users to tell us explicitly whether
> they need total hit counts. I can think of two ways to do that: either by
> passing a boolean to the IndexSearcher constructor or by adding a boolean to
> all methods that produce TopDocs instances. I like the latter better but I'm
> open to discussion or other ideas?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)