[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695853#action_12695853
]
Shai Erera commented on LUCENE-1575:
------------------------------------
Mike - I've been trying to see this performance hit, without indexing the full
wikipedia content, but failed. What I did is:
* Create an index with 10M documents with one term (all with the same term).
The index had 51 segments (I deliberately aimed at a high number).
* Run a sort-by-doc (in reverse order so that collect() will actually do some
work) search using TermQuery (so that TermScorer will be used).
* Measure the avg. time of 20 query executions (I deliberately omit the first
execution because I've noticed that the first run sometimes take much longer,
even after I kill the JVM. I guess it has to do with OS caches).
The time I measure between the trunk and the current version are very much
comparable:
trunk-patched: (1) 1248.45 ms (2) 1255.45 ms
trunk-orig: (1) 1314.1 ms (2) 1265.65 ms
In both runs the patched trunk ran faster. Both did not have any large
differences.
Here's the code:
{code}
public static void main(String[] args) throws Exception {
File indexDir = new File(args[0]);
Directory dir = new FSDirectory(indexDir, new
NativeFSLockFactory(indexDir));
if (!indexDir.exists() || indexDir.list().length == 0) {
int numDocs = 10000000;
IndexWriter writer = new IndexWriter(dir, null,
MaxFieldLength.UNLIMITED);
writer.setMergeFactor(50);
writer.setMaxBufferedDocs(numDocs/100);
System.out.println("populating index");
long time = System.currentTimeMillis();
for (int i = 0; i < numDocs; i++) {
Document doc = new Document();
doc.add(new Field("c", "test", Store.NO,
Index.NOT_ANALYZED));
writer.addDocument(doc);
}
writer.close(false);
System.out.println("time=" +
(System.currentTimeMillis() - time));
}
System.out.println("searching");
IndexSearcher searcher = new IndexSearcher(dir);
System.out.println("numSegments=" +
searcher.getIndexReader().getSequentialSubReaders().length);
Sort sort = new Sort(new SortField(null, SortField.DOC, true));
searcher.search(new TermQuery(new Term("c", "test")), null, 10,
sort);
long time = System.currentTimeMillis();
double numQueries = 20;
for (int i = 0; i < numQueries; i++) {
searcher.search(new TermQuery(new Term("c", "test")),
null, 10, sort);
}
time = System.currentTimeMillis() - time;
System.out.println("avg. time=" + (time / numQueries) + " ms");
searcher.close();
}
{code}
Can you try running this code on your machine?
> Refactoring Lucene collectors (HitCollector and extensions)
> -----------------------------------------------------------
>
> Key: LUCENE-1575
> URL: https://issues.apache.org/jira/browse/LUCENE-1575
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Shai Erera
> Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
> LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch,
> LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.patch,
> LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, sortCollate5.py
>
>
> This issue is a result of a recent discussion we've had on the mailing list.
> You can read the thread
> [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
> We have agreed to do the following refactoring:
> * Rename MultiReaderHitCollector to Collector, with the purpose that it will
> be the base class for all Collector implementations.
> * Deprecate HitCollector in favor of the new Collector.
> * Introduce new methods in IndexSearcher that accept Collector, and deprecate
> those that accept HitCollector.
> ** Create a final class HitCollectorWrapper, and use it in the deprecated
> methods in IndexSearcher, wrapping the given HitCollector.
> ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0,
> when we remove HitCollector.
> ** It will remove any instanceof checks that currently exist in IndexSearcher
> code.
> * Create a new (abstract) TopDocsCollector, which will:
> ** Leave collect and setNextReader unimplemented.
> ** Introduce protected members PriorityQueue and totalHits.
> ** Introduce a single protected constructor which accepts a PriorityQueue.
> ** Implement topDocs() and getTotalHits() using the PQ and totalHits members.
> These can be used as-are by extending classes, as well as be overridden.
> ** Introduce a new topDocs(start, howMany) method which will be used a
> convenience method when implementing a search application which allows paging
> through search results. It will also attempt to improve the memory
> allocation, by allocating a ScoreDoc[] of the requested size only.
> * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs()
> and getTotalHits() implementations as they are from TopDocsCollector. The
> class will also be made final.
> * Change TopFieldCollector to extend TopDocsCollector, and make the class
> final. Also implement topDocs(start, howMany).
> * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector,
> instead of TopScoreDocCollector. Implement topDocs(start, howMany)
> * Review other places where HitCollector is used, such as in Scorer,
> deprecate those places and use Collector instead.
> Additionally, the following proposal was made w.r.t. decoupling score from
> collect():
> * Change collect to accecpt only a doc Id (unbased).
> * Introduce a setScorer(Scorer) method.
> * If during collect the implementation needs the score, it can call
> scorer.score().
> If we do this, then we need to review all places in the code where
> collect(doc, score) is called, and assert whether Scorer can be passed. Also
> this raises few questions:
> * What if during collect() Scorer is null? (i.e., not set) - is it even
> possible?
> * I noticed that many (if not all) of the collect() implementations discard
> the document if its score is not greater than 0. Doesn't it mean that score
> is needed in collect() always?
> Open issues:
> * The name for Collector
> * TopDocsCollector was mentioned on the thread as TopResultsCollector, but
> that was when we thought to call Colletor ResultsColletor. Since we decided
> (so far) on Collector, I think TopDocsCollector makes sense, because of its
> TopDocs output.
> * Decoupling score from collect().
> I will post a patch a bit later, as this is expected to be a very large
> patch. I will split it into 2: (1) code patch (2) test cases (moving to use
> Collector instead of HitCollector, as well as testing the new topDocs(start,
> howMany) method.
> There might be even a 3rd patch which handles the setScorer thing in
> Collector (maybe even a different issue?)
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]