[
https://issues.apache.org/jira/browse/LUCENE-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945088#comment-14945088
]
Adrien Grand commented on LUCENE-6828:
--------------------------------------
Since you opened this issue, I'm wondering if you have more information about
the use-case of users who request such large pages? I suspect that some of the
users who execute such requests are in fact trying to export a subset of their
indexes, in which case they don't even need sorted results, so no priority
queue at all. And I'd be curious to understand more about the other use-cases.
Also, since you're playing with priority queues at the moment: I remember
getting better sorting results with a ternary heap than with a regular binary
heap, I assume because of its better cache efficiency in spite of a worse
theoretical comparison count. Some people have also experimented with making
priority queues more cache-efficient, e.g.
http://playfulprogramming.blogspot.it/2015/08/cache-optimizing-priority-queue.html
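The ternary-heap idea above can be sketched as follows. This is a hypothetical
minimal 3-ary min-heap of primitive longs, not Lucene's PriorityQueue API; the
class and method names are illustrative. A ternary heap is shallower than a
binary heap, so sift operations touch fewer scattered cache lines, at the cost
of up to three comparisons per level:

```java
// Hypothetical sketch: a minimal ternary (3-ary) min-heap of longs.
// Names are illustrative and not part of Lucene.
final class TernaryHeap {
    private final long[] heap;
    private int size;

    TernaryHeap(int capacity) {
        heap = new long[capacity];
    }

    void push(long value) {
        int i = size++;
        // Sift up: the parent of node i is (i - 1) / 3.
        while (i > 0) {
            int parent = (i - 1) / 3;
            if (heap[parent] <= value) break;
            heap[i] = heap[parent];
            i = parent;
        }
        heap[i] = value;
    }

    long pop() {
        long top = heap[0];
        long last = heap[--size];
        int i = 0;
        // Sift down: the children of node i are 3*i+1 .. 3*i+3. The tree is
        // about one third shallower than a binary heap, so fewer levels are
        // visited, but up to three comparisons are made per level.
        while (true) {
            int first = 3 * i + 1;
            if (first >= size) break;
            int smallest = first;
            int limit = Math.min(first + 3, size);
            for (int c = first + 1; c < limit; c++) {
                if (heap[c] < heap[smallest]) smallest = c;
            }
            if (heap[smallest] >= last) break;
            heap[i] = heap[smallest];
            i = smallest;
        }
        heap[i] = last;
        return top;
    }

    int size() {
        return size;
    }
}
```

The same structure generalizes to any branching factor d (parent at
(i - 1) / d, children at d*i+1 .. d*i+d); the cache-optimization link above
explores layouts beyond simply increasing d.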
> Speed up requests for many rows
> -------------------------------
>
> Key: LUCENE-6828
> URL: https://issues.apache.org/jira/browse/LUCENE-6828
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: 4.10.4, 5.3
> Reporter: Toke Eskildsen
> Priority: Minor
> Labels: memory, performance
>
> Standard relevance-ranked searches for top-X results use the HitQueue class
> to keep track of the highest-scoring documents. The HitQueue is a binary heap
> of ScoreDocs and is pre-filled with sentinel objects upon creation.
> Binary heaps of Objects in Java do not scale well: the HitQueue uses 28
> bytes/element, and memory access is scattered due to the binary-heap
> algorithm and the use of Objects. To make matters worse, the use of sentinel
> objects means that even if only a tiny number of documents match, the full
> number of Objects is still allocated.
> As long as the HitQueue is small (< 1000), it performs very well. If top-1M
> results are requested, it performs poorly and leaves 1M ScoreDocs to be
> garbage collected.
> An alternative is to replace the ScoreDocs with a single array of packed
> longs, each long holding the score and the document ID. This strategy
> requires only 8 bytes/element and is a lot lighter on the GC.
> Some preliminary tests have been done and published at
> https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/
> These indicate that a long[]-backed implementation is at least 3x faster than
> the vanilla HitQueue for top-1M requests.
> For smaller requests, such as top-10, the packed version also seems
> competitive when the number of matched documents exceeds 1M. This needs to
> be investigated further.
> Going forward with this idea requires some refactoring, as Lucene is
> currently hardwired to the abstract PriorityQueue class. Before attempting
> this, it seems prudent to discuss whether speeding up large top-X requests
> has any value. Paging seems an obvious contender for requesting large result
> sets, but I guess the two could work in tandem, opening the door to
> efficient large pages.
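The packed-long encoding described in the quoted issue can be sketched like
this. It is a hypothetical illustration, not the actual patch: the score goes
in the upper 32 bits and the doc ID in the lower 32, so sorting the raw longs
orders hits by score first, then by doc ID. It assumes non-negative scores,
for which Float.floatToIntBits is order-preserving:

```java
import java.util.Arrays;

// Hypothetical sketch of the long[]-backed hit encoding; class and method
// names are illustrative, not Lucene APIs.
final class PackedHits {
    // Pack a non-negative score and a doc ID into one long: score bits in the
    // upper half, doc ID in the lower half. For non-negative floats,
    // Float.floatToIntBits preserves numeric ordering, so comparing the
    // packed longs compares scores first and breaks ties by doc ID.
    static long pack(float score, int docId) {
        return ((long) Float.floatToIntBits(score) << 32) | (docId & 0xFFFFFFFFL);
    }

    static float score(long packed) {
        return Float.intBitsToFloat((int) (packed >>> 32));
    }

    static int docId(long packed) {
        return (int) packed;
    }

    public static void main(String[] args) {
        long[] hits = { pack(0.3f, 2), pack(0.9f, 7), pack(0.5f, 1) };
        Arrays.sort(hits); // ascending, so the best hit ends up last
        long best = hits[hits.length - 1];
        System.out.println(docId(best) + " " + score(best)); // prints "7 0.9"
    }
}
```

At 8 bytes/element in a flat primitive array, this avoids both the per-object
overhead of ScoreDocs and the pointer-chasing of an Object heap, which is
where the claimed GC and cache benefits come from.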
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)