[
https://issues.apache.org/jira/browse/LUCENE-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077751#comment-14077751
]
Michael McCandless commented on LUCENE-5854:
--------------------------------------------
Wow, this would be a nice speedup :)
> BytesRefHash can be simplified (and made faster)
> ------------------------------------------------
>
> Key: LUCENE-5854
> URL: https://issues.apache.org/jira/browse/LUCENE-5854
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Dawid Weiss
> Priority: Minor
> Fix For: 5.0
>
>
> Currently BytesRefHash stores the length of each byte sequence as either one
> or two bytes inside the byte pool. This is redundant: it slows down the add
> operation and increases the required memory.
> Logically, what BytesRefHash does is assign linear IDs (0..n) to each unique
> byte sequence on input. So what's really needed are two data structures:
> * a compact list of byte sequences (indexed 0...n),
> * a hash table lookup of hash(byteSequence) -> ID.
> The first item is already implemented efficiently in BytesRefArray. Note that
> the length of each byte sequence is stored implicitly, as the difference
> between the next sequence's start offset and the current sequence's start
> offset (clever!). This doesn't allow for removals, but it saves on encoding
> and representation.
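> As an illustration, here is a minimal, hypothetical sketch of that
> implicit-length trick (not the actual BytesRefArray code; the class and field
> names are made up):
> {code}
> // Each appended sequence only records its start offset in a growing byte pool;
> // the length of entry i is recovered as offsets[i + 1] - offsets[i], so no
> // per-entry length bytes are ever written.
> class ByteSequenceList {
>   private byte[] pool = new byte[16];
>   private int[] offsets = new int[] { 0 }; // offsets[size] is always the current pool end
>   private int size = 0;
>
>   int add(byte[] bytes, int off, int len) {
>     int start = offsets[size];
>     if (start + len > pool.length) {
>       pool = java.util.Arrays.copyOf(pool, Math.max(pool.length * 2, start + len));
>     }
>     System.arraycopy(bytes, off, pool, start, len);
>     if (size + 1 == offsets.length) {
>       offsets = java.util.Arrays.copyOf(offsets, offsets.length * 2);
>     }
>     offsets[++size] = start + len; // the next start offset doubles as this entry's end
>     return size - 1;               // the linear id assigned to the new sequence
>   }
>
>   int length(int id) {
>     return offsets[id + 1] - offsets[id]; // length is implicit in the offset difference
>   }
> }
> {code}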
> The second bullet point above is trivial: a linear-probing hash table holding
> IDs, with -1 indicating an empty slot.
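> To make the empty-slot convention concrete, a lookup-only counterpart to the
> put() shown further down could look roughly like this (same hypothetical
> fields: ordByHash, mask, blocks, EMPTY_SLOT):
> {code}
> // Probe linearly from hash(ref) & mask until we either find the id of an
> // equal sequence or hit an empty slot (EMPTY_SLOT == -1).
> int find(ByteBlockRef ref) {
>   int slot = hash(ref) & mask;
>   for (int id = ordByHash[slot]; id != EMPTY_SLOT; id = ordByHash[slot]) {
>     if (blocks.compareTo(id, ref) == 0) {
>       return id;   // present: return its linear id
>     }
>     slot = (slot + 1) & mask;
>   }
>   return -1;       // not in the set
> }
> {code}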
> I have a clean-room implementation of the above (based on HPPC data
> structures, though) and it does show some performance improvement (on
> simplistic randomized-data benchmarks).
> {code}
> BenchmarkByteBlockOrdSet.testBytesRefHash: [measured 5 out of 7 rounds, threads: 1 (sequential)]
>  round: 2.62 [+- 0.03], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 5, GC.time: 0.20, time.total: 18.43, time.warmup: 5.35, time.bench: 13.08
> BenchmarkByteBlockOrdSet.testByteBlockOrdSet: [measured 5 out of 7 rounds, threads: 1 (sequential)]
>  round: 1.91 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 8, GC.time: 0.20, time.total: 13.37, time.warmup: 3.83, time.bench: 9.54
> {code}
> But I think the reason it'd be worth looking at this in Lucene is making
> BytesRefHash simpler to understand. For example, the put operation in my code
> looks like this:
> {code}
> public int put(ByteBlockRef ref) {
>   assert ref != null;
>   // Linear probing: scan from the hashed slot until an empty slot or an equal entry.
>   int slot = hash(ref) & mask;
>   for (int id = ordByHash[slot]; id != EMPTY_SLOT; id = ordByHash[slot]) {
>     if (blocks.compareTo(id, ref) == 0) {
>       return id; // already present: return the existing id
>     }
>     slot = (slot + 1) & mask;
>   }
>   // Not found: the new id is simply the next position in the sequence list.
>   int newId = ordByHash[slot] = blocks.size();
>   blocks.add(ref);
>   if (size() == resizeAt) {
>     grow();
>   }
>   return -newId - 1; // negative return signals a newly added entry
> }
> {code}
> and get is simply delegation to the list of byte sequences:
> {code}
> public ByteBlockRef get(int index, ByteBlockRef ref) {
>   return blocks.get(index, ref);
> }
> {code}
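> For completeness, a hypothetical caller would decode put()'s return value like
> this (the set/ref names are made up; only the (-id - 1) convention comes from
> the code above):
> {code}
> int result = set.put(ref);
> boolean wasNew = result < 0;                    // negative means the sequence was just added
> int id = wasNew ? -result - 1 : result;         // the linear id in either case
> ByteBlockRef stored = set.get(id, new ByteBlockRef()); // get() fills and returns the scratch ref
> {code}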
> What makes this refactoring slightly more complicated is that there is a fair
> bit of hairy stuff in BytesRefHash that the person doing the refactoring would
> have to look at. The craziest parts, to me, are:
> * {{private void rehash(final int newSize, boolean hashOnData)}} -> when
> hashOnData is true this class becomes insane :) (a rough, simplified
> grow/rehash sketch follows below)
> * shrinking is not really used anyway; neither are element removals (and
> they're not really needed, I think).
> * compacting and in-memory sorting should be moved outside of this class and
> placed somewhere else. You could still grab the internal storage buffers, but
> do the compacting/in-memory sorting logic somewhere else and thus not leave
> the BytesRefHash object in an invalid/crazy state.
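> For what it's worth, in the simplified design the grow() called from put()
> could be as dumb as re-probing every stored id into a larger table. This is an
> assumed sketch, not BytesRefHash's actual rehash(newSize, hashOnData);
> blocks.hashCode(id) is a hypothetical way to rehash a stored sequence:
> {code}
> private void grow() {
>   int newCapacity = ordByHash.length << 1;
>   int[] newTable = new int[newCapacity];
>   java.util.Arrays.fill(newTable, EMPTY_SLOT);
>   int newMask = newCapacity - 1;
>   for (int id = 0; id < blocks.size(); id++) {
>     int slot = blocks.hashCode(id) & newMask;   // rehash the stored bytes of entry id
>     while (newTable[slot] != EMPTY_SLOT) {
>       slot = (slot + 1) & newMask;              // linear probing, same as put()
>     }
>     newTable[slot] = id;
>   }
>   ordByHash = newTable;
>   mask = newMask;
>   resizeAt = (int) (newCapacity * 0.75f);       // assumed load factor
> }
> {code}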