[ 
https://issues.apache.org/jira/browse/LUCENE-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077751#comment-14077751
 ] 

Michael McCandless commented on LUCENE-5854:
--------------------------------------------

Wow, this would be a nice speedup :)

> BytesRefHash can be simplified (and made faster)
> ------------------------------------------------
>
>                 Key: LUCENE-5854
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5854
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Priority: Minor
>             Fix For: 5.0
>
>
> Currently BytesRefHash stores the length of each byte sequence as either one 
> or two bytes inside the byte pool. This is redundant (slows down add 
> operation and increases the required memory).
> Logically, what BytesRefHash does is assign linear IDs (0..n) to each unique 
> byte sequence on input. So what's really needed are two data structures:
> * a compact list of byte sequences (indexed 0...n),
> * a hash table lookup of hash(byteSequence) -> ID.
> The first item is already implemented efficiently in BytesRefArray. Note that 
> the length of each byte sequence is implicitly stored as the difference in 
> starting offsets between the next sequence's start offset and the current 
> sequence (clever!). This doesn't allow for removals, but saves on encoding 
> and representation.
> The second bullet point above is trivial (linear hash table of IDs or -1 
> indicating empty slots).
> I have a clear-room implementation of the above (based on HPPC data 
> structures though) and it does show some performance improvement (on 
> simplistic randomized data benchmarks).
> {code}
> BenchmarkByteBlockOrdSet.testBytesRefHash: [measured 5 out of 7 rounds, 
> threads: 1 (sequential)]
>  round: 2.62 [+- 0.03], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 
> 0.00], GC.calls: 5, GC.time: 0.20, time.total: 18.43, time.warmup: 5.35, 
> time.bench: 13.08
> BenchmarkByteBlockOrdSet.testByteBlockOrdSet: [measured 5 out of 7 rounds, 
> threads: 1 (sequential)]
>  round: 1.91 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 
> 0.00], GC.calls: 8, GC.time: 0.20, time.total: 13.37, time.warmup: 3.83, 
> time.bench: 9.54
> {code}
> But I think the reason it'd be worth looking at this in Lucene is making 
> BytesRefHash simpler to understand. For example, put operation in my code 
> looks like this:
> {code}
>   public int put(ByteBlockRef ref) {
>     assert ref != null; 
>     int slot = hash(ref) & mask;
>     for (int id = ordByHash[slot];
>              id != EMPTY_SLOT;
>              id = ordByHash[slot]) {
>       if (blocks.compareTo(id, ref) == 0) {
>         return id;
>       }
>       slot = (slot + 1) & mask;
>     }
>     int newId = ordByHash[slot] = blocks.size();
>     blocks.add(ref);
>     if (size() == resizeAt) {
>       grow();
>     }
>     return -newId - 1;
>   }
> {code}
> and get is simply delegation to the list of byte sequences:
> {code}
>   public ByteBlockRef get(int index, ByteBlockRef ref) {
>     return blocks.get(index, ref);
>   }
> {code}
> What makes this refactoring slightly more complicated is that there is a fair 
> bit of hairy stuff in BytesRefHash that the person doing the refactoring 
> would have to look at. The craziest parts are, to me:
> {code}
>   private void rehash(final int newSize, boolean hashOnData)
> -> when hashOnData is true this class becomes insane :)
>   shrinking is not really used anyway; neither are element removals (and 
> they're not really needed I think).
>   compacting and in-memory sorting should be moved outside of this class and 
> placed somewhere else. You could still grab the internal storage buffers, but 
> do the compacting/ in-memory sorting logic somewhere else and thus not leave 
> the BytesRefHash object in an invalid/ crazy state.
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to