Dawid Weiss created LUCENE-5854:
-----------------------------------

             Summary: BytesRefHash can be simplified (and made faster)
                 Key: LUCENE-5854
                 URL: https://issues.apache.org/jira/browse/LUCENE-5854
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Dawid Weiss
            Priority: Minor
             Fix For: 5.0


Currently BytesRefHash stores the length of each byte sequence as either one or 
two bytes inside the byte pool. This is redundant (slows down add operation and 
increases the required memory).

Logically, what BytesRefHash does is assign linear IDs (0..n) to each unique 
byte sequence on input. So what's really needed are two data structures:

* a compact list of byte sequences (indexed 0...n),
* a hash table lookup of hash(byteSequence) -> ID.

The first item is already implemented efficiently in BytesRefArray. Note that 
the length of each byte sequence is implicitly stored as the difference in 
starting offsets between the next sequence's start offset and the current 
sequence (clever!). This doesn't allow for removals, but saves on encoding and 
representation.

The second bullet point above is trivial (linear hash table of IDs or -1 
indicating empty slots).

I have a clear-room implementation of the above (based on HPPC data structures 
though) and it does show some performance improvement (on simplistic randomized 
data benchmarks).
{code}
BenchmarkByteBlockOrdSet.testBytesRefHash: [measured 5 out of 7 rounds, 
threads: 1 (sequential)]
 round: 2.62 [+- 0.03], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 5, GC.time: 0.20, time.total: 18.43, time.warmup: 5.35, time.bench: 
13.08
BenchmarkByteBlockOrdSet.testByteBlockOrdSet: [measured 5 out of 7 rounds, 
threads: 1 (sequential)]
 round: 1.91 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 8, GC.time: 0.20, time.total: 13.37, time.warmup: 3.83, time.bench: 
9.54
{code}

But I think the reason it'd be worth looking at this in Lucene is making 
BytesRefHash simpler to understand. For example, put operation in my code looks 
like this:

{code}
  public int put(ByteBlockRef ref) {
    assert ref != null; 

    int slot = hash(ref) & mask;
    for (int id = ordByHash[slot];
             id != EMPTY_SLOT;
             id = ordByHash[slot]) {
      if (blocks.compareTo(id, ref) == 0) {
        return id;
      }
      slot = (slot + 1) & mask;
    }

    int newId = ordByHash[slot] = blocks.size();
    blocks.add(ref);

    if (size() == resizeAt) {
      grow();
    }

    return -newId - 1;
  }
{code}

and get is simply delegation to the list of byte sequences:
{code}
  public ByteBlockRef get(int index, ByteBlockRef ref) {
    return blocks.get(index, ref);
  }
{code}

What makes this refactoring slightly more complicated is that there is a fair 
bit of hairy stuff in BytesRefHash that the person doing the refactoring would 
have to look at. The craziest parts are, to me:
{code}
  private void rehash(final int newSize, boolean hashOnData)
-> when hashOnData is true this class becomes insane :)

  shrinking is not really used anyway; neither are element removals (and 
they're not really needed I think).

  compacting and in-memory sorting should be moved outside of this class and 
placed somewhere else. You could still grab the internal storage buffers, but 
do the compacting/ in-memory sorting logic somewhere else and thus not leave 
the BytesRefHash object in an invalid/ crazy state.
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to