sammccall added a comment.

Very nice!

I think the data structures can be slightly tighter, and some of the 
implementation could be easier to follow. But this seems like a nice win.

Right-sizing the vectors seems like an important optimization.



================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:29
+      DecompressedChunk = ChunkIndex->decompress();
+      InnerIndex = begin(DecompressedChunk);
+    }
----------------
nit: we generally use members (DecompressedChunk.begin()) unless actually 
dealing with arrays or templates, since lookup rules are simpler
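i.e. for the lines above:
```
DecompressedChunk = ChunkIndex->decompress();
InnerIndex = DecompressedChunk.begin();
```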
 


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:39
            "Posting List iterator can't advance() at the end.");
-    ++Index;
+    if (++InnerIndex != end(DecompressedChunk))
+      return; // Return if current chunk is not exhausted.
----------------
nit: I think this might be clearer with the special/unlikely cases (hit end) 
inside the if:
```
if (++InnerIndex == DecompressedChunk.end()) { // end of chunk
  if (++ChunkIndex == Chunks.end()) // end of iterator
    return;
  DecompressedChunk = ChunkIndex->decompress();
  InnerIndex = DecompressedChunk.begin();
}
```
also I think the indirection via `reachedEnd()` mostly serves to confuse here, 
as the other lines deal with the data structures directly. It's not clear 
(without reading the implementation) what the behavior is when class invariants 
are violated.


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:58
+    // does.
+    if ((ChunkIndex != end(Chunks) - 1) && ((ChunkIndex + 1)->Head <= ID)) {
+      // Find the next chunk with Head >= ID.
----------------
this again puts the "normal case" (need to choose a chunk) inside the if(), 
instead of the exceptional case.

In order to write this more naturally, I think pulling out a private helper 
`advanceToChunk(DocID)` might be best here; you can early-return from there.
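Roughly (untested sketch reusing the existing member names, and folding in the 
corrected lower_bound call discussed below):
```
// Position ChunkIndex on the last chunk with Head <= ID, re-decompressing
// only if we actually moved to a different chunk.
void advanceToChunk(DocID ID) {
  if (ChunkIndex == Chunks.end() - 1 || (ChunkIndex + 1)->Head > ID)
    return; // ID (if present) is in the current chunk; nothing to do.
  ChunkIndex = std::lower_bound(
                   ChunkIndex + 1, Chunks.end(), ID,
                   [](const Chunk &C, const DocID D) { return C.Head <= D; }) -
               1;
  DecompressedChunk = ChunkIndex->decompress();
  InnerIndex = DecompressedChunk.begin();
}
```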


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:61
+      ChunkIndex = std::lower_bound(
+          ChunkIndex, end(Chunks), ID,
+          [](const Chunk &C, const DocID ID) { return C.Head < ID; });
----------------
ChunkIndex + 1? You've already eliminated the current chunk.


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:62
+          ChunkIndex, end(Chunks), ID,
+          [](const Chunk &C, const DocID ID) { return C.Head < ID; });
+      if (reachedEnd())
----------------
This seems unnecessarily two-step (found the chunk... or it could be the first
element of the next).
Understandably, because std::*_bound has such a silly API.

You want to find the *last* chunk such that Head <= ID.
So find the first one with Head > ID, and subtract one.

std::lower_bound returns the first element for which its predicate is false.

Therefore:
```
ChunkIndex = std::lower_bound(ChunkIndex, Chunks.end(), ID,
     [](const Chunk &C, const DocID D) { return C.Head <= D; }) - 1;
```


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:63
+          [](const Chunk &C, const DocID ID) { return C.Head < ID; });
+      if (reachedEnd())
+        return;
----------------
(again I'd avoid reachedEnd() here as you haven't reestablished invariants, so 
it's easier to just deal with the data structures)


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:76
+    // Return if the position was found in current chunk.
+    if (InnerIndex != std::end(DecompressedChunk))
+      return;
----------------
(this can become an assert)
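i.e. something like:
```
assert(InnerIndex != std::end(DecompressedChunk) &&
       "advanceTo() should always find a position within the current chunk.");
```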


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:115
+  // Iterator over chunks.
+  std::vector<Chunk>::const_iterator ChunkIndex;
+  std::vector<DocID> DecompressedChunk;
----------------
nit: please don't call these indexes if they're actually iterators: 
CurrentChunk seems fine


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:116
+  std::vector<Chunk>::const_iterator ChunkIndex;
+  std::vector<DocID> DecompressedChunk;
+  // Iterator over DecompressedChunk.
----------------
(again, SmallVector)


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:121
 
+/// Single-byte masks used for VByte compression bit manipulations.
+constexpr uint8_t SevenBytesMask = 0x7f;  // 0b01111111
----------------
move to the function where they're used


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:127
+/// Fills chunk with the maximum number of bits available.
+Chunk createChunk(DocID Head, std::queue<uint8_t> &Payload,
+                  size_t DocumentsCount, size_t MeaningfulBytes) {
----------------
What's the purpose of this? Why can't the caller just construct the Chunk 
themselves - what does the std::queue buy us?


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:140
+
+/// Byte offsets of Payload contents within DocID.
+const size_t Offsets[] = {0, 7, 7 * 2, 7 * 3, 7 * 4};
----------------
I don't understand this comment. Aren't these bit offsets of payload bytes 
within a DocID?


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:162
+/// Encoding  (raw number)  10000101  10110110 00101110
+std::vector<Chunk> encodeStream(llvm::ArrayRef<DocID> Documents) {
+  // Masks are used to perform bit manipulations over DocID. Each mask
----------------
This appears to be more complicated than necessary.
I'd suggest pulling out the following function, and seeing where it takes you:

```
// Writes a variable-length encoding of V into the buffer and updates (shrinks) it.
// If it doesn't fit, returns false and doesn't write to the buffer.
bool encodeVByte(uint32_t V, MutableArrayRef<uint8_t> &Buf);
```

Personally I find the no-loop implementation much easier to read, just:
```
if (V < (1<<7)) {
  if (Buf.size() < 1)
    return false;
  Buf[0] = V;
  Buf = Buf.drop_front(1);
  return true;
}
// and 4 more cases
```
but up to you. Please do try to find a way to reduce the number of constants 
(masks, limits, offsets, *BytesMask...) if you keep the loop.
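
FWIW, here's one way the loop version could look with only the 7/0x7f/0x80 
constants (untested sketch; it assumes the high bit of a number's *last* byte 
is the END marker, as the decoder comment suggests, and writes 7-bit groups 
most-significant first, so adjust if the patch's conventions differ):
```
bool encodeVByte(uint32_t V, MutableArrayRef<uint8_t> &Buf) {
  // Count the 7-bit groups V needs.
  size_t Length = 1;
  for (uint32_t Rest = V >> 7; Rest != 0; Rest >>= 7)
    ++Length;
  if (Buf.size() < Length)
    return false; // Doesn't fit: leave Buf untouched.
  for (size_t I = 0; I < Length; ++I) {
    uint8_t Byte = (V >> (7 * (Length - I - 1))) & 0x7f;
    if (I == Length - 1)
      Byte |= 0x80; // Assumed END marker on the last byte of each number.
    Buf[I] = Byte;
  }
  Buf = Buf.drop_front(Length);
  return true;
}
```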


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:248
+    ++Length;
+    // Add document decoding to the result as soon as END marker is seen.
+    if ((Byte & ContinuationBit) != 0) {
----------------
the logical structure seems like a nested loop; I think this would be easier to 
follow:

```
for (Current = Head; have more bytes and not enough numbers; Current += delta) {
 delta = 0;
 continuation = true;
 while (continuation) {
  ...
 }
 Result.push_back(Current + delta);
}
```
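
Fleshed out, that could be something like (untested sketch, not the patch's 
code: it assumes most-significant-first 7-bit groups, a set high bit as the END 
marker, and the zero-padding termination suggested for PostingList.h:52):
```
std::vector<DocID> decodeChunk(DocID Head, llvm::ArrayRef<uint8_t> Payload) {
  std::vector<DocID> Result{Head};
  const uint8_t *Byte = Payload.begin();
  DocID Current = Head;
  while (Byte != Payload.end() && *Byte != 0) { // Outer loop: one delta each.
    DocID Delta = 0;
    bool Continuation = true;
    while (Continuation && Byte != Payload.end()) {
      Delta = (Delta << 7) | (*Byte & 0x7f); // Accumulate one 7-bit group.
      Continuation = (*Byte & 0x80) == 0;    // A set high bit ends the delta.
      ++Byte;
    }
    Current += Delta;
    Result.push_back(Current);
  }
  return Result;
}
```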


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:262
+             "Unterminated byte sequence at the end of input stream.");
+  return Result;
+}
----------------
here I think you're missing a memory optimization probably equal in size to the 
whole gains achieved by compression :-)

libstdc++ uses a 2x growth factor for std::vector, so we're probably wasting an 
extra 30% or so of ram (depending on size distribution, I forget the theory 
here).
We should shrink to fit. If we were truly desperate we'd iterate over all the 
numbers and presize the array, but we're probably not.

I think `return std::vector<DocID>(Result); // no move, shrink-to-fit` will 
shrink it as you want (note that `shrink_to_fit()` is usually a no-op :-\)


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:41
+/// decompressed upon request.
+struct Chunk {
+  // Keep sizeof(Chunk) == 32.
----------------
With the current implementation, this doesn't need to be in the header.
(the layout of vector<Chunk> doesn't depend on Chunk; you should just need to 
out-line the default destructor)

(using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not 
though. I'd either stick with std::vector, or measure)


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:42
+struct Chunk {
+  // Keep sizeof(Chunk) == 32.
+  static constexpr size_t PayloadSize = 32 - sizeof(DocID) - sizeof(uint8_t);
----------------
make this a static_assert below the class?
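i.e.:
```
static_assert(sizeof(Chunk) == 32, "Chunk should take exactly 32 bytes.");
```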


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:45
+
+  std::vector<DocID> decompress() const;
+
----------------
return SmallVector<DocID, PayloadSize + 1> to avoid allocations?


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:52
+  /// Number of elements encoded into Payload + 1.
+  uint8_t Size;
+};
----------------
This seems like a waste of a byte - ensure padding bytes are zeros, then you're 
done decoding once you hit a zero byte or the end of the chunk.
(Note that a zero byte encodes the integer zero, which is not a legal posting 
list delta)


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:59
+/// positively on the Index performance, current Dex implementation has a large
+/// performance gap compared to MemIndex which allows memory consumption
+/// reduction at the cost of some performance.
----------------
this isn't a good justification - the performance of MemIndex isn't really 
relevant.
"Compression saves memory at a small cost in access time, which is still fast 
enough in practice."


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:70
+  /// Returns in-memory size.
+  size_t bytes() const { return Chunks.size() * sizeof(Chunk); }
 
----------------
this should be Chunks.capacity() (see comment in other file)
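i.e.:
```
size_t bytes() const { return Chunks.capacity() * sizeof(Chunk); }
```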


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:74
+  const std::vector<Chunk> Chunks;
+  size_t Size;
 };
----------------
this may seem picky, but it's a waste of 8 bytes (particularly for small 
posting lists).
I'd suggest just defining a constant (in Chunk) for the estimated entries per 
chunk (maybe 15 or so?) and just using `Chunks.size() * 
Chunk::ApproxEntriesPerChunk` as a "good enough" estimate.
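
Something like (name and value are just placeholders):
```
// In Chunk:
static constexpr size_t ApproxEntriesPerChunk = 15;

// In PostingList, wherever Size is currently returned:
return Chunks.size() * Chunk::ApproxEntriesPerChunk;
```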


================
Comment at: clang-tools-extra/clangd/index/dex/fuzzer/VByteFuzzer.cpp:1
+//===-- VByteFuzzer.cpp - Fuzz VByte Posting List encoding ----------------===//
+//
----------------
For better or worse, adding a fuzzer to the open-source project is pretty high 
ceremony (CMake stuff, subdirectory, oss-fuzz configuration, following up on 
bugs).

I'm not sure the maintenance cost is justified here. Can we just run the fuzzer 
but not check it in?


https://reviews.llvm.org/D52300


