ioeric added a comment.

Thanks for the patch!

In, @kbobyrev wrote:

> Complete the tests, finish the implementation.
> One thought about prefix match suggestion: we should either make it more 
> explicit for the index (e.g. introduce `prefixMatch` and dispatch 
> `fuzzyMatch` to prefix matching in case query only contains one "true" 
> symbol) or document this properly. While, as I wrote earlier, I totally 
> support the idea of prefix matching queries of length 1 it might not align 
> with some user expectations and it's also very implicit if we just generate 
> tokens this way and don't mention it anywhere in the `DexIndex` 
> implementation.
> @ioeric, @ilya-biryukov any thoughts?

(copied my inline comment :)
We should definitely add documentation about it. It should be pretty simple 
IMO. As the behavior should be easy to infer from samples, and it shouldn't be 
too surprising for users, I think it would be OK to consider it as 
implementation detail (like details in how exactly trigrams are generated) 
without exposing new interfaces for them.

Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:26
-// FIXME(kbobyrev): Deal with short symbol symbol names. A viable approach 
-// be generating unigrams and bigrams here, too. This would prevent symbol 
-// from applying fuzzy matching on a tremendous number of symbols and allow
-// supplementary retrieval for short queries.
-// Short names (total segment length <3 characters) are currently ignored.
+// FIXME(kbobyrev): Posting lists for incomplete trigrams (one/two symbols) are
+// likely to be very dense and hence require special attention so that the 
It's a nice to optimization have when we run into oversized posting lists, but 
this is not necessarily restricted to unigram posting lists. I think the FIXME 
should live near the general posting list code. I think it's probably also ok 
to leave it out; it's hard to forget if we do run into problem in the future ;)

Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:74
+    // symbol of the identifier.
+    if (!FoundFirstSymbol) {
+      FoundFirstSymbol = true;
Could this be pulled out of the loop? I think what we want is just 
`LowercaseIdentifier[0]` right?

I'd probably also pulled that into a function, as the function body is getting 

Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:87
+      Chars = {{LowercaseIdentifier[I], LowercaseIdentifier[J], END_MARKER, 
+      const auto Bigram = Token(Token::Kind::Trigram,;
I think we could be more restrictive on bigram generation. I think a bigram 
prefix of identifier and a bigram prefix of the HEAD substring should work 
pretty well in practice. For example, for `StringStartsWith`, you would have 
`st$` and `ss$` (prefix of "SSW").


Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:115
+// FIXME(kbobyrev): Correctly handle empty trigrams "$$$".
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) {
   // Apply fuzzy matching text segmentation.
It seems to me that what we need for short queries is simply:
if (Query.empty()) {
   // return empty token
if (Query.size() == 1) return {Query + "$$"};
if (Query.size() == 2) return {Query + "$"};

// Longer queries...

Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:39
+// member of Token even though it's Trigram-specific?
+const auto END_MARKER = '$';
Any reason why this should be exposed?

Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:62
 /// belongs to more than one class it is only inserted once.
+// FIXME(kbobyrev): Document somewhere in DexIndex that for queries of size 1
+// it will do prefix matching instead of fuzzy matching on the identifier 
The behavior should be easy to infer from samples. As long as it's not totally 
expected, I think it would be OK to consider treat as implementation detail 
(like details in how trigrams are generated).

Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:74
+/// For short queries (Query contains less than 3 letters and digits) this
+/// returns a single trigram with all valid symbols.
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query);
I'm not quite sure what this means. Could you elaborate?

cfe-commits mailing list

Reply via email to