kbobyrev added inline comments.

Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:74
+    // symbol of the identifier.
+    if (!FoundFirstSymbol) {
+      FoundFirstSymbol = true;
ioeric wrote:
> Could this be pulled out of the loop? I think what we want is just 
> `LowercaseIdentifier[0]` right?
> I'd probably also pulled that into a function, as the function body is 
> getting larger.
Same as elsewhere, if we have `__builtin_whatever` the it's not actually the 
first symbol of the lowercase identifier.

Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:87
+      Chars = {{LowercaseIdentifier[I], LowercaseIdentifier[J], END_MARKER, 
+      const auto Bigram = Token(Token::Kind::Trigram, Chars.data());
ioeric wrote:
> I think we could be more restrictive on bigram generation. I think a bigram 
> prefix of identifier and a bigram prefix of the HEAD substring should work 
> pretty well in practice. For example, for `StringStartsWith`, you would have 
> `st$` and `ss$` (prefix of "SSW").
Good idea!

Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:115
+// FIXME(kbobyrev): Correctly handle empty trigrams "$$$".
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) {
   // Apply fuzzy matching text segmentation.
ioeric wrote:
> It seems to me that what we need for short queries is simply:
> ```
> if (Query.empty()) {
>    // return empty token
> }
> if (Query.size() == 1) return {Query + "$$"};
> if (Query.size() == 2) return {Query + "$"};
> // Longer queries...
> ```
> ?
That would mean that we expect the query to be "valid", i.e. only consist of 
letters and digits. My concern is about what happens if we have `"u_"` or 
something similar (`"_u", "_u_", "$u$"`, etc) - in that case we would actually 
still have to identify the first valid symbol for the trigram, process the 
string (trim it, etc) which looks very similar to what FuzzyMatching 
`calculateRoles` does.

The current approach is rather straightforward and generic, but I can try to 
change it if you want. My biggest concern is fighting some corner cases and 
ensuring that the query is "right" on the user (index) side, which might turn 
out to be more code and ensuring that the "state" is valid throughout the 

Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:74
+/// For short queries (Query contains less than 3 letters and digits) this
+/// returns a single trigram with all valid symbols.
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query);
ioeric wrote:
> I'm not quite sure what this means. Could you elaborate?
Added an example and reflected in the other comment.


cfe-commits mailing list

Reply via email to