[PATCH] D49417: [clangd] Implement trigram generation algorithm for new symbol index

Kirill Bobyrev via Phabricator via cfe-commits Fri, 20 Jul 2018 01:39:02 -0700

omtcyfz added inline comments.


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:53
+  SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+      : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
+
----------------
ioeric wrote:
> ioeric wrote:
> > Should we also incorporate `Type` into hash?
> I'm wondering if we should use different hashes for different token types. 
> For example, a trigram token "xyz" can be encoded in a 4 -byte int with 
> `('x'<<16) & ('y'<<8) & 'z'`, which seems cheaper than `std::hash`.
Discussed internally: we probably shouldn't since there will be collisions even 
inside a single token namespace when Data will be too long.


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:44
+generateTrigrams(const std::vector<llvm::SmallString<10>> &Segments) {
+  llvm::DenseSet<SearchToken> UniqueTrigrams;
+  std::vector<SearchToken> Trigrams;
----------------
ioeric wrote:
> Could we replace `UniqaueTrigrams`+`Trigrams` with a dense map from hash to 
> token?
I'm not sure how to iterate through the result then. Basically, having Trigrams 
ensures that these trigrams can be iterated after the function execution and 
inserted into the inverted index. Otherwise I should either expose a callback 
to add the generated trigrams or do something similar. Should I do that instead?


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:57
+  // this case.
+  for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
+       ++FirstSegment) {
----------------
ioeric wrote:
> nit: Maybe S1, S2, S3 instead of FirstSegment, ...?
I think it's way less explicit and slightly more confusing. I try not to have 
one-letter names so I guess it's better to have full names instead.


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:68
+        SearchToken Trigram(
+            (*FirstSegment + *SecondSegment + *ThirdSegment).str(),
+            SearchToken::Kind::Trigram);
----------------
ioeric wrote:
> ioeric wrote:
> > This seems wrong... wouldn't this give you a concatenation of three 
> > segments?
> For trigrams, it might make sense to put 3 chars into a `SmallVector<3>` (can 
> be reused) and std::move it into the constructor. Might be cheaper than 
> creating a std::string
But `std::string` would be created anyway, wouldn't it be?


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:87
+
+    for (size_t Position = 0; Position + 2 < Segment.size(); ++Position)
+      Trigrams.push_back(
----------------
ioeric wrote:
> nit: `Position <  Segment.size() - 2` seems more commonly used.
If `Segment.size()` is 1 then this would be `1U - 2U`, which would be something 
very large.


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:138
+
+  for (size_t Index = SegmentStart; Index + 1 < SymbolName.size(); ++Index) {
+    const char CurrentSymbol = SymbolName[Index];
----------------
ioeric wrote:
> Maybe first split name on `_` and them run further upper-lower segmentation 
> on the split segments?
Resolving both of these for now (though they're not really fixed) since I will 
rethink the algorithm given Sam's patch.


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:52
+
+    Custom,
+  };
----------------
sammccall wrote:
> what is this for?
Tags, for example. But yes, for now it's not very clear and probably not 
needed. Dropped this one.


https://reviews.llvm.org/D49417



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D49417: [clangd] Implement trigram generation algorithm for new symbol index

Reply via email to