omtcyfz added inline comments.
================ Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:53 + SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram) + : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {} + ---------------- ioeric wrote: > ioeric wrote: > > Should we also incorporate `Type` into hash? > I'm wondering if we should use different hashes for different token types. > For example, a trigram token "xyz" can be encoded in a 4 -byte int with > `('x'<<16) & ('y'<<8) & 'z'`, which seems cheaper than `std::hash`. Discussed internally: we probably shouldn't since there will be collisions even inside a single token namespace when Data will be too long. ================ Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:44 +generateTrigrams(const std::vector<llvm::SmallString<10>> &Segments) { + llvm::DenseSet<SearchToken> UniqueTrigrams; + std::vector<SearchToken> Trigrams; ---------------- ioeric wrote: > Could we replace `UniqaueTrigrams`+`Trigrams` with a dense map from hash to > token? I'm not sure how to iterate through the result then. Basically, having Trigrams ensures that these trigrams can be iterated after the function execution and inserted into the inverted index. Otherwise I should either expose a callback to add the generated trigrams or do something similar. Should I do that instead? ================ Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:57 + // this case. + for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end(); + ++FirstSegment) { ---------------- ioeric wrote: > nit: Maybe S1, S2, S3 instead of FirstSegment, ...? I think it's way less explicit and slightly more confusing. I try not to have one-letter names so I guess it's better to have full names instead. ================ Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:68 + SearchToken Trigram( + (*FirstSegment + *SecondSegment + *ThirdSegment).str(), + SearchToken::Kind::Trigram); ---------------- ioeric wrote: > ioeric wrote: > > This seems wrong... wouldn't this give you a concatenation of three > > segments? > For trigrams, it might make sense to put 3 chars into a `SmallVector<3>` (can > be reused) and std::move it into the constructor. Might be cheaper than > creating a std::string But `std::string` would be created anyway, wouldn't it be? ================ Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:87 + + for (size_t Position = 0; Position + 2 < Segment.size(); ++Position) + Trigrams.push_back( ---------------- ioeric wrote: > nit: `Position < Segment.size() - 2` seems more commonly used. If `Segment.size()` is 1 then this would be `1U - 2U`, which would be something very large. ================ Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:138 + + for (size_t Index = SegmentStart; Index + 1 < SymbolName.size(); ++Index) { + const char CurrentSymbol = SymbolName[Index]; ---------------- ioeric wrote: > Maybe first split name on `_` and them run further upper-lower segmentation > on the split segments? Resolving both of these for now (though they're not really fixed) since I will rethink the algorithm given Sam's patch. ================ Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:52 + + Custom, + }; ---------------- sammccall wrote: > what is this for? Tags, for example. But yes, for now it's not very clear and probably not needed. Dropped this one. https://reviews.llvm.org/D49417 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits