[PATCH] D72874: [clangd] Add a textual fallback for go-to-definition

Sam McCall via Phabricator via cfe-commits Mon, 02 Mar 2020 05:03:28 -0800

sammccall added a comment.

Thanks! The scope looks good to me now, on to implementation details.
I'm being a bit picky on the behaivor because go-to-def is a heavily-used 
feature, many users won't be expecting what we're doing here, and we can't 
reasonably expect them to understand the failure modes.
So, let's try hard not to fail :-)


This reminds me: it's not completely obvious what set of "act on symbol under 
the cursor" things this should (eventually) apply to.
I think not having e.g. find-references work makes sense - user should navigate 
to a "real" occurrence to resolve the ambiguity, and things like code actions 
are right out.
However having `textDocument/hover` work when we have high confidence in 
results would be really cool.
Obviously nothing in scope for this patch, but it seems worth writing this down 
somewhere, precisely because we shouldn't do it now.

In D72874#1900149 <https://reviews.llvm.org/D72874#1900149>, @nridge wrote:

> I currently handle `lowerCamel`, `UpperCamel`, `CAPS`, and `under_scores`. 
> I've left the others as follow-ups.


(sorry for shifting goalposts, I think `CAPS` may be too broad. Left a comment 
inline)

>> - if you get more than 3 results, and none from current file, maybe don't 
>> return anything, as confidence is too low. Or try a stricter query...
> 
> I implemented this, but my testing shows this causes a lot of results for 
> class names to be excluded. The reason appears to be that `fuzzyFind()` 
> returns the class and each of its constructors as distinct results, so if a 
> class has more than two constructors, we'll have more than 3 results (and 
> typically the class is declared in a different file).

I think we should just drop constructor results, they'll always have this 
problem.
(There are other cases but this is the biggest).

>> - handle the most common case of non-indexable symbols (local symbols) by 
>> running the query against the closest occurrence of the token in code.
> 
> I've left this as a follow-up.

Makes sense. I think this there's not a lot of new complexity here, we have the 
major pieces (getWordAtPosition, TokenBuffer, SelectionTree, targetDecl, index) 
but integration is definitely substantial.

I'd suggest we go down that path before adding complexity for the indexed-based 
path though, because I suspect it's going to handle many of the practical 
situations where the index-based approach needs a lot of help (and vice-versa).



================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:313
 
 SourceLocation getBeginningOfIdentifier(const Position &Pos,
                                         const SourceManager &SM,
----------------
@kadircet is working on getting rid of this function because creating raw 
lexers is is wasteful and not actually very powerful. Mostly we're moving to 
syntax::TokenBuffer, which records actual lexed tokens, but that doesn't apply 
here.

The examples in the tests seem like they'd be covered by something *really* 
simple, like enclosing identifier chars:

```
unsigned Begin, End;
for (Begin = Offset; Begin > 0 && isIdentifierBody(Code[Begin-1]); --BeginEnd) 
{}
for (End = Offset; End < Code.size() && isIdentifierBody(Code[End]); ++End) {}
return Code.slice(Begin, End);
```

(Lexer::isIdentifierBodyChar requires langopts but just passes through 
DollarIdents to isIdentifierBody, and I don't think we care much about 
identifiers with $ in them.)

If we really want to do something more subtle here, we should check it in 
SourceCodeTests.


================
Comment at: clang-tools-extra/clangd/SourceCode.h:93
+/// the entire comment or string token.
+SourceRange getWordAtPosition(const Position &Pos, const SourceManager &SM,
+                              const LangOptions &LangOpts);
----------------
consider moving the isLikelyToBeIdentifier check inside. The current API is 
pretty general and it's not clear yet what (else) it's good for so it's nice to 
direct towards intended usage.

Also doing the identifier check inside this function is more convenient when it 
relies on markers outside the identifier range (like doxygen `\p` or 
backtick-quoted identifiers)

That said, you may still want to return the range when it's not a likely 
identifier, with a signature like `StringRef getWordAtPosition(bool 
*LikelyIdentifier = nullptr)`. I'm thinking of the future case where the caller 
wants to find a nearby matching token and resolve it - resolving belongs in the 
caller so there's not much point having this function duplicate the check.


================
Comment at: clang-tools-extra/clangd/SourceCode.h:93
+/// the entire comment or string token.
+SourceRange getWordAtPosition(const Position &Pos, const SourceManager &SM,
+                              const LangOptions &LangOpts);
----------------
sammccall wrote:
> consider moving the isLikelyToBeIdentifier check inside. The current API is 
> pretty general and it's not clear yet what (else) it's good for so it's nice 
> to direct towards intended usage.
> 
> Also doing the identifier check inside this function is more convenient when 
> it relies on markers outside the identifier range (like doxygen `\p` or 
> backtick-quoted identifiers)
> 
> That said, you may still want to return the range when it's not a likely 
> identifier, with a signature like `StringRef getWordAtPosition(bool 
> *LikelyIdentifier = nullptr)`. I'm thinking of the future case where the 
> caller wants to find a nearby matching token and resolve it - resolving 
> belongs in the caller so there's not much point having this function 
> duplicate the check.
This doesn't use the SourceManager-structure of the file, so the natural 
signature would be `StringRef getWordAtPosition(StringRef Code, unsigned 
Offset)`.

(what are the practical cases where langopts is relevant?)


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:198
+bool isLikelyToBeIdentifier(StringRef Word) {
+  // Word contains underscore
+  if (Word.contains('_')) {
----------------
nit: mention snake_case, MACRO_CASE?


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:202
+  }
+  // Word contains capital letter other than at beginning
+  if (Word.substr(1).find_if([](char C) { return std::isupper(C); }) !=
----------------
nit: can you mention this catches lowerCamel and UpperCamel


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:203
+  // Word contains capital letter other than at beginning
+  if (Word.substr(1).find_if([](char C) { return std::isupper(C); }) !=
+      StringRef::npos) {
----------------
nit: prefer llvm::isUppercase to avoid locales


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:203
+  // Word contains capital letter other than at beginning
+  if (Word.substr(1).find_if([](char C) { return std::isupper(C); }) !=
+      StringRef::npos) {
----------------
sammccall wrote:
> nit: prefer llvm::isUppercase to avoid locales
this will fire for initialisms like `HTTP`.
I think we want to require *both* upper and lowercase letters.


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:222
+      return L.first > R.first;
+    return L.second.Name < R.second.Name; // Earlier name is better.
+  }
----------------
I think this is dead - we're just sorting by score.


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:226
+
+std::vector<LocatedSymbol> navigationFallback(ParsedAST &AST,
+                                              const SymbolIndex *Index,
----------------
this function should have a high-level comment describing the strategy and the 
limitations (e.g. idea of extending it to resolve nearby matching tokens).

A name like `locateSymbolNamedTextuallyAt` would better describe what this 
does, rather than what its caller does.

I would strongly consider exposing this function publicly for the detailed 
tests, and only smoke-testing it through `locateSymbolAt`. Having to break the 
AST in tests or otherwise rely on the "primary" logic not working is brittle 
and hard to verify.


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:240
+  Req.ProximityPaths = {MainFilePath};
+  // FIXME: Set Req.Scopes to the lexically enclosing scopes.
+  // For extra strictness, consider AnyScope=false.
----------------
FWIW the API for this is visibleNamespaces() from SourceCode.cpp.
(No enclosing classes, but I suspect we can live without them once we have a 
nearby-tokens solution too)


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:243
+  Req.AnyScope = true;
+  // Choose a limit that's large enough that it contains the user's desired
+  // target even in the presence of some false positives, but small enough that
----------------
If we're bailing out on >3, I think this limit should be aiming to detect when 
there's >3, and avoid fetching way too much data, but *not* trying to avoid 
noise.
(I'd suggest 10 or so)


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:248
+  TopN<ScoredLocatedSymbol, ScoredSymbolGreater> Top(*Req.Limit);
+  FuzzyMatcher Filter(Req.Query);
+  auto MainFileURI = URIForFile::canonicalize(MainFilePath, MainFilePath);
----------------
This seems dead, you're requiring exact matches, these will always have the 
same score.


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:250
+  auto MainFileURI = URIForFile::canonicalize(MainFilePath, MainFilePath);
+  bool HaveResultInMainFile = false;
+  Index->fuzzyFind(Req, [&](const Symbol &Sym) {
----------------
This is an interesting signal, I think there are two sensible ways to go about 
it:
 - assume results in this file are more likely accurate than those in other 
files. In this case we should at minimum be using this in ranking, but really 
we should just drop all cross-file results if we have an in-file one.
 - don't rely on index for main-file cases, and rely on "find nearby matching 
token and resolve it instead". That can easily handled cases defined/referenced 
in the main-file with sufficient accuracy, including non-indexed symbols. So 
here we can assume this signal is always false, and drop it.



================
Comment at: clang-tools-extra/clangd/XRefs.cpp:251
+  bool HaveResultInMainFile = false;
+  Index->fuzzyFind(Req, [&](const Symbol &Sym) {
+    auto Loc = symbolToLocation(Sym, MainFilePath);
----------------
BTW I think the answer for constructors is just to drop all constructor results 
here.

(This also affects template specializations which I think we can not worry 
about, and virtual method hierarchies which are more painful but I also 
wouldn't try to fix now)


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:252
+  Index->fuzzyFind(Req, [&](const Symbol &Sym) {
+    auto Loc = symbolToLocation(Sym, MainFilePath);
+    if (!Loc) {
----------------
I'm not sure why we're using SymbolToLocation here:
 - Main file URI check: the `Symbol` has URIs. They need to be canonicalized to 
file URIs before comparison. This allows checking both decl and def location.
 - PreferredDeclaration and Definition can be more easily set directly from the 
`Symbol`


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:258
+
+    // For now, only consider exact name matches, including case.
+    // This is to avoid too many false positives.
----------------
I wouldn't bother qualifying this as "for now". Any code is subject to change 
in the future, but requiring an exact name match for index-based results seems 
more like a design decision than a fixme.


================
Comment at: clang-tools-extra/clangd/XRefs.cpp:284
+    else {
+      log("Navigation fallback: {0} didn't match query {1}", Sym.Name,
+          Filter.pattern());
----------------
I don't think this should be logged, particularly by default - it doesn't 
really indicate anything other than we should have a "look up symbol by name" 
API

(ok, actually I think this is just dead code because we've already checked name 
above)


================
Comment at: clang-tools-extra/clangd/unittests/XRefsTests.cpp:589
 
+TEST(LocateSymbol, Textual) {
+  const char *Tests[] = {
----------------
`#ifdef`'d out code is another interesting motivation worth testing.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D72874/new/

https://reviews.llvm.org/D72874



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D72874: [clangd] Add a textual fallback for go-to-definition

Reply via email to