[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-13 Thread Haojian Wu via Phabricator via cfe-commits
hokein added inline comments.



Comment at: clangd/index/Index.h:52
+private:
+  friend class llvm::DenseMapInfo;
+

sammccall wrote:
> ioeric wrote:
> > warning: class 'DenseMapInfo' was previously declared as a struct 
> > [-Wmismatched-tags]
> Fixed in rCTE320554
Thanks for the fix!


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-13 Thread Sam McCall via Phabricator via cfe-commits
sammccall added inline comments.



Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

malaperle wrote:
> sammccall wrote:
> > malaperle wrote:
> > > malaperle wrote:
> > > > sammccall wrote:
> > > > > hokein wrote:
> > > > > > malaperle wrote:
> > > > > > > sammccall wrote:
> > > > > > > > hokein wrote:
> > > > > > > > > malaperle wrote:
> > > > > > > > > > I think it would be nice to have methods as an interface to 
> > > > > > > > > > get this data instead of storing them directly. So that an 
> > > > > > > > > > index-on-disk could go fetch the data. Especially the 
> > > > > > > > > > occurrences which can take a lot of memory (I'm working on 
> > > > > > > > > > a branch that does that). But perhaps defining that 
> > > > > > > > > > interface is not within the scope of this patch and could 
> > > > > > > > > > be better discussed in D40548 ?
> > > > > > > > > I agree. We can't load all the symbol occurrences into the 
> > > > > > > > > memory since they are too large. We need to design interface 
> > > > > > > > > for the symbol occurrences. 
> > > > > > > > > 
> > > > > > > > > We could discuss the interface here, but CodeCompletion is 
> > > > > > > > > the main thing which this patch focuses on. 
> > > > > > > > > We can't load all the symbol occurrences into the memory 
> > > > > > > > > since they are too large
> > > > > > > > 
> > > > > > > > I've heard this often, but never backed up by data :-)
> > > > > > > > 
> > > > > > > > Naively an array of references for a symbol could be doc ID + 
> > > > > > > > offset + length, let's say 16 bytes.
> > > > > > > > 
> > > > > > > > If a source file consisted entirely of references to 
> > > > > > > > 1-character symbols separated by punctuation (1 reference per 2 
> > > > > > > > bytes) then the total size of these references would be 8x the 
> > > > > > > > size of the source file - in practice much less. That's not 
> > > > > > > > very big.
> > > > > > > > 
> > > > > > > > (Maybe there are edge cases with macros/templates, but we can 
> > > > > > > > keep them under control)
> > > > > > > I'd have to break down how much memory it used by what, I'll come 
> > > > > > > back to you on that. Indexing llvm with ClangdIndexDataStorage, 
> > > > > > > which is pretty packed is about 200MB. That's already a lot 
> > > > > > > considering we want to index code bases many times bigger. But 
> > > > > > > I'll try to come up with more precise numbers. I'm open to 
> > > > > > > different strategies.
> > > > > > > 
> > > > > > I can see two points here:
> > > > > > 
> > > > > > 1. For all symbol occurrences of a TU, it is not quite large, and 
> > > > > > we can keep them in memory.
> > > > > > 2. For all symbol occurrences of a whole project, it's not a good 
> > > > > > idea to load all of them into memory (For LLVM project, the size of 
> > > > > > YAML dataset is ~1.2G).  
> > > > > (This is still a sidebar - not asking for any changes)
> > > > > 
> > > > > The YAML dataset is not a good proxy for how big the data is (at 
> > > > > least without an effort to estimate constant factor).
> > > > > And "it's not a good idea" isn't an assertion that can hold without 
> > > > > reasons, assumptions, and data.
> > > > > If the size turns out to be, say, 120MB for LLVM, and we want to 
> > > > > scale to 10x that, and we're spending 500MB/file for ASTs, then it 
> > > > > might well be a good trade.
> > > > > The YAML dataset is not a good proxy for how big the data is (at 
> > > > > least without an effort to estimate constant factor).
> > > > 
> > > > Indeed. I'll try to come up with more realistic numbers. There are 
> > > > other things not accounted for in the 16 bytes mentioned above, like 
> > > > storing roles and relations.
> > > > 
> > > > > 500MB/file for ASTs
> > > > 
> > > > What do you mean? 500MB worth of occurrences per file? Or Preambles 
> > > > perhaps?
> > > > What do you mean? 500MB worth of occurrences per file? Or Preambles 
> > > > perhaps?
> > > 
> > > Oh I see, the AST must be in memory for fast reparse. I just tried 
> > > opening 3 files at the same time I it was already around 500MB. Hmm, 
> > > that's a bit alarming.
> > Right, just that we have to consider RAM usage for the index in the context 
> > of clangd's overall requirements - if other parts of clangd use 1G of ram 
> > for typical work on a large project, then we shouldn't rule out spending a 
> > couple of 100MB on the index if it adds a lot of value.
> Agreed we have to consider the overall requirements. I think over 1GB of RAM 
> is not good for our use case, whether is comes from the AST or index. I think 
> it's perfectly normal if we have different requirements but we can see 
> discuss how to design things so there are options to use either more RAM or 
> disk space. It seems the AST would be the most important factor for now so 
> perhaps 

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-12 Thread Marc-Andre Laperle via Phabricator via cfe-commits
malaperle added inline comments.



Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

sammccall wrote:
> malaperle wrote:
> > malaperle wrote:
> > > sammccall wrote:
> > > > hokein wrote:
> > > > > malaperle wrote:
> > > > > > sammccall wrote:
> > > > > > > hokein wrote:
> > > > > > > > malaperle wrote:
> > > > > > > > > I think it would be nice to have methods as an interface to 
> > > > > > > > > get this data instead of storing them directly. So that an 
> > > > > > > > > index-on-disk could go fetch the data. Especially the 
> > > > > > > > > occurrences which can take a lot of memory (I'm working on a 
> > > > > > > > > branch that does that). But perhaps defining that interface 
> > > > > > > > > is not within the scope of this patch and could be better 
> > > > > > > > > discussed in D40548 ?
> > > > > > > > I agree. We can't load all the symbol occurrences into the 
> > > > > > > > memory since they are too large. We need to design interface 
> > > > > > > > for the symbol occurrences. 
> > > > > > > > 
> > > > > > > > We could discuss the interface here, but CodeCompletion is the 
> > > > > > > > main thing which this patch focuses on. 
> > > > > > > > We can't load all the symbol occurrences into the memory since 
> > > > > > > > they are too large
> > > > > > > 
> > > > > > > I've heard this often, but never backed up by data :-)
> > > > > > > 
> > > > > > > Naively an array of references for a symbol could be doc ID + 
> > > > > > > offset + length, let's say 16 bytes.
> > > > > > > 
> > > > > > > If a source file consisted entirely of references to 1-character 
> > > > > > > symbols separated by punctuation (1 reference per 2 bytes) then 
> > > > > > > the total size of these references would be 8x the size of the 
> > > > > > > source file - in practice much less. That's not very big.
> > > > > > > 
> > > > > > > (Maybe there are edge cases with macros/templates, but we can 
> > > > > > > keep them under control)
> > > > > > I'd have to break down how much memory it used by what, I'll come 
> > > > > > back to you on that. Indexing llvm with ClangdIndexDataStorage, 
> > > > > > which is pretty packed is about 200MB. That's already a lot 
> > > > > > considering we want to index code bases many times bigger. But I'll 
> > > > > > try to come up with more precise numbers. I'm open to different 
> > > > > > strategies.
> > > > > > 
> > > > > I can see two points here:
> > > > > 
> > > > > 1. For all symbol occurrences of a TU, it is not quite large, and we 
> > > > > can keep them in memory.
> > > > > 2. For all symbol occurrences of a whole project, it's not a good 
> > > > > idea to load all of them into memory (For LLVM project, the size of 
> > > > > YAML dataset is ~1.2G).  
> > > > (This is still a sidebar - not asking for any changes)
> > > > 
> > > > The YAML dataset is not a good proxy for how big the data is (at least 
> > > > without an effort to estimate constant factor).
> > > > And "it's not a good idea" isn't an assertion that can hold without 
> > > > reasons, assumptions, and data.
> > > > If the size turns out to be, say, 120MB for LLVM, and we want to scale 
> > > > to 10x that, and we're spending 500MB/file for ASTs, then it might well 
> > > > be a good trade.
> > > > The YAML dataset is not a good proxy for how big the data is (at least 
> > > > without an effort to estimate constant factor).
> > > 
> > > Indeed. I'll try to come up with more realistic numbers. There are other 
> > > things not accounted for in the 16 bytes mentioned above, like storing 
> > > roles and relations.
> > > 
> > > > 500MB/file for ASTs
> > > 
> > > What do you mean? 500MB worth of occurrences per file? Or Preambles 
> > > perhaps?
> > > What do you mean? 500MB worth of occurrences per file? Or Preambles 
> > > perhaps?
> > 
> > Oh I see, the AST must be in memory for fast reparse. I just tried opening 
> > 3 files at the same time I it was already around 500MB. Hmm, that's a bit 
> > alarming.
> Right, just that we have to consider RAM usage for the index in the context 
> of clangd's overall requirements - if other parts of clangd use 1G of ram for 
> typical work on a large project, then we shouldn't rule out spending a couple 
> of 100MB on the index if it adds a lot of value.
Agreed we have to consider the overall requirements. I think over 1GB of RAM is 
not good for our use case, whether is comes from the AST or index. I think it's 
perfectly normal if we have different requirements but we can see discuss how 
to design things so there are options to use either more RAM or disk space. It 
seems the AST would be the most important factor for now so perhaps it's 
something we should start investigating/discussing.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-12 Thread Eric Liu via Phabricator via cfe-commits
ioeric added inline comments.



Comment at: clangd/index/Index.h:52
+private:
+  friend class llvm::DenseMapInfo;
+

warning: class 'DenseMapInfo' was previously declared as a struct 
[-Wmismatched-tags]


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-12 Thread Haojian Wu via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rCTE320486: [clangd] Introduce a Symbol class. 
(authored by hokein, committed by ).

Changed prior to commit:
  https://reviews.llvm.org/D40897?vs=126538=126548#toc

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897

Files:
  clangd/CMakeLists.txt
  clangd/index/Index.cpp
  clangd/index/Index.h
  clangd/index/SymbolCollector.cpp
  clangd/index/SymbolCollector.h
  unittests/clangd/CMakeLists.txt
  unittests/clangd/SymbolCollectorTests.cpp

Index: unittests/clangd/CMakeLists.txt
===
--- unittests/clangd/CMakeLists.txt
+++ unittests/clangd/CMakeLists.txt
@@ -16,6 +16,7 @@
   JSONExprTests.cpp
   TestFS.cpp
   TraceTests.cpp
+  SymbolCollectorTests.cpp
   )
 
 target_link_libraries(ClangdTests
Index: unittests/clangd/SymbolCollectorTests.cpp
===
--- unittests/clangd/SymbolCollectorTests.cpp
+++ unittests/clangd/SymbolCollectorTests.cpp
@@ -0,0 +1,110 @@
+//===-- SymbolCollectorTests.cpp  ---*- C++ -*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===--===//
+
+#include "index/SymbolCollector.h"
+#include "clang/Index/IndexingAction.h"
+#include "clang/Basic/FileManager.h"
+#include "clang/Basic/FileSystemOptions.h"
+#include "clang/Basic/VirtualFileSystem.h"
+#include "clang/Frontend/CompilerInstance.h"
+#include "clang/Tooling/Tooling.h"
+#include "llvm/ADT/IntrusiveRefCntPtr.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/MemoryBuffer.h"
+#include "gtest/gtest.h"
+#include "gmock/gmock.h"
+
+#include 
+#include 
+
+using testing::UnorderedElementsAre;
+using testing::Eq;
+using testing::Field;
+
+// GMock helpers for matching Symbol.
+MATCHER_P(QName, Name, "") { return arg.second.QualifiedName == Name; }
+
+namespace clang {
+namespace clangd {
+
+namespace {
+class SymbolIndexActionFactory : public tooling::FrontendActionFactory {
+ public:
+  SymbolIndexActionFactory() = default;
+
+  clang::FrontendAction *create() override {
+index::IndexingOptions IndexOpts;
+IndexOpts.SystemSymbolFilter =
+index::IndexingOptions::SystemSymbolFilterKind::All;
+IndexOpts.IndexFunctionLocals = false;
+Collector = std::make_shared();
+FrontendAction *Action =
+index::createIndexingAction(Collector, IndexOpts, nullptr).release();
+return Action;
+  }
+
+  std::shared_ptr Collector;
+};
+
+class SymbolCollectorTest : public ::testing::Test {
+public:
+  bool runSymbolCollector(StringRef HeaderCode, StringRef MainCode) {
+llvm::IntrusiveRefCntPtr InMemoryFileSystem(
+new vfs::InMemoryFileSystem);
+llvm::IntrusiveRefCntPtr Files(
+new FileManager(FileSystemOptions(), InMemoryFileSystem));
+
+const std::string FileName = "symbol.cc";
+const std::string HeaderName = "symbols.h";
+auto Factory = llvm::make_unique();
+
+tooling::ToolInvocation Invocation(
+{"symbol_collector", "-fsyntax-only", "-std=c++11", FileName},
+Factory->create(), Files.get(),
+std::make_shared());
+
+InMemoryFileSystem->addFile(HeaderName, 0,
+llvm::MemoryBuffer::getMemBuffer(HeaderCode));
+
+std::string Content = "#include\"" + std::string(HeaderName) + "\"";
+Content += "\n" + MainCode.str();
+InMemoryFileSystem->addFile(FileName, 0,
+llvm::MemoryBuffer::getMemBuffer(Content));
+Invocation.run();
+Symbols = Factory->Collector->takeSymbols();
+return true;
+  }
+
+protected:
+  SymbolSlab Symbols;
+};
+
+TEST_F(SymbolCollectorTest, CollectSymbol) {
+  const std::string Header = R"(
+class Foo {
+  void f();
+};
+void f1();
+inline void f2() {}
+  )";
+  const std::string Main = R"(
+namespace {
+void ff() {} // ignore
+}
+void f1() {}
+  )";
+  runSymbolCollector(Header, Main);
+  EXPECT_THAT(Symbols, UnorderedElementsAre(QName("Foo"), QName("Foo::f"),
+QName("f1"), QName("f2")));
+}
+
+} // namespace
+} // namespace clangd
+} // namespace clang
Index: clangd/index/SymbolCollector.cpp
===
--- clangd/index/SymbolCollector.cpp
+++ clangd/index/SymbolCollector.cpp
@@ -0,0 +1,102 @@
+//===--- SymbolCollector.cpp -*- C++-*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-12 Thread Haojian Wu via Phabricator via cfe-commits
hokein marked an inline comment as done.
hokein added a comment.

I'm going to submit this patch to unblock the stuff in 
https://reviews.llvm.org/D40548. Would be happy to address any further comments 
afterwards.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-12 Thread Haojian Wu via Phabricator via cfe-commits
hokein marked an inline comment as done.
hokein added inline comments.



Comment at: clangd/index/CMakeLists.txt:5
+
+add_clang_library(clangdIndex
+  Index.cpp

sammccall wrote:
> hmm, I'm not sure whether we actually want this to be a separate library. 
> This means we can't depend on anything from elsewhere in clangd, and may 
> force us to create more components.
> 
> e.g. if we want to pass contexts into the index, or if we want to reuse LSP 
> data models from protocol.h.
> 
> Maybe we should make this part of the main clangd lib, what do you think?
As discussed offline, made it to the main clangd library.



Comment at: clangd/index/Index.cpp:34
+
+SymbolSlab::const_iterator SymbolSlab::find(const SymbolID& SymID) const {
+  return Symbols.find(SymID);

sammccall wrote:
> hokein wrote:
> > sammccall wrote:
> > > assert frozen? (and in begin())
> > We may not do the assert "frozen" in these getter methods -- as writers may 
> > also have needs to access these APIs (e.g. checking whether SymbolID is 
> > already in the slab before constructing and inserting a new Symbol).
> Right, I'd specifically like not to allow that if possible - it makes it very 
> difficult to understand what the invariants are and what operations are 
> allowed.
> 
> I think there's no such need now, right? If we want to add one in future, we 
> can relax this check, or add a builder, or similar - at least we should 
> explicitly consider it.
hmm, SymbolCollector has such need at the moment -- ` if (Symbols.find(ID) != 
Symbols.end()) return true;` to avoid creating a duplicated Symbol.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-12 Thread Sam McCall via Phabricator via cfe-commits
sammccall added inline comments.



Comment at: clangd/index/CMakeLists.txt:5
+
+add_clang_library(clangdIndex
+  Index.cpp

hmm, I'm not sure whether we actually want this to be a separate library. This 
means we can't depend on anything from elsewhere in clangd, and may force us to 
create more components.

e.g. if we want to pass contexts into the index, or if we want to reuse LSP 
data models from protocol.h.

Maybe we should make this part of the main clangd lib, what do you think?



Comment at: clangd/index/Index.cpp:34
+
+SymbolSlab::const_iterator SymbolSlab::find(const SymbolID& SymID) const {
+  return Symbols.find(SymID);

hokein wrote:
> sammccall wrote:
> > assert frozen? (and in begin())
> We may not do the assert "frozen" in these getter methods -- as writers may 
> also have needs to access these APIs (e.g. checking whether SymbolID is 
> already in the slab before constructing and inserting a new Symbol).
Right, I'd specifically like not to allow that if possible - it makes it very 
difficult to understand what the invariants are and what operations are allowed.

I think there's no such need now, right? If we want to add one in future, we 
can relax this check, or add a builder, or similar - at least we should 
explicitly consider it.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-12 Thread Haojian Wu via Phabricator via cfe-commits
hokein updated this revision to Diff 126538.
hokein added a comment.

- Get rid of clangdIndex library, using the existing clangDaemon library.
- Remove the getID() method.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897

Files:
  clangd/CMakeLists.txt
  clangd/index/Index.cpp
  clangd/index/Index.h
  clangd/index/SymbolCollector.cpp
  clangd/index/SymbolCollector.h
  unittests/clangd/CMakeLists.txt
  unittests/clangd/SymbolCollectorTests.cpp

Index: unittests/clangd/SymbolCollectorTests.cpp
===
--- /dev/null
+++ unittests/clangd/SymbolCollectorTests.cpp
@@ -0,0 +1,110 @@
+//===-- SymbolCollectorTests.cpp  ---*- C++ -*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===--===//
+
+#include "index/SymbolCollector.h"
+#include "clang/Index/IndexingAction.h"
+#include "clang/Basic/FileManager.h"
+#include "clang/Basic/FileSystemOptions.h"
+#include "clang/Basic/VirtualFileSystem.h"
+#include "clang/Frontend/CompilerInstance.h"
+#include "clang/Tooling/Tooling.h"
+#include "llvm/ADT/IntrusiveRefCntPtr.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/MemoryBuffer.h"
+#include "gtest/gtest.h"
+#include "gmock/gmock.h"
+
+#include 
+#include 
+
+using testing::UnorderedElementsAre;
+using testing::Eq;
+using testing::Field;
+
+// GMock helpers for matching Symbol.
+MATCHER_P(QName, Name, "") { return arg.second.QualifiedName == Name; }
+
+namespace clang {
+namespace clangd {
+
+namespace {
+class SymbolIndexActionFactory : public tooling::FrontendActionFactory {
+ public:
+  SymbolIndexActionFactory() = default;
+
+  clang::FrontendAction *create() override {
+index::IndexingOptions IndexOpts;
+IndexOpts.SystemSymbolFilter =
+index::IndexingOptions::SystemSymbolFilterKind::All;
+IndexOpts.IndexFunctionLocals = false;
+Collector = std::make_shared();
+FrontendAction *Action =
+index::createIndexingAction(Collector, IndexOpts, nullptr).release();
+return Action;
+  }
+
+  std::shared_ptr Collector;
+};
+
+class SymbolCollectorTest : public ::testing::Test {
+public:
+  bool runSymbolCollector(StringRef HeaderCode, StringRef MainCode) {
+llvm::IntrusiveRefCntPtr InMemoryFileSystem(
+new vfs::InMemoryFileSystem);
+llvm::IntrusiveRefCntPtr Files(
+new FileManager(FileSystemOptions(), InMemoryFileSystem));
+
+const std::string FileName = "symbol.cc";
+const std::string HeaderName = "symbols.h";
+auto Factory = llvm::make_unique();
+
+tooling::ToolInvocation Invocation(
+{"symbol_collector", "-fsyntax-only", "-std=c++11", FileName},
+Factory->create(), Files.get(),
+std::make_shared());
+
+InMemoryFileSystem->addFile(HeaderName, 0,
+llvm::MemoryBuffer::getMemBuffer(HeaderCode));
+
+std::string Content = "#include\"" + std::string(HeaderName) + "\"";
+Content += "\n" + MainCode.str();
+InMemoryFileSystem->addFile(FileName, 0,
+llvm::MemoryBuffer::getMemBuffer(Content));
+Invocation.run();
+Symbols = Factory->Collector->takeSymbols();
+return true;
+  }
+
+protected:
+  SymbolSlab Symbols;
+};
+
+TEST_F(SymbolCollectorTest, CollectSymbol) {
+  const std::string Header = R"(
+class Foo {
+  void f();
+};
+void f1();
+inline void f2() {}
+  )";
+  const std::string Main = R"(
+namespace {
+void ff() {} // ignore
+}
+void f1() {}
+  )";
+  runSymbolCollector(Header, Main);
+  EXPECT_THAT(Symbols, UnorderedElementsAre(QName("Foo"), QName("Foo::f"),
+QName("f1"), QName("f2")));
+}
+
+} // namespace
+} // namespace clangd
+} // namespace clang
Index: unittests/clangd/CMakeLists.txt
===
--- unittests/clangd/CMakeLists.txt
+++ unittests/clangd/CMakeLists.txt
@@ -15,6 +15,7 @@
   JSONExprTests.cpp
   TestFS.cpp
   TraceTests.cpp
+  SymbolCollectorTests.cpp
   )
 
 target_link_libraries(ClangdTests
Index: clangd/index/SymbolCollector.h
===
--- /dev/null
+++ clangd/index/SymbolCollector.h
@@ -0,0 +1,43 @@
+//===--- SymbolCollector.h ---*- C++-*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===--===//
+
+#include "Index.h"
+
+#include "clang/Index/IndexDataConsumer.h"
+#include "clang/Index/IndexSymbol.h"
+
+namespace 

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-12 Thread Haojian Wu via Phabricator via cfe-commits
hokein added inline comments.



Comment at: clangd/index/Index.cpp:34
+
+SymbolSlab::const_iterator SymbolSlab::find(const SymbolID& SymID) const {
+  return Symbols.find(SymID);

sammccall wrote:
> assert frozen? (and in begin())
We may not do the assert "frozen" in these getter methods -- as writers may 
also have needs to access these APIs (e.g. checking whether SymbolID is already 
in the slab before constructing and inserting a new Symbol).



Comment at: clangd/index/SymbolCollector.h:35
+
+  StringRef getFilename() const {
+return Filename;

sammccall wrote:
> What's this for? Seems like we should be able to handle multiple TUs with one 
> collector?
We don't have particular usage for this method except for testing. Removed it.  


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-12 Thread Haojian Wu via Phabricator via cfe-commits
hokein updated this revision to Diff 126519.
hokein marked 4 inline comments as done.
hokein added a comment.

Address remaining comments.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897

Files:
  clangd/CMakeLists.txt
  clangd/index/CMakeLists.txt
  clangd/index/Index.cpp
  clangd/index/Index.h
  clangd/index/SymbolCollector.cpp
  clangd/index/SymbolCollector.h
  unittests/clangd/CMakeLists.txt
  unittests/clangd/SymbolCollectorTests.cpp

Index: unittests/clangd/SymbolCollectorTests.cpp
===
--- /dev/null
+++ unittests/clangd/SymbolCollectorTests.cpp
@@ -0,0 +1,110 @@
+//===-- SymbolCollectorTests.cpp  ---*- C++ -*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===--===//
+
+#include "index/SymbolCollector.h"
+#include "clang/Index/IndexingAction.h"
+#include "clang/Basic/FileManager.h"
+#include "clang/Basic/FileSystemOptions.h"
+#include "clang/Basic/VirtualFileSystem.h"
+#include "clang/Frontend/CompilerInstance.h"
+#include "clang/Tooling/Tooling.h"
+#include "llvm/ADT/IntrusiveRefCntPtr.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/MemoryBuffer.h"
+#include "gtest/gtest.h"
+#include "gmock/gmock.h"
+
+#include 
+#include 
+
+using testing::UnorderedElementsAre;
+using testing::Eq;
+using testing::Field;
+
+// GMock helpers for matching Symbol.
+MATCHER_P(QName, Name, "") { return arg.second.QualifiedName == Name; }
+
+namespace clang {
+namespace clangd {
+
+namespace {
+class SymbolIndexActionFactory : public tooling::FrontendActionFactory {
+ public:
+  SymbolIndexActionFactory() = default;
+
+  clang::FrontendAction *create() override {
+index::IndexingOptions IndexOpts;
+IndexOpts.SystemSymbolFilter =
+index::IndexingOptions::SystemSymbolFilterKind::All;
+IndexOpts.IndexFunctionLocals = false;
+Collector = std::make_shared();
+FrontendAction *Action =
+index::createIndexingAction(Collector, IndexOpts, nullptr).release();
+return Action;
+  }
+
+  std::shared_ptr Collector;
+};
+
+class SymbolCollectorTest : public ::testing::Test {
+public:
+  bool runSymbolCollector(StringRef HeaderCode, StringRef MainCode) {
+llvm::IntrusiveRefCntPtr InMemoryFileSystem(
+new vfs::InMemoryFileSystem);
+llvm::IntrusiveRefCntPtr Files(
+new FileManager(FileSystemOptions(), InMemoryFileSystem));
+
+const std::string FileName = "symbol.cc";
+const std::string HeaderName = "symbols.h";
+auto Factory = llvm::make_unique();
+
+tooling::ToolInvocation Invocation(
+{"symbol_collector", "-fsyntax-only", "-std=c++11", FileName},
+Factory->create(), Files.get(),
+std::make_shared());
+
+InMemoryFileSystem->addFile(HeaderName, 0,
+llvm::MemoryBuffer::getMemBuffer(HeaderCode));
+
+std::string Content = "#include\"" + std::string(HeaderName) + "\"";
+Content += "\n" + MainCode.str();
+InMemoryFileSystem->addFile(FileName, 0,
+llvm::MemoryBuffer::getMemBuffer(Content));
+Invocation.run();
+Symbols = Factory->Collector->takeSymbols();
+return true;
+  }
+
+protected:
+  SymbolSlab Symbols;
+};
+
+TEST_F(SymbolCollectorTest, CollectSymbol) {
+  const std::string Header = R"(
+class Foo {
+  void f();
+};
+void f1();
+inline void f2() {}
+  )";
+  const std::string Main = R"(
+namespace {
+void ff() {} // ignore
+}
+void f1() {}
+  )";
+  runSymbolCollector(Header, Main);
+  EXPECT_THAT(Symbols, UnorderedElementsAre(QName("Foo"), QName("Foo::f"),
+QName("f1"), QName("f2")));
+}
+
+} // namespace
+} // namespace clangd
+} // namespace clang
Index: unittests/clangd/CMakeLists.txt
===
--- unittests/clangd/CMakeLists.txt
+++ unittests/clangd/CMakeLists.txt
@@ -15,12 +15,14 @@
   JSONExprTests.cpp
   TestFS.cpp
   TraceTests.cpp
+  SymbolCollectorTests.cpp
   )
 
 target_link_libraries(ClangdTests
   PRIVATE
   clangBasic
   clangDaemon
+  clangdIndex
   clangFormat
   clangFrontend
   clangSema
Index: clangd/index/SymbolCollector.h
===
--- /dev/null
+++ clangd/index/SymbolCollector.h
@@ -0,0 +1,43 @@
+//===--- SymbolCollector.h ---*- C++-*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===--===//
+
+#include "Index.h"
+

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-11 Thread Sam McCall via Phabricator via cfe-commits
sammccall accepted this revision.
sammccall added a comment.
This revision is now accepted and ready to land.

\o/




Comment at: clangd/index/Index.cpp:34
+
+SymbolSlab::const_iterator SymbolSlab::find(const SymbolID& SymID) const {
+  return Symbols.find(SymID);

assert frozen? (and in begin())



Comment at: clangd/index/Index.h:45
+  SymbolID(llvm::StringRef USR);
+  std::array HashValue;
+};

nit: make HashValue private?

provide operator== (and use it from DenseMapInfo).



Comment at: clangd/index/Index.h:108
+  static inline clang::clangd::SymbolID getEmptyKey() {
+return clang::clangd::SymbolID("EMPTYKEY");
+  }

nit: you may want to memoize this in a local static variable, rather than 
compute it each time: DenseMap calls it a lot.



Comment at: clangd/index/SymbolCollector.cpp:37
+ << '\n';
+  // Handle symbolic link path cases.
+  // We are trying to get the real file path of the symlink.

Can you spell out here which symbolic link cases we're handling, and what 
problems we're trying to avoid?

Offline, we talked about the CWD being a symlink. But this is a different 
case...



Comment at: clangd/index/SymbolCollector.h:35
+
+  StringRef getFilename() const {
+return Filename;

What's this for? Seems like we should be able to handle multiple TUs with one 
collector?


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-11 Thread Haojian Wu via Phabricator via cfe-commits
hokein updated this revision to Diff 126378.
hokein marked 5 inline comments as done.
hokein added a comment.

Address comments on SymbolID.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897

Files:
  clangd/CMakeLists.txt
  clangd/index/CMakeLists.txt
  clangd/index/Index.cpp
  clangd/index/Index.h
  clangd/index/SymbolCollector.cpp
  clangd/index/SymbolCollector.h
  unittests/clangd/CMakeLists.txt
  unittests/clangd/SymbolCollectorTests.cpp

Index: unittests/clangd/SymbolCollectorTests.cpp
===
--- /dev/null
+++ unittests/clangd/SymbolCollectorTests.cpp
@@ -0,0 +1,112 @@
+//===-- SymbolCollectorTests.cpp  ---*- C++ -*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===--===//
+
+#include "index/SymbolCollector.h"
+#include "clang/Index/IndexingAction.h"
+#include "clang/Basic/FileManager.h"
+#include "clang/Basic/FileSystemOptions.h"
+#include "clang/Basic/VirtualFileSystem.h"
+#include "clang/Frontend/CompilerInstance.h"
+#include "clang/Tooling/Tooling.h"
+#include "llvm/ADT/IntrusiveRefCntPtr.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/MemoryBuffer.h"
+#include "gtest/gtest.h"
+#include "gmock/gmock.h"
+
+#include 
+#include 
+
+using testing::UnorderedElementsAre;
+using testing::Eq;
+using testing::Field;
+
+// GMock helpers for matching Symbol.
+MATCHER_P(QName, Name, "") { return arg.second.QualifiedName == Name; }
+
+namespace clang {
+namespace clangd {
+
+namespace {
+class SymbolIndexActionFactory : public tooling::FrontendActionFactory {
+ public:
+  SymbolIndexActionFactory() = default;
+
+  clang::FrontendAction *create() override {
+index::IndexingOptions IndexOpts;
+IndexOpts.SystemSymbolFilter =
+index::IndexingOptions::SystemSymbolFilterKind::All;
+IndexOpts.IndexFunctionLocals = false;
+Collector = std::make_shared();
+FrontendAction *Action =
+index::createIndexingAction(Collector, IndexOpts, nullptr).release();
+return Action;
+  }
+
+  std::shared_ptr Collector;
+};
+
+class SymbolCollectorTest : public ::testing::Test {
+public:
+  bool runSymbolCollector(StringRef HeaderCode, StringRef MainCode) {
+llvm::IntrusiveRefCntPtr InMemoryFileSystem(
+new vfs::InMemoryFileSystem);
+llvm::IntrusiveRefCntPtr Files(
+new FileManager(FileSystemOptions(), InMemoryFileSystem));
+
+const std::string FileName = "symbol.cc";
+const std::string HeaderName = "symbols.h";
+auto Factory = llvm::make_unique();
+
+tooling::ToolInvocation Invocation(
+{"symbol_collector", "-fsyntax-only", "-std=c++11", FileName},
+Factory->create(), Files.get(),
+std::make_shared());
+
+InMemoryFileSystem->addFile(HeaderName, 0,
+llvm::MemoryBuffer::getMemBuffer(HeaderCode));
+
+std::string Content = "#include\"" + std::string(HeaderName) + "\"";
+Content += "\n" + MainCode.str();
+InMemoryFileSystem->addFile(FileName, 0,
+llvm::MemoryBuffer::getMemBuffer(Content));
+Invocation.run();
+Symbols = Factory->Collector->takeSymbols();
+
+EXPECT_EQ(FileName, Factory->Collector->getFilename());
+return true;
+  }
+
+protected:
+  SymbolSlab Symbols;
+};
+
+TEST_F(SymbolCollectorTest, CollectSymbol) {
+  const std::string Header = R"(
+class Foo {
+  void f();
+};
+void f1();
+inline void f2() {}
+  )";
+  const std::string Main = R"(
+namespace {
+void ff() {} // ignore
+}
+void f1() {}
+  )";
+  runSymbolCollector(Header, Main);
+  EXPECT_THAT(Symbols, UnorderedElementsAre(QName("Foo"), QName("Foo::f"),
+QName("f1"), QName("f2")));
+}
+
+} // namespace
+} // namespace clangd
+} // namespace clang
Index: unittests/clangd/CMakeLists.txt
===
--- unittests/clangd/CMakeLists.txt
+++ unittests/clangd/CMakeLists.txt
@@ -15,12 +15,14 @@
   JSONExprTests.cpp
   TestFS.cpp
   TraceTests.cpp
+  SymbolCollectorTests.cpp
   )
 
 target_link_libraries(ClangdTests
   PRIVATE
   clangBasic
   clangDaemon
+  clangdIndex
   clangFormat
   clangFrontend
   clangSema
Index: clangd/index/SymbolCollector.h
===
--- /dev/null
+++ clangd/index/SymbolCollector.h
@@ -0,0 +1,52 @@
+//===--- SymbolCollector.h ---*- C++-*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-11 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

Thanks for the restructuring? I want to take another pass today, but wanted to 
mention some SymbolID things.




Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

malaperle wrote:
> malaperle wrote:
> > sammccall wrote:
> > > hokein wrote:
> > > > malaperle wrote:
> > > > > sammccall wrote:
> > > > > > hokein wrote:
> > > > > > > malaperle wrote:
> > > > > > > > I think it would be nice to have methods as an interface to get 
> > > > > > > > this data instead of storing them directly. So that an 
> > > > > > > > index-on-disk could go fetch the data. Especially the 
> > > > > > > > occurrences which can take a lot of memory (I'm working on a 
> > > > > > > > branch that does that). But perhaps defining that interface is 
> > > > > > > > not within the scope of this patch and could be better 
> > > > > > > > discussed in D40548 ?
> > > > > > > I agree. We can't load all the symbol occurrences into the memory 
> > > > > > > since they are too large. We need to design interface for the 
> > > > > > > symbol occurrences. 
> > > > > > > 
> > > > > > > We could discuss the interface here, but CodeCompletion is the 
> > > > > > > main thing which this patch focuses on. 
> > > > > > > We can't load all the symbol occurrences into the memory since 
> > > > > > > they are too large
> > > > > > 
> > > > > > I've heard this often, but never backed up by data :-)
> > > > > > 
> > > > > > Naively an array of references for a symbol could be doc ID + 
> > > > > > offset + length, let's say 16 bytes.
> > > > > > 
> > > > > > If a source file consisted entirely of references to 1-character 
> > > > > > symbols separated by punctuation (1 reference per 2 bytes) then the 
> > > > > > total size of these references would be 8x the size of the source 
> > > > > > file - in practice much less. That's not very big.
> > > > > > 
> > > > > > (Maybe there are edge cases with macros/templates, but we can keep 
> > > > > > them under control)
> > > > > I'd have to break down how much memory it used by what, I'll come 
> > > > > back to you on that. Indexing llvm with ClangdIndexDataStorage, which 
> > > > > is pretty packed is about 200MB. That's already a lot considering we 
> > > > > want to index code bases many times bigger. But I'll try to come up 
> > > > > with more precise numbers. I'm open to different strategies.
> > > > > 
> > > > I can see two points here:
> > > > 
> > > > 1. For all symbol occurrences of a TU, it is not quite large, and we 
> > > > can keep them in memory.
> > > > 2. For all symbol occurrences of a whole project, it's not a good idea 
> > > > to load all of them into memory (For LLVM project, the size of YAML 
> > > > dataset is ~1.2G).  
> > > (This is still a sidebar - not asking for any changes)
> > > 
> > > The YAML dataset is not a good proxy for how big the data is (at least 
> > > without an effort to estimate constant factor).
> > > And "it's not a good idea" isn't an assertion that can hold without 
> > > reasons, assumptions, and data.
> > > If the size turns out to be, say, 120MB for LLVM, and we want to scale to 
> > > 10x that, and we're spending 500MB/file for ASTs, then it might well be a 
> > > good trade.
> > > The YAML dataset is not a good proxy for how big the data is (at least 
> > > without an effort to estimate constant factor).
> > 
> > Indeed. I'll try to come up with more realistic numbers. There are other 
> > things not accounted for in the 16 bytes mentioned above, like storing 
> > roles and relations.
> > 
> > > 500MB/file for ASTs
> > 
> > What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?
> > What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?
> 
> Oh I see, the AST must be in memory for fast reparse. I just tried opening 3 
> files at the same time I it was already around 500MB. Hmm, that's a bit 
> alarming.
Right, just that we have to consider RAM usage for the index in the context of 
clangd's overall requirements - if other parts of clangd use 1G of ram for 
typical work on a large project, then we shouldn't rule out spending a couple 
of 100MB on the index if it adds a lot of value.



Comment at: clangd/Symbol.h:72
+
+  void addSymbol(Symbol S);
+  bool hasSymbol(StringRef Identifier) const;

hokein wrote:
> sammccall wrote:
> > This is dangerous if called after reads, as it invalidates iterators and 
> > pointers.
> > I don't think we actually indend to support such mutation, so I'd suggest 
> > adding an explicit freeze() function. addSymbol() is only valid before 
> > freeze(), and reading functions are only valid after. An assert can enforce 
> > this.
> > (This is a cheap version of a builder, which are more painful to write but 
> > may also be worth it).
> > 
> > If we choose not to enforce this at all, the requirement 

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-08 Thread Marc-Andre Laperle via Phabricator via cfe-commits
malaperle added inline comments.



Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

malaperle wrote:
> sammccall wrote:
> > hokein wrote:
> > > malaperle wrote:
> > > > sammccall wrote:
> > > > > hokein wrote:
> > > > > > malaperle wrote:
> > > > > > > I think it would be nice to have methods as an interface to get 
> > > > > > > this data instead of storing them directly. So that an 
> > > > > > > index-on-disk could go fetch the data. Especially the occurrences 
> > > > > > > which can take a lot of memory (I'm working on a branch that does 
> > > > > > > that). But perhaps defining that interface is not within the 
> > > > > > > scope of this patch and could be better discussed in D40548 ?
> > > > > > I agree. We can't load all the symbol occurrences into the memory 
> > > > > > since they are too large. We need to design interface for the 
> > > > > > symbol occurrences. 
> > > > > > 
> > > > > > We could discuss the interface here, but CodeCompletion is the main 
> > > > > > thing which this patch focuses on. 
> > > > > > We can't load all the symbol occurrences into the memory since they 
> > > > > > are too large
> > > > > 
> > > > > I've heard this often, but never backed up by data :-)
> > > > > 
> > > > > Naively an array of references for a symbol could be doc ID + offset 
> > > > > + length, let's say 16 bytes.
> > > > > 
> > > > > If a source file consisted entirely of references to 1-character 
> > > > > symbols separated by punctuation (1 reference per 2 bytes) then the 
> > > > > total size of these references would be 8x the size of the source 
> > > > > file - in practice much less. That's not very big.
> > > > > 
> > > > > (Maybe there are edge cases with macros/templates, but we can keep 
> > > > > them under control)
> > > > I'd have to break down how much memory it used by what, I'll come back 
> > > > to you on that. Indexing llvm with ClangdIndexDataStorage, which is 
> > > > pretty packed is about 200MB. That's already a lot considering we want 
> > > > to index code bases many times bigger. But I'll try to come up with 
> > > > more precise numbers. I'm open to different strategies.
> > > > 
> > > I can see two points here:
> > > 
> > > 1. For all symbol occurrences of a TU, it is not quite large, and we can 
> > > keep them in memory.
> > > 2. For all symbol occurrences of a whole project, it's not a good idea to 
> > > load all of them into memory (For LLVM project, the size of YAML dataset 
> > > is ~1.2G).  
> > (This is still a sidebar - not asking for any changes)
> > 
> > The YAML dataset is not a good proxy for how big the data is (at least 
> > without an effort to estimate constant factor).
> > And "it's not a good idea" isn't an assertion that can hold without 
> > reasons, assumptions, and data.
> > If the size turns out to be, say, 120MB for LLVM, and we want to scale to 
> > 10x that, and we're spending 500MB/file for ASTs, then it might well be a 
> > good trade.
> > The YAML dataset is not a good proxy for how big the data is (at least 
> > without an effort to estimate constant factor).
> 
> Indeed. I'll try to come up with more realistic numbers. There are other 
> things not accounted for in the 16 bytes mentioned above, like storing roles 
> and relations.
> 
> > 500MB/file for ASTs
> 
> What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?
> What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?

Oh I see, the AST must be in memory for fast reparse. I just tried opening 3 
files at the same time I it was already around 500MB. Hmm, that's a bit 
alarming.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-08 Thread Haojian Wu via Phabricator via cfe-commits
hokein added a comment.

Reorganizing the source files made all the comments invalid in the latest 
version :(. Feel free to comment on the old version or the new version.




Comment at: clangd/Symbol.h:23
+  // The path of the source file where a symbol occurs.
+  std::string FilePath;
+  // The offset to the first character of the symbol from the beginning of the

sammccall wrote:
> hokein wrote:
> > malaperle wrote:
> > > malaperle wrote:
> > > > sammccall wrote:
> > > > > ioeric wrote:
> > > > > > Is this relative or absolute?
> > > > > Having every symbol own a copy of the filepath seems wasteful.
> > > > > It seems likely that the index will have per-file information too, so 
> > > > > this representation should likely be a key to that. Hash of the 
> > > > > filepath might work?
> > > > How we model it is that a symbol doesn't have a "location", but its 
> > > > occurrence do. One could consider the location of a symbol to be either 
> > > > its declaration occurrence (SymbolRole::Declaration) or its definition 
> > > > (SymbolRole::Definition).
> > > > What we do to get the location path is each occurrence has a pointer (a 
> > > > "database" pointer, but it doesn't matter) to a file entry and then we 
> > > > get the path from the entry.
> > > > 
> > > > So conceptually, it works a bit like this (although it fetches 
> > > > information on disk).
> > > > ```
> > > > class IndexOccurrence {
> > > > IndexOccurrence *FilePtr;
> > > > 
> > > > std::string Occurrence::getPath() {
> > > >   return FilePtr->getPath();
> > > > }
> > > > };
> > > > ```
> > > Oops, wrong type for the field, it should have been:
> > > ```
> > > class IndexOccurrence {
> > > IndexFile *FilePtr;
> > > 
> > > std::string Occurrence::getPath() {
> > >   return FilePtr->getPath();
> > > }
> > > };
> > > ```
> > > Is this relative or absolute?
> > 
> >  Whether the file path is relative or absolute depends on the build system, 
> > the file path could be relative (for header files) or absolute (for .cc 
> > files).
> > 
> > > How we model it is that a symbol doesn't have a "location", but its 
> > > occurrence do.
> > 
> > We will also have a SymbolOccurence structure alongside with Symbol (but it 
> > is not  in this patch). The "Location" will be a part of SymbolOccurrence.
> >  
> > Whether the file path is relative or absolute depends on the build system, 
> > the file path could be relative (for header files) or absolute (for .cc 
> > files).
> 
> I'm not convinced this actually works. There's multiple codepaths to the 
> index, how can we ensure we don't end up using inconsistent paths?
> e.g. we open up a project that includes a system header using a relative 
> path, and then open up that system header from file->open (editor knows only 
> the absolute path) and do "find references".
> 
> I think we need to canonicalize the paths. Absolute is probably easiest.
Absolute path for .cc file is fine, I was a bit concerned about the path for .h 
file, especially we might use it in `#include`, but we can figure out later.  
Changed to absolute file path.



Comment at: clangd/Symbol.h:51
+  //   * For classes, the canonical location is where they are defined.
+  SymbolLocation CanonicalLocation;
+  // Information for code completion.

sammccall wrote:
> hokein wrote:
> > ioeric wrote:
> > > For functions and classes, should we store both declaration and 
> > > definition locations?
> > I think the symbol occurrences would include both declaration and 
> > definition locations. 
> > 
> > The `CanonicalLocation` is providing a fast/convenient way to find the most 
> > interested location of the symbol (e.g. for code navigation, or include the 
> > missing path for a symbol), without symbol occurrences. 
> I'd be +1 on including both a definition location (if known) and a preferred 
> declaration location, because there's enough use cases that might want 
> definition even if it's not the preferred declaration.
> 
> But i'm fine if we want to omit the separate definition for now. In that 
> case, call this CanonicalDeclaration?
OK, changed it to `CanonicalDeclarationLoc`, and added a FIXME for the 
definition.



Comment at: clangd/Symbol.h:68
+// changed.
+class SymbolCollector : public index::IndexDataConsumer {
+public:

sammccall wrote:
> hokein wrote:
> > sammccall wrote:
> > > Please pull this into a separate file. Someone providing e.g. symbols 
> > > from a YAML file shouldn't need to pull in AST stuff.
> > > Mabye `IndexFromAST`, which would sort nicely next to `Index`?
> > I can see various meaning of "Index" here:
> > 
> > 1. `Index` in `index::IndexDataConsumer`, which collects and contructs all 
> > symbols by traversing the AST.
> > 2. `Index` term using in clangd, specially for build index on top of these 
> > collected symbols.
> > 
> > I think we should be consistent the index for 2), and `SymbolCollector` is 
> > more 

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-08 Thread Haojian Wu via Phabricator via cfe-commits
hokein updated this revision to Diff 126156.
hokein marked 6 inline comments as done.
hokein added a comment.

- reorganize files, move to index subdirectory.
- change symbol ID to a hash value, instead of couple with USR.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897

Files:
  clangd/CMakeLists.txt
  clangd/index/CMakeLists.txt
  clangd/index/Index.cpp
  clangd/index/Index.h
  clangd/index/SymbolCollector.cpp
  clangd/index/SymbolCollector.h
  unittests/clangd/CMakeLists.txt
  unittests/clangd/SymbolCollectorTests.cpp

Index: unittests/clangd/SymbolCollectorTests.cpp
===
--- /dev/null
+++ unittests/clangd/SymbolCollectorTests.cpp
@@ -0,0 +1,112 @@
+//===-- SymbolCollectorTests.cpp  ---*- C++ -*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===--===//
+
+#include "index/SymbolCollector.h"
+#include "clang/Index/IndexingAction.h"
+#include "clang/Basic/FileManager.h"
+#include "clang/Basic/FileSystemOptions.h"
+#include "clang/Basic/VirtualFileSystem.h"
+#include "clang/Frontend/CompilerInstance.h"
+#include "clang/Tooling/Tooling.h"
+#include "llvm/ADT/IntrusiveRefCntPtr.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/MemoryBuffer.h"
+#include "gtest/gtest.h"
+#include "gmock/gmock.h"
+
+#include 
+#include 
+
+using testing::UnorderedElementsAre;
+using testing::Eq;
+using testing::Field;
+
+// GMock helpers for matching Symbol.
+MATCHER_P(QName, Name, "") { return arg.QualifiedName == Name; }
+
+namespace clang {
+namespace clangd {
+
+namespace {
+class SymbolIndexActionFactory : public tooling::FrontendActionFactory {
+ public:
+  SymbolIndexActionFactory() = default;
+
+  clang::FrontendAction *create() override {
+index::IndexingOptions IndexOpts;
+IndexOpts.SystemSymbolFilter =
+index::IndexingOptions::SystemSymbolFilterKind::All;
+IndexOpts.IndexFunctionLocals = false;
+Collector = std::make_shared();
+FrontendAction *Action =
+index::createIndexingAction(Collector, IndexOpts, nullptr).release();
+return Action;
+  }
+
+  std::shared_ptr Collector;
+};
+
+class SymbolCollectorTest : public ::testing::Test {
+public:
+  bool runSymbolCollector(StringRef HeaderCode, StringRef MainCode) {
+llvm::IntrusiveRefCntPtr InMemoryFileSystem(
+new vfs::InMemoryFileSystem);
+llvm::IntrusiveRefCntPtr Files(
+new FileManager(FileSystemOptions(), InMemoryFileSystem));
+
+const std::string FileName = "symbol.cc";
+const std::string HeaderName = "symbols.h";
+auto Factory = llvm::make_unique();
+
+tooling::ToolInvocation Invocation(
+{"symbol_collector", "-fsyntax-only", "-std=c++11", FileName},
+Factory->create(), Files.get(),
+std::make_shared());
+
+InMemoryFileSystem->addFile(HeaderName, 0,
+llvm::MemoryBuffer::getMemBuffer(HeaderCode));
+
+std::string Content = "#include\"" + std::string(HeaderName) + "\"";
+Content += "\n" + MainCode.str();
+InMemoryFileSystem->addFile(FileName, 0,
+llvm::MemoryBuffer::getMemBuffer(Content));
+Invocation.run();
+Symbols = Factory->Collector->takeSymbols();
+
+EXPECT_EQ(FileName, Factory->Collector->getFilename());
+return true;
+  }
+
+protected:
+  SymbolSlab Symbols;
+};
+
+TEST_F(SymbolCollectorTest, CollectSymbol) {
+  const std::string Header = R"(
+class Foo {
+  void f();
+};
+void f1();
+inline void f2() {}
+  )";
+  const std::string Main = R"(
+namespace {
+void ff() {} // ignore
+}
+void f1() {}
+  )";
+  runSymbolCollector(Header, Main);
+  EXPECT_THAT(Symbols, UnorderedElementsAre(QName("Foo"), QName("Foo::f"),
+QName("f1"), QName("f2")));
+}
+
+} // namespace
+} // namespace clangd
+} // namespace clang
Index: unittests/clangd/CMakeLists.txt
===
--- unittests/clangd/CMakeLists.txt
+++ unittests/clangd/CMakeLists.txt
@@ -15,12 +15,14 @@
   JSONExprTests.cpp
   TestFS.cpp
   TraceTests.cpp
+  SymbolCollectorTests.cpp
   )
 
 target_link_libraries(ClangdTests
   PRIVATE
   clangBasic
   clangDaemon
+  clangdIndex
   clangFormat
   clangFrontend
   clangSema
Index: clangd/index/SymbolCollector.h
===
--- /dev/null
+++ clangd/index/SymbolCollector.h
@@ -0,0 +1,50 @@
+//===--- SymbolCollector.h ---*- C++-*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. 

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-08 Thread Marc-Andre Laperle via Phabricator via cfe-commits
malaperle added a comment.

(From https://reviews.llvm.org/D40548) Here's the interface for querying the 
index that I am using right now. It's meant to be able to retrieve from any 
kind of "backend", i.e. in-memory, ClangdIndexDataStore, libIndexStore, etc. I 
was able to implement "Open Workspace Symbol" (which is close to code 
completion in concept), Find References and Find Definitions.

  using USR = llvm::SmallString<256>;
  
  class ClangdIndexDataOccurrence;
  
  class ClangdIndexDataSymbol {
  public:
virtual index::SymbolKind getKind() = 0;
/// For example, for mynamespace::myclass::mymethod, this will be
/// mymethod.
virtual std::string getName() = 0;
/// For example, for mynamespace::myclass::mymethod, this will be
/// mynamespace::myclass::
virtual std::string getQualifier() = 0;
virtual std::string getUsr() = 0;
  
virtual void foreachOccurrence(index::SymbolRoleSet Roles, 
llvm::function_ref Receiver) = 0;
  
virtual ~ClangdIndexDataSymbol() = default;
  };
  
  class ClangdIndexDataOccurrence {
  public:
enum class OccurrenceType : uint16_t {
   OCCURRENCE,
   DEFINITION_OCCURRENCE
 };
  
virtual OccurrenceType getKind() const = 0;
virtual std::string getPath() = 0;
/// Get the start offset of the symbol occurrence. The SourceManager can be
/// used for implementations that need to convert from a line/column
/// representation to an offset.
virtual uint32_t getStartOffset(SourceManager ) = 0;
/// Get the end offset of the symbol occurrence. The SourceManager can be
/// used for implementations that need to convert from a line/column
/// representation to an offset.
virtual uint32_t getEndOffset(SourceManager ) = 0;
virtual ~ClangdIndexDataOccurrence() = default;
//TODO: Add relations
  
static bool classof(const ClangdIndexDataOccurrence *O) { return 
O->getKind() == OccurrenceType::OCCURRENCE; }
  };
  
  /// An occurrence that also has definition with a body that requires 
additional
  /// locations to keep track of the beginning and end of the body.
  class ClangdIndexDataDefinitionOccurrence : public ClangdIndexDataOccurrence {
  public:
virtual uint32_t getDefStartOffset(SourceManager ) = 0;
virtual uint32_t getDefEndOffset(SourceManager ) = 0;
  
static bool classof(const ClangdIndexDataOccurrence *O) { return 
O->getKind() == OccurrenceType::DEFINITION_OCCURRENCE; }
  };
  
  class ClangdIndexDataProvider {
  public:
  
virtual void foreachSymbols(StringRef Pattern, 
llvm::function_ref Receiver) = 0;
virtual void foreachSymbols(const USR , 
llvm::function_ref Receiver) = 0;
  
virtual ~ClangdIndexDataProvider() = default;
  };

The "Clangd" prefix adds a bit much of clutter so maybe it should be removed. I 
think the main points are that having generic foreachSymbols/foreachOccurrence 
with callbacks is well suited to implement multiple features with minimal 
copying.




Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

sammccall wrote:
> hokein wrote:
> > malaperle wrote:
> > > sammccall wrote:
> > > > hokein wrote:
> > > > > malaperle wrote:
> > > > > > I think it would be nice to have methods as an interface to get 
> > > > > > this data instead of storing them directly. So that an 
> > > > > > index-on-disk could go fetch the data. Especially the occurrences 
> > > > > > which can take a lot of memory (I'm working on a branch that does 
> > > > > > that). But perhaps defining that interface is not within the scope 
> > > > > > of this patch and could be better discussed in D40548 ?
> > > > > I agree. We can't load all the symbol occurrences into the memory 
> > > > > since they are too large. We need to design interface for the symbol 
> > > > > occurrences. 
> > > > > 
> > > > > We could discuss the interface here, but CodeCompletion is the main 
> > > > > thing which this patch focuses on. 
> > > > > We can't load all the symbol occurrences into the memory since they 
> > > > > are too large
> > > > 
> > > > I've heard this often, but never backed up by data :-)
> > > > 
> > > > Naively an array of references for a symbol could be doc ID + offset + 
> > > > length, let's say 16 bytes.
> > > > 
> > > > If a source file consisted entirely of references to 1-character 
> > > > symbols separated by punctuation (1 reference per 2 bytes) then the 
> > > > total size of these references would be 8x the size of the source file 
> > > > - in practice much less. That's not very big.
> > > > 
> > > > (Maybe there are edge cases with macros/templates, but we can keep them 
> > > > under control)
> > > I'd have to break down how much memory it used by what, I'll come back to 
> > > you on that. Indexing llvm with ClangdIndexDataStorage, 

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-08 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

More comments, but only two major things really:

- I'd like to clearly separate USR from SymbolID (even if you want to keep 
using USRs as their basis for now)
- the file organization (code within files, and names of files) needs some work 
I think

Everything else is details, this looks good




Comment at: clangd/Symbol.h:23
+  // The path of the source file where a symbol occurs.
+  std::string FilePath;
+  // The offset to the first character of the symbol from the beginning of the

hokein wrote:
> malaperle wrote:
> > malaperle wrote:
> > > sammccall wrote:
> > > > ioeric wrote:
> > > > > Is this relative or absolute?
> > > > Having every symbol own a copy of the filepath seems wasteful.
> > > > It seems likely that the index will have per-file information too, so 
> > > > this representation should likely be a key to that. Hash of the 
> > > > filepath might work?
> > > How we model it is that a symbol doesn't have a "location", but its 
> > > occurrence do. One could consider the location of a symbol to be either 
> > > its declaration occurrence (SymbolRole::Declaration) or its definition 
> > > (SymbolRole::Definition).
> > > What we do to get the location path is each occurrence has a pointer (a 
> > > "database" pointer, but it doesn't matter) to a file entry and then we 
> > > get the path from the entry.
> > > 
> > > So conceptually, it works a bit like this (although it fetches 
> > > information on disk).
> > > ```
> > > class IndexOccurrence {
> > > IndexOccurrence *FilePtr;
> > > 
> > > std::string Occurrence::getPath() {
> > >   return FilePtr->getPath();
> > > }
> > > };
> > > ```
> > Oops, wrong type for the field, it should have been:
> > ```
> > class IndexOccurrence {
> > IndexFile *FilePtr;
> > 
> > std::string Occurrence::getPath() {
> >   return FilePtr->getPath();
> > }
> > };
> > ```
> > Is this relative or absolute?
> 
>  Whether the file path is relative or absolute depends on the build system, 
> the file path could be relative (for header files) or absolute (for .cc 
> files).
> 
> > How we model it is that a symbol doesn't have a "location", but its 
> > occurrence do.
> 
> We will also have a SymbolOccurence structure alongside with Symbol (but it 
> is not  in this patch). The "Location" will be a part of SymbolOccurrence.
>  
> Whether the file path is relative or absolute depends on the build system, 
> the file path could be relative (for header files) or absolute (for .cc 
> files).

I'm not convinced this actually works. There's multiple codepaths to the index, 
how can we ensure we don't end up using inconsistent paths?
e.g. we open up a project that includes a system header using a relative path, 
and then open up that system header from file->open (editor knows only the 
absolute path) and do "find references".

I think we need to canonicalize the paths. Absolute is probably easiest.



Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

hokein wrote:
> malaperle wrote:
> > sammccall wrote:
> > > hokein wrote:
> > > > malaperle wrote:
> > > > > I think it would be nice to have methods as an interface to get this 
> > > > > data instead of storing them directly. So that an index-on-disk could 
> > > > > go fetch the data. Especially the occurrences which can take a lot of 
> > > > > memory (I'm working on a branch that does that). But perhaps defining 
> > > > > that interface is not within the scope of this patch and could be 
> > > > > better discussed in D40548 ?
> > > > I agree. We can't load all the symbol occurrences into the memory since 
> > > > they are too large. We need to design interface for the symbol 
> > > > occurrences. 
> > > > 
> > > > We could discuss the interface here, but CodeCompletion is the main 
> > > > thing which this patch focuses on. 
> > > > We can't load all the symbol occurrences into the memory since they are 
> > > > too large
> > > 
> > > I've heard this often, but never backed up by data :-)
> > > 
> > > Naively an array of references for a symbol could be doc ID + offset + 
> > > length, let's say 16 bytes.
> > > 
> > > If a source file consisted entirely of references to 1-character symbols 
> > > separated by punctuation (1 reference per 2 bytes) then the total size of 
> > > these references would be 8x the size of the source file - in practice 
> > > much less. That's not very big.
> > > 
> > > (Maybe there are edge cases with macros/templates, but we can keep them 
> > > under control)
> > I'd have to break down how much memory it used by what, I'll come back to 
> > you on that. Indexing llvm with ClangdIndexDataStorage, which is pretty 
> > packed is about 200MB. That's already a lot considering we want to index 
> > code bases many times bigger. But I'll try to come up with more precise 
> > numbers. I'm open to different 

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-08 Thread Haojian Wu via Phabricator via cfe-commits
hokein added a comment.

Thanks for the useful comments!




Comment at: clangd/Symbol.h:1
+//===--- Symbol.h ---*- C++-*-===//
+//

sammccall wrote:
> sammccall wrote:
> > I think that:
> >  - there's other places in clangd that deal with symbols too, this is 
> > specifically for indexing
> >  - the index interface belongs alongside Symbol
> > I'd suggest calling this Index.h
> I don't think having `Symbol`s be completely self-contained objects and 
> passing them around in standard containers like `set`s will prove to be ideal.
> It means they can't share storage for e.g. location filename, that it's hard 
> for us to arena-allocate them, etc.
> 
> I think we could use the concept of a set of symbols which share a lifetime. 
> An initial version might just be
> 
> class SymbolSlab {
> public:
>   using iterator = DenseSet::iterator;
>   iterator begin();
>   iterator end();
> private:
>   DenseSet Symbols;
> }
> 
> But it's easy to add `StringPool` etc there.
> Then this is the natural unit of granularity of large sets of symbols:  a 
> dynamic index that deals with one file at a time would operate on (Filename, 
> SymbolSlab) pairs. SymbolCollector would return a SymbolSlab, etc.
> 
> Then indexes can be built on top of this using non-owning pointers.
+1 It makes sense. 

For initial version, the `Symbol` structure is still owning its fields naively, 
we could improve it (change to pointer or references) in the future.



Comment at: clangd/Symbol.h:23
+  // The path of the source file where a symbol occurs.
+  std::string FilePath;
+  // The offset to the first character of the symbol from the beginning of the

malaperle wrote:
> malaperle wrote:
> > sammccall wrote:
> > > ioeric wrote:
> > > > Is this relative or absolute?
> > > Having every symbol own a copy of the filepath seems wasteful.
> > > It seems likely that the index will have per-file information too, so 
> > > this representation should likely be a key to that. Hash of the filepath 
> > > might work?
> > How we model it is that a symbol doesn't have a "location", but its 
> > occurrence do. One could consider the location of a symbol to be either its 
> > declaration occurrence (SymbolRole::Declaration) or its definition 
> > (SymbolRole::Definition).
> > What we do to get the location path is each occurrence has a pointer (a 
> > "database" pointer, but it doesn't matter) to a file entry and then we get 
> > the path from the entry.
> > 
> > So conceptually, it works a bit like this (although it fetches information 
> > on disk).
> > ```
> > class IndexOccurrence {
> > IndexOccurrence *FilePtr;
> > 
> > std::string Occurrence::getPath() {
> >   return FilePtr->getPath();
> > }
> > };
> > ```
> Oops, wrong type for the field, it should have been:
> ```
> class IndexOccurrence {
> IndexFile *FilePtr;
> 
> std::string Occurrence::getPath() {
>   return FilePtr->getPath();
> }
> };
> ```
> Is this relative or absolute?

 Whether the file path is relative or absolute depends on the build system, the 
file path could be relative (for header files) or absolute (for .cc files).

> How we model it is that a symbol doesn't have a "location", but its 
> occurrence do.

We will also have a SymbolOccurence structure alongside with Symbol (but it is 
not  in this patch). The "Location" will be a part of SymbolOccurrence.
 



Comment at: clangd/Symbol.h:26
+  // source file.
+  unsigned StartOffset;
+  // The offset to the last character of the symbol from the beginning of the

ioeric wrote:
> 0-based or 1-based?
The offset is equivalent to FileOffset in `SourceLocation`. I can't find any 
document about the FileOffset in LLVM, but it is 0-based.



Comment at: clangd/Symbol.h:32
+
+struct CodeCompletionInfo {
+  // FIXME: add fields here.

sammccall wrote:
> Let's not add this until we know what's in it.
> There's gong to be an overlap between information needed for CC and other use 
> cases, so this structure might not help the user navigate.
Removed it.



Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

malaperle wrote:
> sammccall wrote:
> > hokein wrote:
> > > malaperle wrote:
> > > > I think it would be nice to have methods as an interface to get this 
> > > > data instead of storing them directly. So that an index-on-disk could 
> > > > go fetch the data. Especially the occurrences which can take a lot of 
> > > > memory (I'm working on a branch that does that). But perhaps defining 
> > > > that interface is not within the scope of this patch and could be 
> > > > better discussed in D40548 ?
> > > I agree. We can't load all the symbol occurrences into the memory since 
> > > they are too large. We need to design 

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-08 Thread Haojian Wu via Phabricator via cfe-commits
hokein updated this revision to Diff 126115.
hokein marked 4 inline comments as done.
hokein added a comment.

Address review comments.

- Use SymbolSlab, for allowing future space optimization.
- Fix a Cast issue when building debug unittest.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897

Files:
  clangd/CMakeLists.txt
  clangd/Symbol.cpp
  clangd/Symbol.h
  unittests/clangd/CMakeLists.txt
  unittests/clangd/SymbolCollectorTests.cpp

Index: unittests/clangd/SymbolCollectorTests.cpp
===
--- /dev/null
+++ unittests/clangd/SymbolCollectorTests.cpp
@@ -0,0 +1,112 @@
+//===-- SymbolCollectorTests.cpp  ---*- C++ -*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===--===//
+
+#include "Symbol.h"
+#include "clang/Index/IndexingAction.h"
+#include "clang/Basic/FileManager.h"
+#include "clang/Basic/FileSystemOptions.h"
+#include "clang/Basic/VirtualFileSystem.h"
+#include "clang/Frontend/CompilerInstance.h"
+#include "clang/Tooling/Tooling.h"
+#include "llvm/ADT/IntrusiveRefCntPtr.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/MemoryBuffer.h"
+#include "gtest/gtest.h"
+#include "gmock/gmock.h"
+
+#include 
+#include 
+
+using testing::ElementsAre;
+using testing::Eq;
+using testing::Field;
+
+namespace clang {
+namespace clangd {
+
+namespace {
+class SymbolIndexActionFactory : public tooling::FrontendActionFactory {
+ public:
+  SymbolIndexActionFactory() = default;
+
+  clang::FrontendAction *create() override {
+index::IndexingOptions IndexOpts;
+IndexOpts.SystemSymbolFilter =
+index::IndexingOptions::SystemSymbolFilterKind::All;
+IndexOpts.IndexFunctionLocals = false;
+Collector = std::make_shared();
+FrontendAction *Action =
+index::createIndexingAction(Collector, IndexOpts, nullptr).release();
+return Action;
+  }
+
+  std::shared_ptr Collector;
+};
+
+class SymbolCollectorTest : public ::testing::Test {
+public:
+  bool runSymbolCollector(StringRef HeaderCode, StringRef MainCode) {
+llvm::IntrusiveRefCntPtr InMemoryFileSystem(
+new vfs::InMemoryFileSystem);
+llvm::IntrusiveRefCntPtr Files(
+new FileManager(FileSystemOptions(), InMemoryFileSystem));
+
+const std::string FileName = "symbol.cc";
+const std::string HeaderName = "symbols.h";
+auto Factory = llvm::make_unique();
+
+tooling::ToolInvocation Invocation(
+{"symbol_collector", "-fsyntax-only", "-std=c++11", FileName},
+Factory->create(), Files.get(),
+std::make_shared());
+
+InMemoryFileSystem->addFile(HeaderName, 0,
+llvm::MemoryBuffer::getMemBuffer(HeaderCode));
+
+std::string Content = "#include\"" + std::string(HeaderName) + "\"";
+Content += "\n" + MainCode.str();
+InMemoryFileSystem->addFile(FileName, 0,
+llvm::MemoryBuffer::getMemBuffer(Content));
+Invocation.run();
+Symbols = Factory->Collector->takeSymbols();
+
+EXPECT_EQ(FileName, Factory->Collector->getFilename());
+return true;
+  }
+
+protected:
+  SymbolSlab Symbols;
+};
+
+TEST_F(SymbolCollectorTest, CollectSymbol) {
+  const std::string Header = R"(
+class Foo {
+  void f();
+};
+void f1();
+inline void f2() {}
+  )";
+  const std::string Main = R"(
+namespace {
+void ff() {} // ignore
+}
+void f1() {}
+  )";
+  runSymbolCollector(Header, Main);
+  EXPECT_THAT(Symbols,
+  UnorderedElementsAre(Field(::QualifiedName, Eq("Foo")),
+   Field(::QualifiedName, Eq("Foo::f")),
+   Field(::QualifiedName, Eq("f1")),
+   Field(::QualifiedName, Eq("f2";
+}
+
+} // namespace
+} // namespace clangd
+} // namespace clang
Index: unittests/clangd/CMakeLists.txt
===
--- unittests/clangd/CMakeLists.txt
+++ unittests/clangd/CMakeLists.txt
@@ -15,6 +15,7 @@
   JSONExprTests.cpp
   TestFS.cpp
   TraceTests.cpp
+  SymbolCollectorTests.cpp
   )
 
 target_link_libraries(ClangdTests
Index: clangd/Symbol.h
===
--- /dev/null
+++ clangd/Symbol.h
@@ -0,0 +1,133 @@
+//===--- Symbol.h ---*- C++-*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===-===//
+
+#ifndef LLVM_CLANG_TOOLS_EXTRA_CLANGD_SYMBOL_H
+#define 

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-07 Thread Marc-Andre Laperle via Phabricator via cfe-commits
malaperle added inline comments.



Comment at: clangd/Symbol.h:23
+  // The path of the source file where a symbol occurs.
+  std::string FilePath;
+  // The offset to the first character of the symbol from the beginning of the

malaperle wrote:
> sammccall wrote:
> > ioeric wrote:
> > > Is this relative or absolute?
> > Having every symbol own a copy of the filepath seems wasteful.
> > It seems likely that the index will have per-file information too, so this 
> > representation should likely be a key to that. Hash of the filepath might 
> > work?
> How we model it is that a symbol doesn't have a "location", but its 
> occurrence do. One could consider the location of a symbol to be either its 
> declaration occurrence (SymbolRole::Declaration) or its definition 
> (SymbolRole::Definition).
> What we do to get the location path is each occurrence has a pointer (a 
> "database" pointer, but it doesn't matter) to a file entry and then we get 
> the path from the entry.
> 
> So conceptually, it works a bit like this (although it fetches information on 
> disk).
> ```
> class IndexOccurrence {
> IndexOccurrence *FilePtr;
> 
> std::string Occurrence::getPath() {
>   return FilePtr->getPath();
> }
> };
> ```
Oops, wrong type for the field, it should have been:
```
class IndexOccurrence {
IndexFile *FilePtr;

std::string Occurrence::getPath() {
  return FilePtr->getPath();
}
};
```


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-07 Thread Marc-Andre Laperle via Phabricator via cfe-commits
malaperle added inline comments.



Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

sammccall wrote:
> hokein wrote:
> > malaperle wrote:
> > > I think it would be nice to have methods as an interface to get this data 
> > > instead of storing them directly. So that an index-on-disk could go fetch 
> > > the data. Especially the occurrences which can take a lot of memory (I'm 
> > > working on a branch that does that). But perhaps defining that interface 
> > > is not within the scope of this patch and could be better discussed in 
> > > D40548 ?
> > I agree. We can't load all the symbol occurrences into the memory since 
> > they are too large. We need to design interface for the symbol occurrences. 
> > 
> > We could discuss the interface here, but CodeCompletion is the main thing 
> > which this patch focuses on. 
> > We can't load all the symbol occurrences into the memory since they are too 
> > large
> 
> I've heard this often, but never backed up by data :-)
> 
> Naively an array of references for a symbol could be doc ID + offset + 
> length, let's say 16 bytes.
> 
> If a source file consisted entirely of references to 1-character symbols 
> separated by punctuation (1 reference per 2 bytes) then the total size of 
> these references would be 8x the size of the source file - in practice much 
> less. That's not very big.
> 
> (Maybe there are edge cases with macros/templates, but we can keep them under 
> control)
I'd have to break down how much memory it used by what, I'll come back to you 
on that. Indexing llvm with ClangdIndexDataStorage, which is pretty packed is 
about 200MB. That's already a lot considering we want to index code bases many 
times bigger. But I'll try to come up with more precise numbers. I'm open to 
different strategies.



Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-07 Thread Marc-Andre Laperle via Phabricator via cfe-commits
malaperle added a comment.

In https://reviews.llvm.org/D40897#946911, @hokein wrote:

> Our rough plan would be
>
> 1. Define the Symbol structure.
> 2. Design the interfaces of SymbolIndex, ASTIndex.
> 3. Combine 1) and 2) together to make global code completion work (we'd use 
> YAML dataset for LLVM project, note that this is not a final solution, it 
> would be hidden in an `--experimental` flag).
> 4. Switch to use the dataset from index-while-building when it is ready.


Thanks for the explanation. On my end, the plan is not quite sequential. The 
branch I am developing has interfaces for querying, building and a dataset 
format, let's call it ClangdIndexDataStorage, which is different from 
index-while-building (libIndexStore). I also have a version that uses 
libIndexStore through the same interfaces. So with that in mind, there are too 
main activities:

1. Work towards the interfaces for using the index (this patch, and Eric's). 
From my perspective, I will make sure that it can be as compatible as possible 
with reading the index from disk and the features we want to develop. One 
important aspect is to have a good balance between memory and performance. In 
Eclipse CDT and also the branch I work on using ClangdIndexDataStorage, the 
emphasis was to minimize memory consumption and have a configurable cache size. 
But different choices could be made here, perhaps I can start a discussion 
about that separately.
2. Work on index-while-building or the other format getting adopted in Clangd. 
The index-while-building (libIndexStore) is promising but also has a few 
missing pieces. We need a mapping solution (LMDB equivalent). We also need to 
make sure it's fast enough and contain enough information for the features we 
will need, etc.




Comment at: clangd/Symbol.h:23
+  // The path of the source file where a symbol occurs.
+  std::string FilePath;
+  // The offset to the first character of the symbol from the beginning of the

sammccall wrote:
> ioeric wrote:
> > Is this relative or absolute?
> Having every symbol own a copy of the filepath seems wasteful.
> It seems likely that the index will have per-file information too, so this 
> representation should likely be a key to that. Hash of the filepath might 
> work?
How we model it is that a symbol doesn't have a "location", but its occurrence 
do. One could consider the location of a symbol to be either its declaration 
occurrence (SymbolRole::Declaration) or its definition (SymbolRole::Definition).
What we do to get the location path is each occurrence has a pointer (a 
"database" pointer, but it doesn't matter) to a file entry and then we get the 
path from the entry.

So conceptually, it works a bit like this (although it fetches information on 
disk).
```
class IndexOccurrence {
IndexOccurrence *FilePtr;

std::string Occurrence::getPath() {
  return FilePtr->getPath();
}
};
```


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-07 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

Thanks for putting this together! Have a bit of a braindump here, happy to 
discuss further either here or offline.




Comment at: clangd/Symbol.h:1
+//===--- Symbol.h ---*- C++-*-===//
+//

I think that:
 - there's other places in clangd that deal with symbols too, this is 
specifically for indexing
 - the index interface belongs alongside Symbol
I'd suggest calling this Index.h



Comment at: clangd/Symbol.h:1
+//===--- Symbol.h ---*- C++-*-===//
+//

sammccall wrote:
> I think that:
>  - there's other places in clangd that deal with symbols too, this is 
> specifically for indexing
>  - the index interface belongs alongside Symbol
> I'd suggest calling this Index.h
I don't think having `Symbol`s be completely self-contained objects and passing 
them around in standard containers like `set`s will prove to be ideal.
It means they can't share storage for e.g. location filename, that it's hard 
for us to arena-allocate them, etc.

I think we could use the concept of a set of symbols which share a lifetime. An 
initial version might just be

class SymbolSlab {
public:
  using iterator = DenseSet::iterator;
  iterator begin();
  iterator end();
private:
  DenseSet Symbols;
}

But it's easy to add `StringPool` etc there.
Then this is the natural unit of granularity of large sets of symbols:  a 
dynamic index that deals with one file at a time would operate on (Filename, 
SymbolSlab) pairs. SymbolCollector would return a SymbolSlab, etc.

Then indexes can be built on top of this using non-owning pointers.



Comment at: clangd/Symbol.h:23
+  // The path of the source file where a symbol occurs.
+  std::string FilePath;
+  // The offset to the first character of the symbol from the beginning of the

ioeric wrote:
> Is this relative or absolute?
Having every symbol own a copy of the filepath seems wasteful.
It seems likely that the index will have per-file information too, so this 
representation should likely be a key to that. Hash of the filepath might work?



Comment at: clangd/Symbol.h:32
+
+struct CodeCompletionInfo {
+  // FIXME: add fields here.

Let's not add this until we know what's in it.
There's gong to be an overlap between information needed for CC and other use 
cases, so this structure might not help the user navigate.



Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

hokein wrote:
> malaperle wrote:
> > I think it would be nice to have methods as an interface to get this data 
> > instead of storing them directly. So that an index-on-disk could go fetch 
> > the data. Especially the occurrences which can take a lot of memory (I'm 
> > working on a branch that does that). But perhaps defining that interface is 
> > not within the scope of this patch and could be better discussed in D40548 ?
> I agree. We can't load all the symbol occurrences into the memory since they 
> are too large. We need to design interface for the symbol occurrences. 
> 
> We could discuss the interface here, but CodeCompletion is the main thing 
> which this patch focuses on. 
> We can't load all the symbol occurrences into the memory since they are too 
> large

I've heard this often, but never backed up by data :-)

Naively an array of references for a symbol could be doc ID + offset + length, 
let's say 16 bytes.

If a source file consisted entirely of references to 1-character symbols 
separated by punctuation (1 reference per 2 bytes) then the total size of these 
references would be 8x the size of the source file - in practice much less. 
That's not very big.

(Maybe there are edge cases with macros/templates, but we can keep them under 
control)



Comment at: clangd/Symbol.h:39
+  // The symbol identifier, using USR.
+  std::string Identifier;
+  // The qualified name of the symbol, e.g. Foo::bar.

ioeric wrote:
> It might make sense to just call this `USR` to be more explicit.
USRs are large. Can we use a fixed-size hash?



Comment at: clangd/Symbol.h:55
+
+  bool operator<(const Symbol& S) const {
+return Identifier < S.Identifier;

I'd suggest == and a hash function instead, unless we think this ordering is 
particularly meaningful?



Comment at: clangd/Symbol.h:68
+// changed.
+class SymbolCollector : public index::IndexDataConsumer {
+public:

Please pull this into a separate file. Someone providing e.g. symbols from a 
YAML file shouldn't need to pull in AST stuff.
Mabye `IndexFromAST`, which would sort nicely next to `Index`?



Comment at: clangd/Symbol.h:72
+
+  const std::set () const 

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-07 Thread Eric Liu via Phabricator via cfe-commits
ioeric added inline comments.



Comment at: clangd/Symbol.h:23
+  // The path of the source file where a symbol occurs.
+  std::string FilePath;
+  // The offset to the first character of the symbol from the beginning of the

Is this relative or absolute?



Comment at: clangd/Symbol.h:26
+  // source file.
+  unsigned StartOffset;
+  // The offset to the last character of the symbol from the beginning of the

0-based or 1-based?



Comment at: clangd/Symbol.h:39
+  // The symbol identifier, using USR.
+  std::string Identifier;
+  // The qualified name of the symbol, e.g. Foo::bar.

It might make sense to just call this `USR` to be more explicit.



Comment at: clangd/Symbol.h:51
+  //   * For classes, the canonical location is where they are defined.
+  SymbolLocation CanonicalLocation;
+  // Information for code completion.

For functions and classes, should we store both declaration and definition 
locations?


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-06 Thread Haojian Wu via Phabricator via cfe-commits
hokein added a comment.

In https://reviews.llvm.org/D40897#946708, @malaperle wrote:

> Hi! Have you looked into https://reviews.llvm.org/D40548 ? Maybe we need to 
> coordinate the two a bit.


Hi Marc! Thanks for the input!

Yeah, Eric and I are working closely on a prototype of global code completion. 
We have implemented the initial version (see github 
),
 and the prototype works well for LLVM project (even with a simple 
implementation), so we plan to split the patch, improve the code, and 
contribute it back to clangd repo incrementally.

For the prototype, we will load all symbols (without occurrences) into the 
memory, and build an in-memory index. From our experiment, the dataset of LLVM 
project in YAML format is ~120MB (~38,000 symbols), which is acceptable in 
clangd.

Our rough plan would be

1. Define the Symbol structure.
2. Design the interfaces of SymbolIndex, ASTIndex.
3. Combine 1) and 2) together to make global code completion work (we'd use 
YAML dataset for LLVM project, note that this is not a final solution, it would 
be hidden in an `--experimental` flag).
4. Switch to use the dataset from index-while-building when it is ready.




Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

malaperle wrote:
> I think it would be nice to have methods as an interface to get this data 
> instead of storing them directly. So that an index-on-disk could go fetch the 
> data. Especially the occurrences which can take a lot of memory (I'm working 
> on a branch that does that). But perhaps defining that interface is not 
> within the scope of this patch and could be better discussed in D40548 ?
I agree. We can't load all the symbol occurrences into the memory since they 
are too large. We need to design interface for the symbol occurrences. 

We could discuss the interface here, but CodeCompletion is the main thing which 
this patch focuses on. 


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-06 Thread Marc-Andre Laperle via Phabricator via cfe-commits
malaperle added a comment.

Hi! Have you looked into https://reviews.llvm.org/D40548 ? Maybe we need to 
coordinate the two a bit.




Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.

I think it would be nice to have methods as an interface to get this data 
instead of storing them directly. So that an index-on-disk could go fetch the 
data. Especially the occurrences which can take a lot of memory (I'm working on 
a branch that does that). But perhaps defining that interface is not within the 
scope of this patch and could be better discussed in D40548 ?


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D40897: [clangd] Introduce a "Symbol" class.

2017-12-06 Thread Haojian Wu via Phabricator via cfe-commits
hokein created this revision.
Herald added subscribers: mgorny, klimek.

- The "Symbol" class represents a C++ symbol in the codebase, containing all 
the information of a C++ symbol needed by clangd. clangd will use it in 
clangd's AST/dynamic index and global/static index (code completion and code 
navigation).
- The SymbolCollector (another IndexAction) will be used to recollect the 
symbols when the source file is changed (for ASTIndex), or to generate all C++ 
symbols for the whole project.

In the long term (when index-while-building is ready), clangd should share a
same "Symbol" structure and IndexAction with index-while-building, but
for now we want to have some stuff working in clangd.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897

Files:
  clangd/CMakeLists.txt
  clangd/Symbol.cpp
  clangd/Symbol.h
  unittests/clangd/CMakeLists.txt
  unittests/clangd/SymbolCollectorTests.cpp

Index: unittests/clangd/SymbolCollectorTests.cpp
===
--- /dev/null
+++ unittests/clangd/SymbolCollectorTests.cpp
@@ -0,0 +1,110 @@
+//===-- SymbolCollectorTests.cpp  ---*- C++ -*-===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===--===//
+
+#include "Symbol.h"
+#include "clang/Index/IndexingAction.h"
+#include "clang/Basic/FileManager.h"
+#include "clang/Basic/FileSystemOptions.h"
+#include "clang/Basic/VirtualFileSystem.h"
+#include "clang/Frontend/CompilerInstance.h"
+#include "clang/Tooling/Tooling.h"
+#include "llvm/ADT/IntrusiveRefCntPtr.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/MemoryBuffer.h"
+#include "gtest/gtest.h"
+#include "gmock/gmock.h"
+
+#include 
+#include 
+
+using testing::ElementsAre;
+using testing::Eq;
+using testing::Field;
+
+namespace clang {
+namespace clangd {
+
+namespace {
+class SymbolIndexActionFactory : public tooling::FrontendActionFactory {
+ public:
+  SymbolIndexActionFactory() = default;
+
+  clang::ASTFrontendAction *create() override {
+index::IndexingOptions IndexOpts;
+IndexOpts.SystemSymbolFilter =
+index::IndexingOptions::SystemSymbolFilterKind::All;
+IndexOpts.IndexFunctionLocals = false;
+Collector = std::make_shared();
+FrontendAction *Action =
+index::createIndexingAction(Collector, IndexOpts, nullptr).release();
+return llvm::cast(Action);
+  }
+
+  std::shared_ptr Collector;
+};
+
+class SymbolCollectorTest : public ::testing::Test {
+public:
+  bool runSymbolCollector(StringRef HeaderCode, StringRef MainCode) {
+llvm::IntrusiveRefCntPtr InMemoryFileSystem(
+new vfs::InMemoryFileSystem);
+llvm::IntrusiveRefCntPtr Files(
+new FileManager(FileSystemOptions(), InMemoryFileSystem));
+
+const std::string FileName = "symbol.cc";
+const std::string HeaderName = "symbols.h";
+auto Factory = llvm::make_unique();
+
+tooling::ToolInvocation Invocation(
+{"symbol_collector", "-fsyntax-only", "-std=c++11", FileName},
+Factory->create(), Files.get(),
+std::make_shared());
+
+InMemoryFileSystem->addFile(HeaderName, 0,
+llvm::MemoryBuffer::getMemBuffer(HeaderCode));
+
+std::string Content = "#include\"" + std::string(HeaderName) + "\"";
+Content += "\n" + MainCode.str();
+InMemoryFileSystem->addFile(FileName, 0,
+llvm::MemoryBuffer::getMemBuffer(Content));
+Invocation.run();
+Symbols = Factory->Collector->getSymbols();
+return true;
+  }
+
+protected:
+  std::set Symbols;
+};
+
+TEST_F(SymbolCollectorTest, CollectSymbol) {
+  const std::string Header = R"(
+class Foo {
+ void f();
+};
+void f1();
+inline void f2() {}
+  )";
+  const std::string Main = R"(
+namespace {
+void ff() {} // ignore
+}
+void f1() {}
+  )";
+  runSymbolCollector(Header, Main);
+  EXPECT_THAT(Symbols,
+  UnorderedElementsAre(Field(::QualifiedName, Eq("Foo")),
+   Field(::QualifiedName, Eq("Foo::f")),
+   Field(::QualifiedName, Eq("f1")),
+   Field(::QualifiedName, Eq("f2";
+}
+
+} // namespace
+} // namespace clangd
+} // namespace clang
Index: unittests/clangd/CMakeLists.txt
===
--- unittests/clangd/CMakeLists.txt
+++ unittests/clangd/CMakeLists.txt
@@ -15,6 +15,7 @@
   JSONExprTests.cpp
   TestFS.cpp
   TraceTests.cpp
+  SymbolCollectorTests.cpp
   )
 
 target_link_libraries(ClangdTests
Index: clangd/Symbol.h
===
--- /dev/null
+++ clangd/Symbol.h
@@ -0,0 +1,87 @@
+//===---