[PATCH] D121150: [pseudo][WIP] Implement a GLR parser.

Sam McCall via Phabricator via cfe-commits Tue, 08 Mar 2022 08:03:08 -0800

sammccall added inline comments.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:30
+// A node in a forest.
+class ForestNode {
+public:
----------------
sammccall wrote:
> I wonder if we should put these in a namespace `forest::Node` rather than 
> giving them the "Forest" prefix?
alignas(ForestNode*)


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:67
+  std::string Dump(const Grammar &) const;
+  std::string DumpRecursive(const Grammar &, bool abbreviated = false) const;
+
----------------
dump(), dumpRecursive()


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:78
+  }
+  static uint16_t SequenceData(RuleID rule,
+                               llvm::ArrayRef<const ForestNode *> elements) {
----------------
Rule, Elements


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:90
+  // Terminal - unused
+  uint16_t Data_;
+  // A trailing array of Node* .
----------------
remove trailing _


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:91
+  uint16_t Data_;
+  // A trailing array of Node* .
+};
----------------
ForestNode*


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:100
+public:
+  size_t nodeNum() const { return NodeNum; }
+  size_t bytes() const { return Arena.getBytesAllocated() + sizeof(this); }
----------------
count()?


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:40
+
+// An implementation of a directed acyclic graph (DAG), used as a
+// graph-structured stack (GSS) in the GLR parser.
----------------
the fact that this uses a DAG seems less important than what it represents. I 
think the next sentence is a better introduction, and then we should describe 
what data the stacks are modelling, finally we should describe the data 
structure after (see comment below).


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:43
+//
+// GSS is an efficient data structure to represent multiple active stacks, it
+// employs a stack-combination optimization to avoid potentially exponential
----------------
as with forest, I think this places too much emphasis on the memory-compactness 
when it's not the main benefit. (If RAM were free we still wouldn't want a big 
array of stacks).


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:43
+//
+// GSS is an efficient data structure to represent multiple active stacks, it
+// employs a stack-combination optimization to avoid potentially exponential
----------------
sammccall wrote:
> as with forest, I think this places too much emphasis on the 
> memory-compactness when it's not the main benefit. (If RAM were free we still 
> wouldn't want a big array of stacks).
To address the two above comments, maybe something like.

```
A Graph-Structured Stack represents the multiple parse stacks for a generalized 
LR parser.

Each node stores a parse state, the last parsed ForestNode, and its parent(s).
There may be several heads (top of stack), and the parser operates by:
 - shift: pushing terminal symbols on top of the stack
 - reduce: replace N symbols on top of the stack with one nonterminal

The structure is a DAG rather than a stack:
- GLR allows multiple actions on the same head, producing forks (nodes with the 
same parent).
- The parser reconciles nodes with the same (state, ForestNode), producing 
joins (node with multiple parents).

The parser is responsible for creating nodes, and keeping track of the set of 
heads.
```


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:46
+// growth of the stack:
+//  - combing equal stack prefixes -- A new stack doesn't need to have a full
+//    copy of its parent’s stack. They share a common prefix.
----------------
combing -> combining

I wonder if combining is the right explanation here - unlike local ambiguity 
packing, it's not like we produce two things and then merge them.

What about rather: `sharing stack prefixes: when the parser must take two 
conflicting paths, it creates two new stack head nodes with the same parent.`


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:48
+//    copy of its parent’s stack. They share a common prefix.
+//  - combing euqal stack suffices -- as there are a finite number of DFA's
+//    state the parser can be in. A set of heads can be in the same state
----------------
combing -> combining
euqal -> equal


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:48
+//    copy of its parent’s stack. They share a common prefix.
+//  - combing euqal stack suffices -- as there are a finite number of DFA's
+//    state the parser can be in. A set of heads can be in the same state
----------------
sammccall wrote:
> combing -> combining
> euqal -> equal
the "-- as..." clause is a bit confusing and doesn't seem necessary.

I think what's missing here is a high level explanation of what's going on in 
the parse to trigger this, and explicitly mentioning this is how we get a node 
with multiple parents.

When alternative parses converge on the same interpretation of later tokens, 
their stack heads end up  in the same state. These are merged, resulting in a 
single head with several parents.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:53
+//
+// E.g. we have two active stacks:
+//   0 -> 1 -> 2
----------------
this shows forking but not merging
and in particular it suggests that #heads == #stacks, which is not the case


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:54
+// E.g. we have two active stacks:
+//   0 -> 1 -> 2
+//        |    ^ head1, representing a stack [2, 1, 0]
----------------
arrows are pointing the wrong way (at least opposite to our pointers)


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:58
+//             ^ head2, representing a stack [3, 1, 0]
+struct Graph {
+  // Represents a node in the graph.
----------------
the name "Graph" is IMO too general, this isn't a reusable graph or even dag 
class.

I think GSS is fine, it's a distinctive name from the literature, and the 
expansion of the abbreviation is nice and descriptive (but too verbose to be 
the actual name)


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:59
+struct Graph {
+  // Represents a node in the graph.
+  struct Node {
----------------
this comment doesn't say anything, remove?


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:60
+  // Represents a node in the graph.
+  struct Node {
+    // The parsing state presented by the graph node.
----------------
alignas(Node*)

(in practice this will match ForestNode* so is a no-op, but it documents the 
intent)


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:61
+  struct Node {
+    // The parsing state presented by the graph node.
+    LRTable::StateID State : LRTable::StateBits;
----------------
again, drop comment unless there's something to say


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:64
+    static constexpr unsigned PredecessorBits = 3;
+    // Number of the predecessors of the node.
+    // u is the predecessor of v, if u -> v.
----------------
This is not what predecessor usually means in graph theory: u is the 
predecessor of v if there is *some path* from u->v.

I think "parent" is a common and well-understood term.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:68
+    // The forest node for a termina/nonterminal symbol.
+    // The symbol correponds to the label of edges which leads to current node
+    // from the predecessor nodes.
----------------
Also not sure what this comment says:
 - first line just repeats the type: type is a forest node, forest nodes are 
always for symbols, and terminal/nonterminal are all the possibilities
 - second line is either referring to or defining (not sure) edge labels, but 
edge labels aren't defined or referred to anywhere else, so why?

Maybe:
```
The parse tree for the last symbol we parsed.
This symbol appears on the left of the dot in the parse state.
```


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:77
+
+    bool operator==(const Node &L) const {
+      return State == L.State && predecessors() == L.predecessors();
----------------
why is Parsed not part of the identity?
(maybe there's a good reason, but there should be a comment)


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:87
+                      llvm::ArrayRef<const Node *> Predecessors) {
+    assert(Predecessors.size() < (1 << Node::PredecessorBits) &&
+           "Too many predecessors to fit in PredecessorBits!");
----------------
8 parents isn't that many and it seems like this might be a dynamic property of 
the input code rather than a static property of the grammar.

But I don't think this bitpacking is buying anything, it looks like the layout 
is:
 - State : 13
 - PredecessorCount : 3
 - (padding) : 48
 - Parsed : 64

So I think we might as well just use a uint16 PredecessorCount and still have 
room left over for a uint16 refcount later. Is there anything else we might 
usefully store in the extra space?


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:100
+  size_t bytes() const { return Arena.getTotalMemory() + sizeof(*this); }
+  size_t nodeCount() const { return NodeCount; }
+
----------------
name consistently with forest arena


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:112
+
+  const ForestNode *parse(const TokenStream &Code);
+
----------------
as discussed offline, the signature here should involve a token range, or 
likely an ArrayRef<ForestNode> for the relevant terminals.

Probably a FIXME to allow the start symbol to be specified.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:114
+
+  const Graph &getGSS() const { return GSS; }
+
----------------
sammccall wrote:
> this function never exposes an interesting graph, as it's mostly empty before 
> + after parse().
> It's only really useful for working out how much memory is allocated.
> 
> If we were to drop this, the public interface of GLRParser would be a single 
> function (so it need not be a public class) and Graph would disappear 
> entirely!
> 
> With this in mind, it seems like we could replace this header file with 
> something like:
> 
> ```
> struct ParseStats {
>   unsigned GSSBytes = 0;
> };
> ForestNode *glrParse(const TokenStream &, const LRTable&, const Grammar&, 
> ForestArena&, ParseStats *Stats = nullptr);
> ```
> 
> It feels like hiding the GSS might be an obstacle to debugging/testing.
> However this really would need a story/API around how to interrupt the parser.
> LLVM_DEBUG seems reasonable to me for now.
After offline discussion I think we either want:
 - to hide GSS as mentioned above, and just test the forest output
 - expose GSS class, and have a two versions of the parse function: one that 
finalizes the GSS into a result ForestNode, and a "raw" one that just returns 
the GSS. Then we can get at the GSS state at a point by running the raw parser 
on some prefix of the code. (Or even *only* have the raw one and make it the 
caller's responsibility to extract the node from the GSS, but this is probably 
silly)


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:129
+  // Frontier is to track all avaiable actions from all active stack heads.
+  struct Frontier {
+    // A corresponding stack head.
----------------
I love the name frontier!
Unfortunately the word refers to the whole boundary/set, rather than individual 
elements of it.

Maybe this?
```
/// The list of actions we're ready to perform.
struct {
  std::vector<Pending> Shift;
  std::vector<Pending> Reduce;
  std::vector<Pending> Accept;
  bool empty(); // accept is an odd-one-out here, but I think it's going away?
} Frontier;
```

Also maybe a FIXME that the frontier needs to order reductions so that local 
ambiguity packing is guaranteed to work as a single pass


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:133
+    // An action associated with the Head.
+    const LRTable::Action *PerformAction = nullptr;
+  };
----------------
just Action?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Forest.cpp:30
+  }
+  llvm_unreachable("unhandle node kind!");
+}
----------------
unhandled


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:33
+  ParsedForest.init(Code);
+  const Token *Lookahead = &Code.tokens().front();
+  addActions(GSS.addNode(/*StartState*/ 0, nullptr, {}), *Lookahead);
----------------
sammccall wrote:
> It actually currently looks like it would be at least as natural to have the 
> parser operate on the sequence of terminal ForestNodes rather than on the 
> Tokens themselves...
I'd suggest calling this NextTok instead of Lookahead:
 - in common english this is a verb rather than a noun
 - in parsers it more often refers to the *number* of tokens in the tables than 
the tokens themselves
 - referring to the token you're *currently shifting* as "ahead" is bizarre!


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:67
+
+std::vector<const Graph::Node *>
+GLRParser::performShift(Token::Index Lookahead) {
----------------
if this is a hot loop we shouldn't be creating std::vectors and throwing them 
away


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:80
+
+  // Merge the stack -- if multiple stack heads are going to shift a same
+  // state, we perform the shift only once by combining these heads.
----------------
we shift *tokens*, not states: the token conceptually moves from the input to 
the stack. (This seems to be other references not just me!)

if multiple stack heads will reach the same state after shifting a token?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:85
+  // state 2 and state 3:
+  //   0 -> 1 -> 2
+  //        ` -> 3
----------------
I think the arrows directions aren't particularly clear (our pointers go the 
opposite way) or necessary here, and backticks and up arrows are hard to read.

A couple of (widely-supported) box-drawing characters would help a lot I think:

```
0--1--2
   └--3

0--1--2--4
   └--3--┘

```

(I would keep using `-` and `|` rather than making everything pretty boxes, 
it's readable enough and easier to edit. We may want to consider box-drawing 
characters for dump functions though...)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:93
+  // Shifts are partitioned by the shift state, so each partition (per loop
+  // iteration) corresponds to a "perform" shift.
+  llvm::sort(ShiftList, [](const Frontier &L, const Frontier &R) {
----------------
I can't parse `a "perform" shift`.

Batch shifts by target state so we can merge matching groups?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:97
+           R.PerformAction->kind() == LRTable::Action::Shift);
+    return std::forward_as_tuple(L.PerformAction->getShiftState(), L.Head) <
+           std::forward_as_tuple(R.PerformAction->getShiftState(), R.Head);
----------------
I don't think tiebreaking by `Head` achieves anything.

If we really need to ensure deterministic behavior, comparing pointers won't do 
that as allocation may vary.
On the other hand, `stable_sort` should work: we'll tiebreak by order enqueued, 
so we're deterministic if our inputs are.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:108
+    // Predecessors of the new head in GSS.
+    std::vector<const Graph::Node *> Predecessors;
+    llvm::for_each(Batch, [&Predecessors](const Frontier &F) {
----------------
SmallVector


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:110
+    llvm::for_each(Batch, [&Predecessors](const Frontier &F) {
+      assert(llvm::find(Predecessors, F.Head) == Predecessors.end() &&
+             "Unexpected duplicated stack heads during shift!");
----------------
llvm::is_contained


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:114
+    });
+    const auto *Head = GSS.addNode(NextState, Leaf, Predecessors);
+    LLVM_DEBUG(llvm::dbgs()
----------------
if this turns out to be hot, you could avoid the temporary array of 
predecessors: return a mutable node from addNode


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:128
+static std::vector<std::string>
+getStateString(llvm::ArrayRef<const Graph::Node *> A) {
+  std::vector<std::string> States;
----------------
name doesn't match return type


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:131
+  for (const auto &N : A)
+    States.push_back(llvm::formatv("state {0}", N->State));
+  return States;
----------------
I suspect `S{0}` will be verbose enough, it's not likely spelling out `state` 
will make this much less cryptic, but it may be harder to scan


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:141
+  assert(PathStorage.empty() && "PathStorage must be empty!");
+  std::function<void(const Graph::Node *, unsigned)> enumPath =
+      [&CB, &PathStorage, &enumPath](const Graph::Node *Current,
----------------
We're leaning on std::function a lot for code that's supposedly in the hot path.
It's probably fine, but I'd like to see this written more directly.

As far as I can tell, this could easily be a class, with enumerateReducePath a 
(recursive) method that concretely calls into handleReducePath() or something 
at the bottom.

Also if it's a class we can easily reset() instead of destroying it if we want 
to reuse its vectors across separate reductions.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:158
+// heads.
+void GLRParser::performReduction(const Token &Lookahead) {
+  if (!ReduceList.empty())
----------------
(just a note, I need to dig into this part tomorrow!)


================
Comment at: clang/tools/clang-pseudo/ClangPseudo.cpp:33
                             desc("Print the LR table for the grammar"));
+static opt<std::string> ParseFile("parse", desc("Parse a C++ source file"),
+                                  init(""));
----------------
nit: a C++ source file => a source file


================
Comment at: clang/tools/clang-pseudo/ClangPseudo.cpp:33
                             desc("Print the LR table for the grammar"));
+static opt<std::string> ParseFile("parse", desc("Parse a C++ source file"),
+                                  init(""));
----------------
sammccall wrote:
> nit: a C++ source file => a source file
Why is this a new flag independent of `Source`?

It also seems inconsistent with other flags, which describe what output is 
requested rather than what computation should be done.

Maybe "-print-forest" and "-print-stats"? (which could also potentially control 
stats of other stages)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121150/new/

https://reviews.llvm.org/D121150

_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D121150: [pseudo][WIP] Implement a GLR parser.

Reply via email to