sammccall added inline comments.
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:30
+// A node in a forest.
+class ForestNode {
+public:
----------------
sammccall wrote:
> I wonder if we should put these in a namespace `forest::Node` rather than
> giving them the "Forest" prefix?
alignas(ForestNode*)
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:67
+ std::string Dump(const Grammar &) const;
+ std::string DumpRecursive(const Grammar &, bool abbreviated = false) const;
+
----------------
dump(), dumpRecursive()
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:78
+ }
+ static uint16_t SequenceData(RuleID rule,
+ llvm::ArrayRef<const ForestNode *> elements) {
----------------
Rule, Elements
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:90
+ // Terminal - unused
+ uint16_t Data_;
+ // A trailing array of Node* .
----------------
remove trailing _
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:91
+ uint16_t Data_;
+ // A trailing array of Node* .
+};
----------------
ForestNode*
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:100
+public:
+ size_t nodeNum() const { return NodeNum; }
+ size_t bytes() const { return Arena.getBytesAllocated() + sizeof(this); }
----------------
count()?
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:40
+
+// An implementation of a directed acyclic graph (DAG), used as a
+// graph-structured stack (GSS) in the GLR parser.
----------------
the fact that this uses a DAG seems less important than what it represents. I
think the next sentence is a better introduction, and then we should describe
what data the stacks are modelling, finally we should describe the data
structure after (see comment below).
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:43
+//
+// GSS is an efficient data structure to represent multiple active stacks, it
+// employs a stack-combination optimization to avoid potentially exponential
----------------
as with forest, I think this places too much emphasis on the memory-compactness
when it's not the main benefit. (If RAM were free we still wouldn't want a big
array of stacks).
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:43
+//
+// GSS is an efficient data structure to represent multiple active stacks, it
+// employs a stack-combination optimization to avoid potentially exponential
----------------
sammccall wrote:
> as with forest, I think this places too much emphasis on the
> memory-compactness when it's not the main benefit. (If RAM were free we still
> wouldn't want a big array of stacks).
To address the two above comments, maybe something like.
```
A Graph-Structured Stack represents the multiple parse stacks for a generalized
LR parser.
Each node stores a parse state, the last parsed ForestNode, and its parent(s).
There may be several heads (top of stack), and the parser operates by:
- shift: pushing terminal symbols on top of the stack
- reduce: replace N symbols on top of the stack with one nonterminal
The structure is a DAG rather than a stack:
- GLR allows multiple actions on the same head, producing forks (nodes with the
same parent).
- The parser reconciles nodes with the same (state, ForestNode), producing
joins (node with multiple parents).
The parser is responsible for creating nodes, and keeping track of the set of
heads.
```
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:46
+// growth of the stack:
+// - combing equal stack prefixes -- A new stack doesn't need to have a full
+// copy of its parent’s stack. They share a common prefix.
----------------
combing -> combining
I wonder if combining is the right explanation here - unlike local ambiguity
packing, it's not like we produce two things and then merge them.
What about rather: `sharing stack prefixes: when the parser must take two
conflicting paths, it creates two new stack head nodes with the same parent.`
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:48
+// copy of its parent’s stack. They share a common prefix.
+// - combing euqal stack suffices -- as there are a finite number of DFA's
+// state the parser can be in. A set of heads can be in the same state
----------------
combing -> combining
euqal -> equal
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:48
+// copy of its parent’s stack. They share a common prefix.
+// - combing euqal stack suffices -- as there are a finite number of DFA's
+// state the parser can be in. A set of heads can be in the same state
----------------
sammccall wrote:
> combing -> combining
> euqal -> equal
the "-- as..." clause is a bit confusing and doesn't seem necessary.
I think what's missing here is a high level explanation of what's going on in
the parse to trigger this, and explicitly mentioning this is how we get a node
with multiple parents.
When alternative parses converge on the same interpretation of later tokens,
their stack heads end up in the same state. These are merged, resulting in a
single head with several parents.
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:53
+//
+// E.g. we have two active stacks:
+// 0 -> 1 -> 2
----------------
this shows forking but not merging
and in particular it suggests that #heads == #stacks, which is not the case
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:54
+// E.g. we have two active stacks:
+// 0 -> 1 -> 2
+// | ^ head1, representing a stack [2, 1, 0]
----------------
arrows are pointing the wrong way (at least opposite to our pointers)
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:58
+// ^ head2, representing a stack [3, 1, 0]
+struct Graph {
+ // Represents a node in the graph.
----------------
the name "Graph" is IMO too general, this isn't a reusable graph or even dag
class.
I think GSS is fine, it's a distinctive name from the literature, and the
expansion of the abbreviation is nice and descriptive (but too verbose to be
the actual name)
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:59
+struct Graph {
+ // Represents a node in the graph.
+ struct Node {
----------------
this comment doesn't say anything, remove?
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:60
+ // Represents a node in the graph.
+ struct Node {
+ // The parsing state presented by the graph node.
----------------
alignas(Node*)
(in practice this will match ForestNode* so is a no-op, but it documents the
intent)
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:61
+ struct Node {
+ // The parsing state presented by the graph node.
+ LRTable::StateID State : LRTable::StateBits;
----------------
again, drop comment unless there's something to say
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:64
+ static constexpr unsigned PredecessorBits = 3;
+ // Number of the predecessors of the node.
+ // u is the predecessor of v, if u -> v.
----------------
This is not what predecessor usually means in graph theory: u is the
predecessor of v if there is *some path* from u->v.
I think "parent" is a common and well-understood term.
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:68
+ // The forest node for a termina/nonterminal symbol.
+ // The symbol correponds to the label of edges which leads to current node
+ // from the predecessor nodes.
----------------
Also not sure what this comment says:
- first line just repeats the type: type is a forest node, forest nodes are
always for symbols, and terminal/nonterminal are all the possibilities
- second line is either referring to or defining (not sure) edge labels, but
edge labels aren't defined or referred to anywhere else, so why?
Maybe:
```
The parse tree for the last symbol we parsed.
This symbol appears on the left of the dot in the parse state.
```
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:77
+
+ bool operator==(const Node &L) const {
+ return State == L.State && predecessors() == L.predecessors();
----------------
why is Parsed not part of the identity?
(maybe there's a good reason, but there should be a comment)
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:87
+ llvm::ArrayRef<const Node *> Predecessors) {
+ assert(Predecessors.size() < (1 << Node::PredecessorBits) &&
+ "Too many predecessors to fit in PredecessorBits!");
----------------
8 parents isn't that many and it seems like this might be a dynamic property of
the input code rather than a static property of the grammar.
But I don't think this bitpacking is buying anything, it looks like the layout
is:
- State : 13
- PredecessorCount : 3
- (padding) : 48
- Parsed : 64
So I think we might as well just use a uint16 PredecessorCount and still have
room left over for a uint16 refcount later. Is there anything else we might
usefully store in the extra space?
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:100
+ size_t bytes() const { return Arena.getTotalMemory() + sizeof(*this); }
+ size_t nodeCount() const { return NodeCount; }
+
----------------
name consistently with forest arena
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:112
+
+ const ForestNode *parse(const TokenStream &Code);
+
----------------
as discussed offline, the signature here should involve a token range, or
likely an ArrayRef<ForestNode> for the relevant terminals.
Probably a FIXME to allow the start symbol to be specified.
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:114
+
+ const Graph &getGSS() const { return GSS; }
+
----------------
sammccall wrote:
> this function never exposes an interesting graph, as it's mostly empty before
> + after parse().
> It's only really useful for working out how much memory is allocated.
>
> If we were to drop this, the public interface of GLRParser would be a single
> function (so it need not be a public class) and Graph would disappear
> entirely!
>
> With this in mind, it seems like we could replace this header file with
> something like:
>
> ```
> struct ParseStats {
> unsigned GSSBytes = 0;
> };
> ForestNode *glrParse(const TokenStream &, const LRTable&, const Grammar&,
> ForestArena&, ParseStats *Stats = nullptr);
> ```
>
> It feels like hiding the GSS might be an obstacle to debugging/testing.
> However this really would need a story/API around how to interrupt the parser.
> LLVM_DEBUG seems reasonable to me for now.
After offline discussion I think we either want:
- to hide GSS as mentioned above, and just test the forest output
- expose GSS class, and have a two versions of the parse function: one that
finalizes the GSS into a result ForestNode, and a "raw" one that just returns
the GSS. Then we can get at the GSS state at a point by running the raw parser
on some prefix of the code. (Or even *only* have the raw one and make it the
caller's responsibility to extract the node from the GSS, but this is probably
silly)
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:129
+ // Frontier is to track all avaiable actions from all active stack heads.
+ struct Frontier {
+ // A corresponding stack head.
----------------
I love the name frontier!
Unfortunately the word refers to the whole boundary/set, rather than individual
elements of it.
Maybe this?
```
/// The list of actions we're ready to perform.
struct {
std::vector<Pending> Shift;
std::vector<Pending> Reduce;
std::vector<Pending> Accept;
bool empty(); // accept is an odd-one-out here, but I think it's going away?
} Frontier;
```
Also maybe a FIXME that the frontier needs to order reductions so that local
ambiguity packing is guaranteed to work as a single pass
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:133
+ // An action associated with the Head.
+ const LRTable::Action *PerformAction = nullptr;
+ };
----------------
just Action?
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Forest.cpp:30
+ }
+ llvm_unreachable("unhandle node kind!");
+}
----------------
unhandled
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:33
+ ParsedForest.init(Code);
+ const Token *Lookahead = &Code.tokens().front();
+ addActions(GSS.addNode(/*StartState*/ 0, nullptr, {}), *Lookahead);
----------------
sammccall wrote:
> It actually currently looks like it would be at least as natural to have the
> parser operate on the sequence of terminal ForestNodes rather than on the
> Tokens themselves...
I'd suggest calling this NextTok instead of Lookahead:
- in common english this is a verb rather than a noun
- in parsers it more often refers to the *number* of tokens in the tables than
the tokens themselves
- referring to the token you're *currently shifting* as "ahead" is bizarre!
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:67
+
+std::vector<const Graph::Node *>
+GLRParser::performShift(Token::Index Lookahead) {
----------------
if this is a hot loop we shouldn't be creating std::vectors and throwing them
away
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:80
+
+ // Merge the stack -- if multiple stack heads are going to shift a same
+ // state, we perform the shift only once by combining these heads.
----------------
we shift *tokens*, not states: the token conceptually moves from the input to
the stack. (This seems to be other references not just me!)
if multiple stack heads will reach the same state after shifting a token?
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:85
+ // state 2 and state 3:
+ // 0 -> 1 -> 2
+ // ` -> 3
----------------
I think the arrows directions aren't particularly clear (our pointers go the
opposite way) or necessary here, and backticks and up arrows are hard to read.
A couple of (widely-supported) box-drawing characters would help a lot I think:
```
0--1--2
└--3
0--1--2--4
└--3--┘
```
(I would keep using `-` and `|` rather than making everything pretty boxes,
it's readable enough and easier to edit. We may want to consider box-drawing
characters for dump functions though...)
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:93
+ // Shifts are partitioned by the shift state, so each partition (per loop
+ // iteration) corresponds to a "perform" shift.
+ llvm::sort(ShiftList, [](const Frontier &L, const Frontier &R) {
----------------
I can't parse `a "perform" shift`.
Batch shifts by target state so we can merge matching groups?
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:97
+ R.PerformAction->kind() == LRTable::Action::Shift);
+ return std::forward_as_tuple(L.PerformAction->getShiftState(), L.Head) <
+ std::forward_as_tuple(R.PerformAction->getShiftState(), R.Head);
----------------
I don't think tiebreaking by `Head` achieves anything.
If we really need to ensure deterministic behavior, comparing pointers won't do
that as allocation may vary.
On the other hand, `stable_sort` should work: we'll tiebreak by order enqueued,
so we're deterministic if our inputs are.
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:108
+ // Predecessors of the new head in GSS.
+ std::vector<const Graph::Node *> Predecessors;
+ llvm::for_each(Batch, [&Predecessors](const Frontier &F) {
----------------
SmallVector
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:110
+ llvm::for_each(Batch, [&Predecessors](const Frontier &F) {
+ assert(llvm::find(Predecessors, F.Head) == Predecessors.end() &&
+ "Unexpected duplicated stack heads during shift!");
----------------
llvm::is_contained
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:114
+ });
+ const auto *Head = GSS.addNode(NextState, Leaf, Predecessors);
+ LLVM_DEBUG(llvm::dbgs()
----------------
if this turns out to be hot, you could avoid the temporary array of
predecessors: return a mutable node from addNode
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:128
+static std::vector<std::string>
+getStateString(llvm::ArrayRef<const Graph::Node *> A) {
+ std::vector<std::string> States;
----------------
name doesn't match return type
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:131
+ for (const auto &N : A)
+ States.push_back(llvm::formatv("state {0}", N->State));
+ return States;
----------------
I suspect `S{0}` will be verbose enough, it's not likely spelling out `state`
will make this much less cryptic, but it may be harder to scan
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:141
+ assert(PathStorage.empty() && "PathStorage must be empty!");
+ std::function<void(const Graph::Node *, unsigned)> enumPath =
+ [&CB, &PathStorage, &enumPath](const Graph::Node *Current,
----------------
We're leaning on std::function a lot for code that's supposedly in the hot path.
It's probably fine, but I'd like to see this written more directly.
As far as I can tell, this could easily be a class, with enumerateReducePath a
(recursive) method that concretely calls into handleReducePath() or something
at the bottom.
Also if it's a class we can easily reset() instead of destroying it if we want
to reuse its vectors across separate reductions.
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:158
+// heads.
+void GLRParser::performReduction(const Token &Lookahead) {
+ if (!ReduceList.empty())
----------------
(just a note, I need to dig into this part tomorrow!)
================
Comment at: clang/tools/clang-pseudo/ClangPseudo.cpp:33
desc("Print the LR table for the grammar"));
+static opt<std::string> ParseFile("parse", desc("Parse a C++ source file"),
+ init(""));
----------------
nit: a C++ source file => a source file
================
Comment at: clang/tools/clang-pseudo/ClangPseudo.cpp:33
desc("Print the LR table for the grammar"));
+static opt<std::string> ParseFile("parse", desc("Parse a C++ source file"),
+ init(""));
----------------
sammccall wrote:
> nit: a C++ source file => a source file
Why is this a new flag independent of `Source`?
It also seems inconsistent with other flags, which describe what output is
requested rather than what computation should be done.
Maybe "-print-forest" and "-print-stats"? (which could also potentially control
stats of other stages)
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D121150/new/
https://reviews.llvm.org/D121150
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits