Derek Peschel wrote:
On Mon, Feb 05, 2007 at 11:53:27PM -0800, Bob Foster wrote:
You might want to look at the Eclipse text API for clues how to
OK, thanks. I don't tend to install large packages very often,
but I hear a lot of good things about Eclipse.
regions and repair. This was designed for text highlighting using
scanners that partition the text into regions, each identified with a
kind. The kind is then used to direct highlighting, which can be
context-sensitive within a single partition.
Are you using "partition" and "region" interchangeably?
Probably, but a partition is a "collection of disjoint subsets of S
whose union is S" (http://mathworld.wolfram.com/SetPartition.html) where
a region is just a substring.
Scanners and
"kind" parsers can handle certain classes of language, but are those
classes any of the usual theoretical ones (context-free, context-sensitive)
or not?
I am absolutely certain there are grammars that cannot be "scanned".
For (say) C, would regions be things like comments and string literals
and ordinary code?
Yes.
Or have I misunderstood the typical "division of
labor" between scanning (creating regions) and parsing (working inside
regions)?
You understand, modulo terminology. In Eclipse, the text is scanned to
produce partitions, then each partition is rescanned to produce
substrings for coloring. Nothing in this process approximates "parsing"
but see below.
When the user changes text, the entire region of text changed is
damaged. The scanners back up to the last partition before the damaged
region and proceed forward until a newly added partition exactly matches
a previously existing partition of the same kind and span. This
completes the repair.
What if the user adds a region boundary rather than changing text within
a region?
Good question. The Eclipse scheme assumes that no partition can be
extended by characters beyond the partition. This is not a property of
grammars in general, esp. PEG grammars. Nobody said this was an easy
problem.
Take another example, a simple LISP which has only lists enclosed in
parentheses. Say it also allows semicolons to comment out line ends.
Marking comments, and counting uncommented parentheses, is about the only
thing the syntax highlighter needs to do. I don't know if Eclipse can do
that, but it's easy for me to reason about. Any addition or deletion of a
semicolon or a parenthesis could affect the counts of all later parentheses.
How might the repair algorithm handle this language?
The repair algorithm is to re(parse)scan from the start of the damaged
region/partition until the output of the parse is the same as the
previous output. Your example is just like the comment example. The
(parse)scan may need to run to the end of the text; it may not.
If you think of the output of a PEG parse as a parse tree each node of
which has attributes for the start and end position in the text, then it
is clear that any node that spans a damaged region will need to be
repaired by some amount of reparsing.
Yes, that makes sense. But that insight only tells you _that_ certain
nodes need to be regenerated, it doesn't tell you _how_ (how to find a
tree node to start the repair from and how to detect that the repair has
finished, not ignoring the fact that the structure of the parse tree may
change as a result of the edit).
Yes. The node to start the repair on is an ancestor of the damaged node,
but you don't know which ancestor.
It's probabilistic whether any incremental scheme will improve on the
responsiveness of a full parse, except the division of labor between
simple syntax coloring and full semantics coloring/error reporting.
There are two issues: In the Eclipse scheme, repair can be stopped
whenever a new partition exactly matches a previously existing partition
of the same kind and span. But PEG-derived parse trees will add (at
least) the notion of depth, as a parse tree is not a flat partition.
It doesn't matter that the parse tree came from a PEG, does it? _Any_
parse tree gives nodes a depth (and an environment, of its ancestors and
children) which means that looking at one node at a time rarely accomplishes
much because it ignores the environment. You're really saying that Eclipse
doesn't incrementally update a parse tree for the entire buffer, but only
a list of partitions. Right?
Right. Well, partially right. What Eclipse does is run the algorithm I
sketched for simple coloring, then run a full parse in the background to
deal with things that can't be handled by scanners, like whether an
identifier is an instance variable vs. a local variable vs. simply an
error. The reason for doing everything at least three times is that
simple coloring needs to happen in real time, in the user input thread,
while semantic coloring can wait. You can easily see the difference in
an Eclipse editor by typing quickly.
I'd guess this is roughly a master's thesis problem.
That wasn't what I wanted to hear, but it does help explain why I haven't
found very robust tools yet -- because they are still hard to write.
I guess the real question is are they worth writing? Users aren't
really suffering because no incremental parsers are available. They're
pretty much satisfied by a combination of fast colorers and slower
parsers...as long as the parsers aren't annoyingly slow.
-- Derek
--
"Beware of simple examples!" Oded Shmueli
_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg