Terence Parr <parrt <at> cs.usfca.edu> writes:

> 
> 
> On Jun 18, 2009, at 6:20 PM, Michaeljohn Clement wrote:
> > I think generating acceptable error messages from a parser alone is an
> > interesting hard problem.  It might be possible to do some statistical
> > analysis on a corpus of valid inputs and then derive heuristics to
> > suggest what the most likely error in the input string might be.
> 
> Has anyone thought of parsing backwards from the end towards the  
> detected error location? If you parse forwards and backwards you might  
> be able to zoom in on a problem area. Of course if there are lots of  
> errors following the first one, it won't help you too much. It's sort  
> of what a human does though, isn't it? We look down a few tokens and  
> work our way back up to see if we can make sense of things.
> 
> The other thing I wondered about. Can we launch a whole bunch of  
> threads using multiple cores to sniff the input to improve error  
> analysis? Maybe we launch parsers at multiple points in the input  
> stream and then use the interpretation that yields the fewest errors.
> 
> Just random thoughts. Let's use those cores, man! Right now, all they  
> do is run Pandora and instant messaging for me. ;)
> 
> Ter
> 

Parsing backwards is interesting; however, as you've mentioned, it 
runs into trouble when the document contains more than one parse 
error.

Another idea to get people thinking about is phrase-level, 
context-sensitive errors. Context sensitivity is achieved by 
matching some number of parser frames (each frame being the 
application of a production rule) on the top of the stack against a 
given pattern. If a phrase fails to match, the parser checks some 
sequence(s) of production rules against the top frames on the stack 
and, if they match, raises an error before it would otherwise 
backtrack to apply the next phrase in the production rule or cascade 
a failure.
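
To make the frame-matching idea concrete, here is a minimal sketch in 
Python. Everything in it is an assumption made for illustration: the 
rule names FunctionCall and ArgumentList, the ERROR_PATTERNS table and 
the check_frames helper are hypothetical, and representing the stack as 
a plain list of rule names is only one way a parser might expose its 
frames.

```python
# Hypothetical pattern table: each entry pairs a sequence of rule names
# (outermost first, innermost last) with the message to report.
ERROR_PATTERNS = [
    (("FunctionCall", "ArgumentList"), "malformed argument in function call"),
]

def check_frames(stack, patterns=ERROR_PATTERNS):
    """Raise SyntaxError if the top frames of `stack` match a pattern.

    `stack` is a list of rule names, one per active rule application,
    with the most recent application last. Returning normally means no
    pattern matched and the caller should backtrack as usual.
    """
    for pattern, message in patterns:
        # Compare the pattern against the same number of top frames.
        if tuple(stack[-len(pattern):]) == pattern:
            raise SyntaxError(message)
```

A parser would call check_frames(stack) at the point where a phrase has 
just failed, before trying the next phrase of the same rule.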

Another idea, previously mentioned, is a production-rule-level 
parse error: if none of a production rule's phrases match, the 
failure is reported as a parse error. This is very appealing, 
especially for production rules with only one phrase, where an error 
is only detectable on failure of the production rule. The following 
is an example of this:

Type : 'int' : 'float' : 'char' ;
Identifier : !Type ... ;

IdentifierDeclaration
    : Type Identifier IdentifierList <semicolon>
    ;

IdentifierList
    : <comma> Identifier IdentifierList
    : <>
    ;

For the input "int foo, bar float;", when the parser applies 
IdentifierList(", bar float;") it will match the comma and the identifier 
and *also* the nested IdentifierList (which fails on 'float' and 
backtracks to match <> (epsilon)). IdentifierDeclaration will then fail 
when it fails to match the semicolon.

This is an annoying case where a parse error is somewhat obscured 
by a successful application of a production rule. Really, the parse error 
occurs in IdentifierList, but it only appears when IdentifierDeclaration 
fails and the failure subsequently cascades off of the parser stack.
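
The behaviour described above can be reproduced with a small 
hand-written backtracking parser. This is only a sketch in Python under 
some assumptions: the tokenizer, the ParseError class and the 
"swallowed" list (which records failures that the epsilon phrase hides) 
are all hypothetical additions, and Identifier is approximated as any 
word that is not a type name.

```python
import re

TYPES = {"int", "float", "char"}

class ParseError(Exception):
    def __init__(self, rule, expected, pos):
        super().__init__(f"{rule}: expected {expected} at token {pos}")
        self.rule, self.expected, self.pos = rule, expected, pos

def tokenize(text):
    # Words, commas and semicolons are the only tokens in this toy input.
    return re.findall(r"[A-Za-z_]\w*|[,;]", text)

def identifier(toks, pos):
    # Identifier : !Type ... ;  (here: any word that is not a type name)
    if pos < len(toks) and toks[pos] not in TYPES and toks[pos] not in (",", ";"):
        return pos + 1
    return None

def identifier_list(toks, pos, swallowed):
    # First phrase: <comma> Identifier IdentifierList
    if pos < len(toks) and toks[pos] == ",":
        after_id = identifier(toks, pos + 1)
        if after_id is not None:
            return identifier_list(toks, after_id, swallowed)
        swallowed.append(("IdentifierList", "Identifier", pos + 1))
    else:
        swallowed.append(("IdentifierList", "<comma>", pos))
    # Second phrase: <> (epsilon) always succeeds, hiding the failure above.
    return pos

def identifier_declaration(toks, swallowed):
    # Single phrase: Type Identifier IdentifierList <semicolon>
    if not toks or toks[0] not in TYPES:
        raise ParseError("IdentifierDeclaration", "Type", 0)
    after_id = identifier(toks, 1)
    if after_id is None:
        raise ParseError("IdentifierDeclaration", "Identifier", 1)
    end = identifier_list(toks, after_id, swallowed)
    if end >= len(toks) or toks[end] != ";":
        raise ParseError("IdentifierDeclaration", "<semicolon>", end)
    return end + 1

hidden = []
try:
    identifier_declaration(tokenize("int foo, bar float;"), hidden)
except ParseError as err:
    print(err)     # IdentifierDeclaration: expected <semicolon> at token 4
    print(hidden)  # [('IdentifierList', '<comma>', 4)]
```

The reported error names IdentifierDeclaration, while the failure that 
actually points at 'float' was recorded inside IdentifierList and then 
discarded by the epsilon match. Note that the swallowed list also fills 
up on valid input, so it is only informative once the parse has failed.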


_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
