Re: RFC: custom error messages

Akim Demaille Sun, 05 Jan 2020 08:53:45 -0800

Hi Christian,

Sorry I missed your message.  For some reason the title of the thread
was broken in the other answers.


> On 3 Jan 2020, at 13:08, Christian Schoenebeck <schoeneb...@crudebyte.com>
> wrote:
> 
> On Friday, 3 January 2020 11:07:05 CET Akim Demaille wrote:
>> One severe issue brought to my attention by Rici Lake (unfortunately
>> privately, although he had written a very nice and detailed mail with
>> all the details) is that this would break several existing parsers
>> that expect yytname to be this way.  For instance he pointed to
>> 
>> https://git.gnupg.org/cgi-bin/gitweb.cgi?p=libksba.git;a=blob;f=src/asn1-parse.y;h=5bff15cd8db64786f7c9e2ef000aeddd583cfdc0;hb=HEAD#l856
>> currently not responding, but the code is:
>> | for (k = 0; k < YYNTOKENS; k++)
>> |   {
>> |     if (yytname[k] && yytname[k][0] == '\"'
>> |         && !strncmp (yytname[k] + 1, string, len)
>> |         && yytname[k][len + 1] == '\"' && !yytname[k][len + 2])
>> |       return yytoknum[k];
>> |   }
> 
> Looks like the use case here is to distinguish non-terminals from terminal 
> symbols. That could be addressed by introducing some official API function:
> 
> bool yy_is_non_terminal(enum yysymbolid id);
> 
> and/or:
> 
> bool yy_is_terminal(enum yysymbolid id);

Not exactly.  The test here is to tell the difference between
string aliases ("break" represented as "\"break\"") and plain symbols
(TOK_BREAK, represented as "TOK_BREAK").  The difference between terminals
and nonterminals is handled by the loop itself: it stops at YYNTOKENS,
and from YYNTOKENS on there are only nonterminals.
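
To make the distinction concrete, here is a minimal sketch of that kind
of lookup.  The table demo_tname, the constant DEMO_NTOKENS and the
function demo_find_alias are made up for illustration; they only mimic
the layout of a generated yytname (terminals first, string aliases
stored with their double quotes) and are not what Bison emits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a generated yytname table: terminals come
   first, then nonterminals.  String aliases keep their double quotes.  */
enum { DEMO_NTOKENS = 4 };
static const char *const demo_tname[] = {
  "$end", "TOK_BREAK", "\"break\"", "\"if\"",   /* terminals */
  "stmt", "expr",                                /* nonterminals */
  NULL
};

/* Return the index of the terminal whose string alias is STRING,
   or -1 if there is none.  Plain symbols such as TOK_BREAK never
   match because their name does not start with a double quote.  */
static int
demo_find_alias (const char *string)
{
  assert (string != NULL);
  size_t len = strlen (string);
  for (int k = 0; k < DEMO_NTOKENS; k++)   /* terminals only */
    {
      const char *name = demo_tname[k];
      if (name[0] == '"'
          && !strncmp (name + 1, string, len)
          && name[len + 1] == '"' && !name[len + 2])
        return k;
    }
  return -1;
}
```

With this table, looking up "break" finds the alias, while looking up
"TOK_BREAK" or "stmt" finds nothing.  Code like this depends on the
quotes being present in the table.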

Anyway, as I mentioned I don't want to support this.  And I will not
make it easier.


> Then those double quotes could simply be dropped. Or was there any other use 
> case for looking at those double quote characters?

I definitely want to get rid of these quotes!  But not with 'verbose'
error messages, only with 'custom' and 'rich'.


>> I think he is right, hence the call to yysyntax_error_arguments which
>> returns the list of expected/unexpected tokens.
> 
> Actually I had a general purpose push API in mind. Your suggestion would 
> limit retrieving the "next expected symbols" solely to error message 
> purposes. 

Yes, I'm focusing on improving the error messages, which is probably
the most common request in recent years.


> Why not make that a general-purpose function instead that users could call 
> at any time with the current parser state:
> 
> // returns NULL terminated list
> const enum yysymbolid* yynextsymbols(const yystate* currentParserState);

I don't want to have to deal with allocating space.  Your proposal
needs to allocate space.  Hence the clumsy interface I provided :)
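
As a rough sketch of the trade-off: an interface that returns a
NULL-terminated list has to get that list from somewhere, whereas one
that fills a caller-supplied array does not.  The names below
(struct demo_state, demo_expected_tokens) are hypothetical and only
illustrate the second style; this is not the actual signature of
yysyntax_error_arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy "parser state": the symbols acceptable next.  */
struct demo_state
{
  const int *expected;   /* internal symbol numbers */
  int nexpected;         /* how many of them */
};

/* Store up to ARGN expected symbol numbers in ARGV and return how many
   were stored.  Nothing is allocated: the caller owns ARGV and decides
   how large it is.  */
static int
demo_expected_tokens (const struct demo_state *s, int argv[], int argn)
{
  assert (s != NULL && argn >= 0);
  int n = s->nexpected < argn ? s->nexpected : argn;
  for (int i = 0; i < n; i++)
    argv[i] = s->expected[i];
  return n;
}
```

The price is that the caller must pick a size up front, and the list is
silently truncated when the array is too small.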

> Because there are other important use cases that I pointed out to you:
> auto completion features; e.g. interactive command line shells where the user 
> can auto complete the currently incomplete command by hitting tab key, or a 
> programming language code editor GUI/IDE where the user would get a non-
> obtrusive popup while typing for potential code completions. In these use 
> cases you are not (necessarily) addressing syntax errors. The parser might be 
> very well in some valid state.

I see your point.

> For that purpose, and to continue the idea about a general purpose push API, 
> it would be very useful to have a function for duplicating the current parser 
> state:
> 
> yystate* yydupstate(const yystate* parserState);

Wow, you're talking about massive surgery in yacc.c.  Roughly,
stop using local variables for the stacks.  Which is what the
push-interface does (I'm talking about api.push here).

Or are you referring to push-parsers when you say "push API"?


> and one function to push parse on a specific parser state:
> 
> bool yypushparse(yystate* parserState, char nextchar);
> 
> The latter returning false on parser errors. That way people would have a
> very flexible and powerful API for all kinds of use cases. Because by being
> able to duplicate states, you can have "throw away" parser states, where you
> can try out things without touching the "official" parser state. For
> instance I am using that to auto correct user typos in some parsers (that is
> guessing what user had in mind on syntax errors by some limited brute force
> attempts by parser on throw-away parser states).

That might be doable with api.push.  I don't see that coming for
the pull interface.

> But there are many other use cases as well for this: for instance multi-
> threaded parsing tasks where each thread would get its own parser state and 
> each thread e.g. might be working on a different branch of a grammar tree to 
> reduce latency (overall response time) of a parser system.

Again, that's the kind of thing for api.pure, not the regular
yacc.c.

>> I can't make up my mind on whether returning the list of expected
>> tokens as strings (as exemplified above), or simply as their symbol
>> numbers.  Symbol numbers are more efficient, yet they are the
>> *internal* symbol numbers, not the ones the user is exposed to.
> 
> I would suggest both. It would make sense to auto generate an enum list for 
> all symbols like:
> 
> enum yysymbolid {
>    IDENTIFIER,
>    SWITCH,
>    IF,
>    CONST,
>    ...
> };
> and use that numeric type probably for most Bison APIs for performance 
> reasons. That type could also be condensed to a smaller type if requested 
> (i.e. for embedded systems):
> 
> enum yysymbolid : uint8_t {
>    IDENTIFIER,
>    SWITCH,
>    IF,
>    CONST,
>    ...
> };
> 
> But there should still be a way for people being able to convert that 
> conveniently to its original string representation from source.y:
> 
> const char* yysymbolname(enum yysymbolid);

Yes, of course.  That's not "both", that's just what I refer
to by "exposing the numbers".  "yysymbolname(x)" is currently
just "yytname[x]".
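
In other words, something along these lines.  This is only a sketch:
the enum, the table and demo_symbolname are hypothetical stand-ins for
what a generated parser might provide, with the symbol names borrowed
from your example.

```c
#include <assert.h>

/* Hypothetical symbol numbers, in the spirit of the proposed enum.  */
enum demo_symbolid
{
  DEMO_IDENTIFIER,
  DEMO_SWITCH,
  DEMO_IF,
  DEMO_CONST,
  DEMO_NSYMS   /* number of symbols, not a symbol itself */
};

/* Hypothetical name table, playing the role of yytname.  */
static const char *const demo_names[DEMO_NSYMS] = {
  "IDENTIFIER", "SWITCH", "IF", "CONST"
};

/* The accessor is just a bounds-checked table lookup.  */
static const char *
demo_symbolname (enum demo_symbolid id)
{
  assert ((unsigned) id < (unsigned) DEMO_NSYMS);
  return demo_names[id];
}
```

So exposing the numbers already gives the strings for free, as long as
the table is indexed by those same numbers.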

> Happy new 2k20 BTW!  ;-)

Thanks!  Best wishes to you!
