On Friday, May 10, 2019 07:24:51 CEST Akim Demaille wrote:
> Hi Christian,
Hi Akim!

> Aren't you referring to LA correction, as implemented in Bison?
>
> https://www.gnu.org/software/bison/manual/html_node/LAC.html

Well no, it has something in common, but the use cases are different, see
below.

> > That's actually a very common required feature in practice. The main
> > use cases are:
> >
> > 1. Auto completion features
> >
> > 2. Human friendly error messages
>
> I think you are referring to the name of the tokens, not all the symbols.
> For the error messages, it makes sense. Although I am now more convinced
> that most of the time, error messages are more readable when you quote
> exactly the source than when you print token names.

No, I really mean the non-terminal symbols. Because sometimes [not always ;-)]
I don't use the classic approach of separate lexer and parser, but rather let
the parser do the lexer's job as well, like:

    FOO : 'F''O''O' ;

which can avoid the complex and error-prone push and pop constructs that a
separate lexer would require for certain complex grammars. So effectively,
some of the non-terminals (from Bison's point of view) are "terminals" from
the use case's point of view. But one of the main problems of this approach
is that e.g. the default syntax error messages are no longer useful at all,
because you would get a message like:

    Unexpected 'b', expecting 'F'.

instead of e.g.:

    Unexpected 'bla', expecting 'FOO'.

> > I do need these features for almost all parsers, hence for years (since
> > not available directly with Bison) I have a huge bunch of code on top
> > of the internal skeleton code to achieve them.
>
> Is it available for reading somewhere?

Well, I asked a couple of years ago on the list whether anybody was
interested in what I am doing to achieve these kinds of features, but got no
positive reply, hence I have not yet invested the time to write about it in
detail. However, I am not sure whether my approach would be of use for you
anyway.

Basically, what I am doing (roughly described) is C++ code on top of the
internally generated parser tables which replicates what the skeleton LALR(1)
parser would do. So the algorithm takes a parser state as argument:

    std::vector<YYTYPE_INT16>& stack

and then walks the possible routes starting from there (accessing the auto
generated tables directly), that is, pushing onto the stack for each
route/branch, reducing whenever required, etc. And to detect endless
recursions while doing this, I always track the complete parser history with:

    typedef std::set< std::vector<YYTYPE_INT16> > YYStackHistory;

So each entry in this history set is a previous parser symbol stack. Whenever
I reach a new parser state, I check whether the current parser stack already
exists in that history set; if it does, the algorithm has run into an endless
recursion, so I abort that individual branch and continue with the next
branch, and so on. That full history tracking might of course take a lot of
memory, but eventually the algorithm ends up returning a result:

    std::map<String,BisonSymbolInfo>& expectedSymbols

where the result map (not actually a multimap) contains the possible next
grammar rules for the previously supplied parser state; the key is the symbol
name, the value is a struct (BisonSymbolInfo) holding a) the sequence of
characters expected next to satisfy that grammar symbol and b) a flag
indicating whether the symbol is (considered as) a terminal or a "real"
non-terminal (see below).
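Roughly sketched (the field names here are just placeholders, not the exact
ones from my code), the result side looks like this:

    #include <map>
    #include <string>

    typedef std::string String;

    struct BisonSymbolInfo {
        // a) sequence of characters expected next to satisfy this symbol
        String nextExpectedChars;
        // b) terminal vs. "real" non-terminal (heuristic: see below)
        bool isTerminal;
    };

    typedef std::map<String, BisonSymbolInfo> ExpectedSymbolsMap;

From such a map it is then trivial to assemble a message like
"Unexpected 'bla', expecting 'FOO'." or to feed an auto completion list with
the keys.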
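And the endless recursion check itself, as a minimal, self-contained sketch;
possibleBranches() is a hypothetical stand-in (stubbed out here) for the part
that reads the auto generated tables and enumerates the successor stacks of a
branch:

    #include <set>
    #include <vector>
    #include <cstdint>

    typedef int16_t YYTYPE_INT16;

    // Each entry is a complete parser symbol stack seen earlier.
    typedef std::set< std::vector<YYTYPE_INT16> > YYStackHistory;

    // Hypothetical stand-in: the real code consults the generated
    // parser tables to find every stack reachable by one shift or
    // reduce from the state on top of 'stack'.
    static std::vector< std::vector<YYTYPE_INT16> >
    possibleBranches(const std::vector<YYTYPE_INT16>& stack) {
        (void) stack; // stubbed out for this sketch
        return {};
    }

    static void walkBranch(const std::vector<YYTYPE_INT16>& stack,
                           YYStackHistory& history) {
        // If this exact stack was seen before, we ran into an endless
        // recursion: abort this branch, the caller continues with the
        // next one.
        if (!history.insert(stack).second)
            return;
        for (const auto& next : possibleBranches(stack))
            walkBranch(next, history);
    }

The real code of course also collects the expected symbols while walking; the
sketch only shows the loop detection.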
So that C++ code is for me the basis for retrieving the next possible grammar
rules for any given parser state, which I use both for error messages and for
auto completion features.

Another problem with my pseudo-terminals (example FOO above): from Bison's
point of view, everything in my grammar is now a non-terminal, and I don't
want to manually mark individual grammar rules as either "real" non-terminals
or as "terminals". So I automated that: if the same symbol can be resolved in
more than one way, it is a "real" non-terminal; if the right-hand side of the
rule only contains characters, it is considered a "terminal". And that
fundamental information is automatically used to provide appropriate error
messages and auto completion in a fully automated way.

> Was the feature always fitting perfectly? Never ever did it result in
> something somewhat incorrect?

I did not make a proof of correctness of the algorithm. As you know, you have
to spend a huge amount of time to really prove this, which I did not.
Especially when you are working on commercial projects, you sometimes have to
be pragmatic, accept the chance of imperfection for the sake of moving
forward, and weigh costs against usefulness.

The main issue was catching endless recursions, which I solved as described
above. I have not investigated whether there are cases in which the algorithm
would fail; there might be. Keep in mind that the use cases for this were
error messages and auto completion, not executing (i.e. fatal) grammar rule
actions, so the potential harm is limited. Even if the algorithm failed in
certain cases, the worst you would get is that auto completion comes up with
no suggestion in that specific case, or that the error message is too short.
It is still an improvement over the overall situation of not having the
possibility for these kinds of features at all.

> >> In addition, tokens have several names: the identifier, and the
> >> string name, like
> >>
> >> %token <string> ID "identifier"
> >>
> >> Not to mention that I also want to provide support for
> >> internationalization. So what name should that be? ID?
> >> identifier? or identifiant in French?
> >
> > In practice you just need the symbol name as is. Nobody needs the
> > translation,
>
> I beg to disagree. Nobody should translate the keyword "break", but
>
> > # bison /tmp/foo.y
> > /tmp/foo.y:1.7: erreur: erreur de syntaxe, : inattendu, attendait char
> > ou identifier ou <tag>
> >     1 | %token: FOO
> >       |       ^
>
> looks stupid; "char", "identifier" and "<tag>" should be translated.

Well, my point was that the translation is trivial: an average developer is
certainly able to solve the required translation tasks very easily with
custom code if required. The other aspect, though, looking ahead from a
parser state with custom code, is not trivial. So my point was: when in
doubt, it might make sense to concentrate on the mandatory aspect of
providing access to the raw symbol names and to ignore the translation
aspects for now.

Best regards,
Christian Schoenebeck