Dear Tom, First, thanks for the discussion about my "hint" infrastructure patch submission. Whatever the opinion, it helps.
> We'd have to write our own version of bison's verbose-error code anyway, > because the canned code doesn't support localization --- it uses > hardwired strings for "expecting" and so on. But it looks possibly > doable to me. I also played a little bit with the parser over the weekend. I found the same issues that you noted in your mail, tried more options, and found other problems. I sent here the full mail I wrote. Although I agree that it is "doable", I have stronger reserve than yours. Also, I do not find it an appealing solution to change "gram.c" a lot. The problems I found are: (1) not available raw information. The automaton stack keeps states, which are not directly linked to rules and terminals. The terminals are not available, they must be kept separatly if you want them. This can be done in yylex(). The internal state, stack, token... are local variables within yyparse(). As a result, they are not accessible from yyerror. I haven't found any available hook, so you have to hack "gram.c" to get this information. It is a 1 line hack, but it modifies the generated code. Or you have to produce a new hacked "gram.c", which is what you suggest. (2) hard to compute follow set The information about the syntax is there, obviously! However, from a given automaton state, the actually allowed tokens are not directly available. Only some are directly available, so the result is: DROP ? ; ERROR: syntax error at or near "?" at character 6 HINT: hint status is state=422 n=796 char=575 type=320=Op stack= 0 16 422 tokens= 88:DROP 320:Op expected= 150:LANGUAGE? Other tokens may be found, but you have either: - to *simulate* the automaton forward through reductions to check whether it is finally accepted. Not really beautiful, but that can be done. It is just the bison code without the reduction production rules;-) - move backwards before doing the above, if some reductions where performed because of the submitted token and finally resulted in the error, the state that lead to the error may not be the highest available one, so maybe other allowed tokens may also be missed. We would need to have the last state before any reduction. Another small hack in generated "gram.c" would be needed to get it. However, (3) the computed follow set is often useless, as you noted. The reason for that is that keywords are accepted in place of identifiers. The use of rules: reserved_keyword, func_name_keyword, col_name_keyword, unreserved_keyword as possible column/function/... identifiers is IMHO the main reason for the size of the generated automaton. If you drop that, everything would be much simpler, because it would reduce a lot the number of arcs in the automaton. In particular, anything which is just an identifier should not be given a name (say, BOOLEAN_P or SMALLINT should not be tokens, but could rather be recognized as a special case of identifier from within the parser, maybe with some simple post-filtering). As you noted, for things like "SHOW 123", the follow set basically includes all keywords although you can have SHOW ALL or SHOW name. So, as you suggested, you can either say "ident" as a simplification, but you miss ALL which is meaningful, or you say all keywords, which is useless. An alternative is to know that you're within a "SHOW something", that is you somehow reparse the code from within the syntax error:-( It is necessary if you want to say something more interesting than "ident", say "option name". You may also get this information by simulating the automaton forward, or noticing that some of the current on stack states can only lead to the reduction of a ShowStmt... Well, on the positive side, it is sometimes right, say: REVOKE ? ; ERROR: syntax error, unexpected Op at or near "?" at character 8 HINT: hint status is state=481 n=1297 char=575 type=320=Op stack= 0 31 481 tokens= 234:REVOKE 320:Op expected= 10:ALL 59:CREATE 80:DELETE_P 98:EXECUTE 136:INSERT 224:REFERENCES 239:RULE 244:SELECT 268:TEMP 270:TEMPORARY 279:TRIGGER 292:UPDATE 293:USAGE? (4) the available automaton data are inverted with respect to what would be needed... It would be nice to know whether we're heading for a create statement or something like that, but the information is hidden down the automaton. basically, the automaton would need to be "inverted" to compute this information. We would need the initial information in "gram.y", which was "inverted" to build the automaton. So the code somehow needs to undo what bison compilation has done. (5) anything that can be done would be hardwired to one version of bison. There is a lot of asumptions in the code and data structures, and any newer/older version with some different internal representation would basically break any code that would rely on that. So postgres would not be "bison" portable:-( I don't think it is an real option that old postgresql source would be broken against future bison releases. So, as far as I'm concerned, I don't find the internal way really convincing as the way to go. Thus I can see three options: (a) put hints in "gram.y" as I already suggested + I can do it + it can be incremental once the infrastructure is in place. + hints are simple to add or fix = hints are not necessarily marvellous - it would take time to develop - it may slow down the parser, because of added production rules. - it would change "gram.y" a lot and could make it harder to maintain - patchers don't like it... (b) write a new "recursive descendant" parser, and drop "gram.y" + I could do it + hints could be more intelligent + I think the parser would be faster, at least not slower = I'm not sure it would be really easier to maintain, but maybe not harder - it cannot be incremental at all - hints are at the price of some programming - it would take time to develop, but basically the structure would look like "gram.y" a lot, and existing data structures can be kept. - I'm not sure patchers would like it, and if it is thrown down the drain, I would not be happy for the one who spent the time (say, me;-) (c) do nothing. + I can do that quite easily;-) + patchers are ok with that + it does not take time = the maintainability of "gram.y" is not changed. - my students won't have hints. A funny technical detail is that the refactoring which is needed for the (a) option would be also needed for (b). (b) would require more time than (a), but the results could be better for the hint/parser error handling. I can have some time for this, especially over the next few months. Incremental options looked a better way to me, but (a) seems stuck and (b) is risky, and the internal way looks ugly. As a side effect of my inspection is that the automaton generated by bison could be simplified if a different tradeoff between the lexer, the parser and the post-processing would be chosen. Namelly, all tokens that are just identifiers should be dropped and processed differently. Have a nice day, -- Fabien Coelho - [EMAIL PROTECTED] ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster