On Jun 25, 2012, at 2:37 PM, Dmitri Gribenko <[email protected]> wrote:

> Hi Doug,
>
> Thank you for the review! Comments inline, updated patch attached.
>
> On Mon, Jun 25, 2012 at 10:45 AM, Douglas Gregor <[email protected]> wrote:
>> +namespace tok {
>> +enum TokenKind {
>> +  eof,
>> +  newline,
>> +  text,
>> +  command,
>> +  verbatim_block_begin,
>> +  verbatim_block_line,
>> +  verbatim_block_end,
>> +  verbatim_line,
>> +  html_tag_open_name,
>> +  html_tag_open_attr,
>>
>> I find the choice of html_tag_open_name/html_tag_open_attr interesting.
>> Presumably, you're doing this so that we don't need to have '=' and string
>> literals as token kinds (because we generally don't want them as token
>> kinds), and that makes sense. However, shouldn't there be an
>> html_tag_open_end token as well, to capture the '>'? Otherwise, the parser
>> has to either intuit the ending tag based on having a non-html_tag_open*
>> token or poke at the last character in each html_tag_open_* token.
>
> The idea was to eat html_tag_open_attr until we see a
> non-html_tag_open_attr token.
>
>> I see that you don't have location information for the '=' or for the value
>> in TextValue2, although I guess you can intuit it from the length?
>
> Oh, the locations for '=' and '>' are lost... Seems like I'll need to
> introduce '<', html_ident, '=', html_quoted_string, '>' as tokens in
> html state. Done.

I like this a lot better, thanks!
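Just to make sure I'm reading the update correctly: I imagine the html
state's kinds now look something like this (html_ident and
html_quoted_string are the names from your note; the names I've used
for the '<', '=', and '>' kinds below are just my guesses):

  namespace tok {
  enum TokenKind {
    // ... earlier kinds as before ...
    html_less,           // '<'
    html_ident,          // tag or attribute name
    html_equals,         // '='
    html_quoted_string,  // quoted attribute value
    html_greater         // '>'
  };
  } // end namespace tok

so that, say, <a href="foo"> lexes as html_less, html_ident,
html_ident, html_equals, html_quoted_string, html_greater, and the
parser gets a source location for every piece of punctuation for free.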
>> +  /// Registered verbatim-like block commands.
>> +  VerbatimBlockCommandVector VerbatimBlockCommands;
>> <snip>
>> +  /// Registered verbatim-like line commands.
>> +  VerbatimLineCommandVector VerbatimLineCommands;
>> <snip>
>> +  /// \brief Register a new verbatim block command.
>> +  void addVerbatimBlockCommand(StringRef BeginName, StringRef EndName);
>> +
>> +  /// \brief Register a new verbatim line command.
>> +  void addVerbatimLineCommand(StringRef Name);
>>
>> Do we need this much generality? It seems like a StringSwitch for each case
>> (verbatim block and verbatim line) would suffice for now, and we could
>> optimize that later by TableGen'ing the string matcher for the various
>> Doxygen/HeaderDoc/Qt/JavaDoc command names.
>
> Converted to StringSwitch, but retained the vector search. I think we
> need both: a string matcher to handle predefined commands and a vector
> to handle commands registered dynamically.

Are we expecting that commands registered dynamically will have their
own parsers that build custom AST nodes?

> But the StringSwitch should go, because the upcoming parsing and
> semantic analysis know command properties in detail; the command lists
> would be duplicated here and there. We will need to TableGen this
> eventually.

Yes, we'll want to TableGen this, for performance if nothing else.
Obviously, it doesn't have to happen now.
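Just to spell out the shape I had in mind for the predefined side,
something like this (an illustrative sketch only; the helper name and
the exact command list here are made up):

  #include "llvm/ADT/StringRef.h"
  #include "llvm/ADT/StringSwitch.h"
  using llvm::StringRef;

  /// Map a predefined verbatim block begin-command to its matching
  /// end-command.  Returns an empty StringRef if BeginName is not
  /// predefined, in which case the dynamically registered vector
  /// would be searched.
  static StringRef getVerbatimBlockEndName(StringRef BeginName) {
    return llvm::StringSwitch<StringRef>(BeginName)
        .Case("code", "endcode")
        .Case("verbatim", "endverbatim")
        .Case("htmlonly", "endhtmlonly")
        .Case("latexonly", "endlatexonly")
        .Case("dot", "enddot")
        .Default(StringRef());
  }

and similarly a StringSwitch<bool> for the verbatim line commands.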
>> +  while (TokenPtr != CommentEnd) {
>> +    switch(*TokenPtr) {
>> +    case '\\':
>> +    case '@': {
>> +      TokenPtr++;
>> +      if (TokenPtr == CommentEnd) {
>> +        formTokenWithChars(T, TokenPtr, tok::text);
>> +        T.setText(StringRef(BufferPtr - T.getLength(), T.getLength()));
>> +        return;
>> +      }
>> +      char C = *TokenPtr;
>> +      if (StringRef("\\@&$#<>%\".:").find(C) != StringRef::npos) {
>>
>> This 'if' would almost surely be more efficient as a switch. More
>> importantly, it should only apply to the '\\' case, and not the '@' case.
>
> (a) Converted to switch. (b) Doxygen allows @ as an escape character;
> for example, @\c will yield \c.

I didn't know that, thanks!

>> Also, I suspect that the various \f[…] and \f(…) commands actually do
>> paren-, brace-, and bracket-matching internally. Please add a FIXME here
>> so we don't forget to deal with this later (but, of course, it's not at
>> all important to solve at this point!).
>
> Doxygen doesn't check that. And it might actually be harmful to do so,
> because a complex LaTeX formula (with \phantom, for example) may have
> a paren imbalance.

Good point.

>> +    default: {
>> +      while (true) {
>> +        TokenPtr++;
>> +        if (TokenPtr == CommentEnd)
>> +          break;
>> +        char C = *TokenPtr;
>> +        if(C == '\n' || C == '\r' ||
>> +           C == '\\' || C == '@' || C == '<')
>> +          break;
>> +      }
>>
>> I hadn't realized it before, but now I see in the code that the string
>> tokens returned by the lexer are not 'maximal', in the sense that they
>> don't necessarily cover an entire string. For example, something like
>> "foo \& bar" comes in as three tokens, "foo ", "&", and " bar". That makes
>
> Seems like the sentence ends unexpectedly here, but I got the point.
> Returning maximal tokens, however, would require allocating memory for
> the unescaped string (currently we get away with returning pointers
> into the source file).

That makes sense.

>> +  case '*': { // C comment.
>> +    BufferPtr++; // Skip star.
>> +
>> +    // Skip Doxygen magic marker.
>> +    const char C = *BufferPtr;
>> +    if ((C == '*' && *(BufferPtr + 1) != '/') || C == '!')
>> +      BufferPtr++;
>> +
>> +    // Skip less-than symbol that marks trailing comments.
>> +    if (BufferPtr != CommentEnd && *BufferPtr == '<')
>> +      BufferPtr++;
>>
>> Should we (somewhere, at some point) eliminate lines of all '*'s in C
>> comments, e.g.,
>>
>> /**************
>>  * Docs
>>  **************/
>>
>> ?
>>
>> This seems not at all important now, but we should consider it for the
>> future. (FIXME somewhere, perhaps).
>
> I think that it should be done while post-processing the comment text
> (in CommentSema, I think). Actually, this looks very much like
> Markdown 'rulers'.

Okay.

Updated patch looks great!

- Doug
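P.S. For the record, the switch version of the escape check I was
suggesting is roughly this (an untested sketch; the helper name is
invented):

  /// True if C may follow '\' or '@' as an escaped character
  /// (the same set as the old StringRef("\\@&$#<>%\".:").find(C) test).
  static bool isEscapedCharacter(char C) {
    switch (C) {
    case '\\': case '@': case '&': case '$':
    case '#':  case '<': case '>': case '%':
    case '"':  case '.': case ':':
      return true;
    default:
      return false;
    }
  }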
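P.P.S. Whenever we do get around to the '*' ruler cleanup, I'd expect
the test itself to be as simple as something like this (a sketch; where
it should live, CommentSema or elsewhere, is still open per your
comment):

  #include "llvm/ADT/StringRef.h"

  /// True if Line consists only of '*'s, possibly surrounded by
  /// whitespace, i.e. a decorative ruler line in a C comment.
  static bool isRulerLine(llvm::StringRef Line) {
    Line = Line.trim();
    if (Line.empty())
      return false;
    for (char C : Line)
      if (C != '*')
        return false;
    return true;
  }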
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits