On Jun 25, 2012, at 2:37 PM, Dmitri Gribenko <[email protected]> wrote:

> Hi Doug,
> 
> Thank you for the review!  Comments inline, updated patch attached.
> 
> On Mon, Jun 25, 2012 at 10:45 AM, Douglas Gregor <[email protected]> wrote:
>> +namespace tok {
>> +enum TokenKind {
>> +  eof,
>> +  newline,
>> +  text,
>> +  command,
>> +  verbatim_block_begin,
>> +  verbatim_block_line,
>> +  verbatim_block_end,
>> +  verbatim_line,
>> +  html_tag_open_name,
>> +  html_tag_open_attr,
>> 
>> I find the choice of html_tag_open_name/html_tag_open_attr interesting. 
>> Presumably, you're doing this so that we don't need to have '=' and string 
>> literals as token kinds (because we generally don't want them as token 
>> kinds), and that makes sense. However, shouldn't there be an 
>> html_tag_open_end token as well, to capture the '>'? Otherwise, the parser 
>> has to either intuit the ending tag based on having a non-html_tag_open* 
>> token or poke at the last character in each html_tag_open_* token.
> 
> The idea was to eat html_tag_open_attr until we see a
> non-html_tag_open_attr token.
> 
>> I see that you don't have location information for the '=' or for the value 
>> in TextValue2, although I guess you can intuit it from the length?
> 
> Oh, the locations for '=' and '>' are lost...  Seems like I'll need to
> introduce '<', html_ident, '=', html_quoted_string, '>' as tokens in
> html state.  Done.

I like this a lot better, thanks!
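
For the archive, here is a rough sketch of how a parser might consume the
new token sequence '<' html_ident (html_ident ['=' html_quoted_string])* '>'.
The token-kind spellings and the helper types below are assumptions for
illustration, not the patch's actual API:

    // Minimal sketch, not the patch's API: consume one HTML open tag
    // from a pre-lexed token stream.
    #include <cassert>
    #include <cstddef>
    #include <string>
    #include <vector>

    namespace tok {
    enum TokenKind {
      html_tag_open,      // '<'  (spelling assumed)
      html_ident,
      html_equals,        // '='  (spelling assumed)
      html_quoted_string,
      html_greater,       // '>'  (spelling assumed)
      text
    };
    } // namespace tok

    struct Token {
      tok::TokenKind Kind;
      std::string Text; // the real lexer keeps pointers into the source
    };

    struct HTMLAttribute { std::string Name, Value; };

    // Returns the index just past the tag and fills TagName/Attrs.
    size_t parseHTMLOpenTag(const std::vector<Token> &Toks, size_t I,
                            std::string &TagName,
                            std::vector<HTMLAttribute> &Attrs) {
      assert(Toks[I].Kind == tok::html_tag_open && "expected '<'");
      ++I;
      assert(Toks[I].Kind == tok::html_ident && "expected tag name");
      TagName = Toks[I++].Text;
      while (I < Toks.size() && Toks[I].Kind == tok::html_ident) {
        HTMLAttribute Attr;
        Attr.Name = Toks[I++].Text;
        if (I < Toks.size() && Toks[I].Kind == tok::html_equals) {
          ++I; // eat '='
          if (I < Toks.size() && Toks[I].Kind == tok::html_quoted_string)
            Attr.Value = Toks[I++].Text;
        }
        Attrs.push_back(Attr);
      }
      if (I < Toks.size() && Toks[I].Kind == tok::html_greater)
        ++I; // eat '>'
      return I;
    }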

>> +  /// Registered verbatim-like block commands.
>> +  VerbatimBlockCommandVector VerbatimBlockCommands;
>> <snip>
>> +  /// Registered verbatim-like line commands.
>> +  VerbatimLineCommandVector VerbatimLineCommands;
>> <snip>
>> +  /// \brief Register a new verbatim block command.
>> +  void addVerbatimBlockCommand(StringRef BeginName, StringRef EndName);
>> +
>> +  /// \brief Register a new verbatim line command.
>> +  void addVerbatimLineCommand(StringRef Name);
>> 
>> Do we need this much generality? It seems like a StringSwitch for each case 
>> (verbatim block and verbatim line) would suffice for now, and we could 
>> optimize that later by TableGen'ing the string matcher for the various 
>> Doxygen/HeaderDoc/Qt/JavaDoc command names.
> 
> Converted to StringSwitch, but retained the vector search.  I think we
> need both: a string matcher to handle predefined commands and a vector
> to handle commands registered dynamically.

Are we expecting that commands registered dynamically will have their own 
parsers that build custom AST nodes?

> But the StringSwitch should go, because the upcoming parsing and
> semantic analysis need to know command properties in detail; keeping
> a separate list here would duplicate it.  We will need to TableGen
> this eventually.

Yes, we'll want to tablegen this, for performance if nothing else. Obviously, 
it doesn't have to happen now.
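
For reference, here is a minimal sketch of the two-level lookup described
above.  llvm::StringSwitch is the real utility; the surrounding names and
the particular command set are illustrative only:

    #include "llvm/ADT/StringRef.h"
    #include "llvm/ADT/StringSwitch.h"
    #include <vector>

    struct VerbatimBlockCommand {
      llvm::StringRef BeginName, EndName;
    };

    // Predefined commands: a StringSwitch, easy to TableGen later.
    bool isPredefinedVerbatimBlockCommand(llvm::StringRef Name,
                                          llvm::StringRef &EndName) {
      EndName = llvm::StringSwitch<llvm::StringRef>(Name)
                    .Case("verbatim", "endverbatim")
                    .Case("code", "endcode")
                    .Case("dot", "enddot")
                    .Default("");
      return !EndName.empty();
    }

    // Dynamically registered commands: a linear search is fine, since
    // registrations are rare and the vector stays short.
    bool findRegisteredVerbatimBlockCommand(
        const std::vector<VerbatimBlockCommand> &Commands,
        llvm::StringRef Name, llvm::StringRef &EndName) {
      for (const VerbatimBlockCommand &C : Commands)
        if (C.BeginName == Name) {
          EndName = C.EndName;
          return true;
        }
      return false;
    }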

>> +  while (TokenPtr != CommentEnd) {
>> +    switch(*TokenPtr) {
>> +      case '\\':
>> +      case '@': {
>> +        TokenPtr++;
>> +        if (TokenPtr == CommentEnd) {
>> +          formTokenWithChars(T, TokenPtr, tok::text);
>> +          T.setText(StringRef(BufferPtr - T.getLength(), T.getLength()));
>> +          return;
>> +        }
>> +        char C = *TokenPtr;
>> +        if (StringRef("\\@&$#<>%\".:").find(C) != StringRef::npos) {
>> 
>> This 'if' would almost surely be more efficient as a switch. More 
>> importantly, it should only apply to the '\\' case, and not the '@' case.
> 
> (a) Converted to a switch.  (b) Doxygen allows @ as an escape
> character too; for example, @\c will yield \c.

I didn't know that, thanks!
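
Presumably the converted check looks something like the following switch
over the escapable set from the StringRef in the hunk above -- a sketch,
not the actual patch:

    // Sketch: the StringRef::find membership test rewritten as a
    // switch over the characters that may follow '\' or '@' as an
    // escape (the set "\\@&$#<>%\".:" from the hunk above).
    static bool isCommentEscapeChar(char C) {
      switch (C) {
      case '\\': case '@': case '&': case '$':
      case '#':  case '<': case '>': case '%':
      case '"':  case '.': case ':':
        return true;
      default:
        return false;
      }
    }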

>> Also, I suspect that the various \f[…] and \f(…) commands actually do 
>> paren-, brace-, and bracket-matching internally. Please add a FIXME here so 
>> we don't forget to deal with this later (but, of course, it's not at all 
>> important to solve at this point!).
> 
> Doxygen doesn't check that.  And actually it might be harmful to do
> so, because a complex LaTeX formula (one using \phantom, for example)
> may have unbalanced parentheses.

Good point.
> 
>> +      default: {
>> +        while (true) {
>> +          TokenPtr++;
>> +          if (TokenPtr == CommentEnd)
>> +            break;
>> +          char C = *TokenPtr;
>> +          if(C == '\n' || C == '\r' ||
>> +             C == '\\' || C == '@' || C == '<')
>> +            break;
>> +        }
>> 
>> I hadn't realized it before, but now I see in the code that the string 
>> tokens returned by the lexer are not 'maximal', in the sense that they don't 
>> necessarily cover an entire string. For example, something like "foo \& bar" 
>> comes in as three tokens, "foo ", "&", and " bar". That makes
> 
> Seems like the sentence ends unexpectedly here, but I got the point.
> However, returning maximal tokens would require allocating memory for
> the unescaped string (currently we get away with returning pointers
> into the source file).

That makes sense.
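
To make the trade-off concrete, a small illustration (not code from the
patch) of why the non-maximal tokens stay allocation-free:

    #include "llvm/ADT/StringRef.h"

    // The comment text "foo \& bar" lexes into three text tokens,
    // each one a view into the original source buffer:
    const char *Buffer = "foo \\& bar";
    llvm::StringRef T1(Buffer, 4);     // "foo "
    llvm::StringRef T2(Buffer + 5, 1); // "&", the char after '\'
    llvm::StringRef T3(Buffer + 6, 4); // " bar"
    // A single maximal token "foo & bar" has no contiguous backing
    // bytes in the source, so producing it would force a copy into
    // separately owned storage.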

>> +    case '*': { // C comment.
>> +      BufferPtr++; // Skip star.
>> +
>> +      // Skip Doxygen magic marker.
>> +      const char C = *BufferPtr;
>> +      if ((C == '*' && *(BufferPtr + 1) != '/') || C == '!')
>> +        BufferPtr++;
>> +
>> +      // Skip less-than symbol that marks trailing comments.
>> +      if (BufferPtr != CommentEnd && *BufferPtr == '<')
>> +        BufferPtr++;
>> 
>> Should we (somewhere, at some point) eliminate lines of all '*'s in C 
>> comments, e.g.,
>> 
>> /**************
>>  * Docs
>>  **************/
>> 
>> ?
>> 
>> This seems not at all important now, but we should consider it for the 
>> future. (FIXME somewhere, perhaps).
> 
> I think it should be done while post-processing the comment text
> (probably in CommentSema).  Actually, this looks very much like
> Markdown 'rulers'.

Okay.
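
A possible shape for that post-processing step, if it lands in
CommentSema -- the helper below is hypothetical, not part of the patch:

    #include "llvm/ADT/StringRef.h"

    // Hypothetical helper: detect a 'ruler' line consisting only of
    // '*' characters (plus surrounding whitespace), so that comment
    // post-processing can drop it from the text.
    static bool isRulerLine(llvm::StringRef Line) {
      Line = Line.trim();
      if (Line.empty())
        return false;
      for (char C : Line)
        if (C != '*')
          return false;
      return true;
    }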

Updated patch looks great!

        - Doug

