On Jan 29, 2013, at 4:01 PM, Dmitri Gribenko <[email protected]> wrote:
> Hi Fariborz,
>
> On Wed, Jan 30, 2013 at 1:42 AM, Fariborz Jahanian <[email protected]> wrote:
>> Author: fjahanian
>> Date: Tue Jan 29 17:42:26 2013
>> New Revision: 173850
>>
>> URL: http://llvm.org/viewvc/llvm-project?rev=173850&view=rev
>> Log:
>> [Doc parsing] Patch to parse Doxygen-supported HTML character
>> references to their UTIF-8 encoding. Reviewed offline by Doug.
>> // rdar://12392215
>>
>> Added:
>>     cfe/trunk/test/Index/special-html-characters.m
>> Modified:
>>     cfe/trunk/include/clang/AST/CommentLexer.h
>>     cfe/trunk/lib/AST/CommentLexer.cpp
>>
>> Modified: cfe/trunk/include/clang/AST/CommentLexer.h
>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/AST/CommentLexer.h?rev=173850&r1=173849&r2=173850&view=diff
>> ==============================================================================
>> --- cfe/trunk/include/clang/AST/CommentLexer.h (original)
>> +++ cfe/trunk/include/clang/AST/CommentLexer.h Tue Jan 29 17:42:26 2013
>> @@ -282,11 +282,18 @@ private:
>>    /// it stands for (e.g., "<").
>>    StringRef resolveHTMLNamedCharacterReference(StringRef Name) const;
>>
>> +  /// Given a Doxygen-supported named character reference (e.g., "™"),
>> +  /// it returns its UTF8 encoding.
>> +  StringRef HTMLDoxygenCharacterReference(StringRef Name) const;
>> +
>>    /// Given a Unicode codepoint as base-10 integer, return the character.
>>    StringRef resolveHTMLDecimalCharacterReference(StringRef Name) const;
>>
>>    /// Given a Unicode codepoint as base-16 integer, return the character.
>>    StringRef resolveHTMLHexCharacterReference(StringRef Name) const;
>> +
>> +  /// Helper routine to do part of the work for resolveHTMLHexCharacterReference.
>> +  StringRef helperResolveHTMLHexCharacterReference(unsigned CodePoint) const;
>>
>>    void formTokenWithChars(Token &Result, const char *TokEnd,
>>                            tok::TokenKind Kind) {
>>
>> Modified: cfe/trunk/lib/AST/CommentLexer.cpp
>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/AST/CommentLexer.cpp?rev=173850&r1=173849&r2=173850&view=diff
>> ==============================================================================
>> --- cfe/trunk/lib/AST/CommentLexer.cpp (original)
>> +++ cfe/trunk/lib/AST/CommentLexer.cpp Tue Jan 29 17:42:26 2013
>> @@ -34,6 +34,31 @@ bool isHTMLHexCharacterReferenceCharacte
>>
>>  } // unnamed namespace
>>
>> +static unsigned getCodePoint(StringRef Name) {
>> +  unsigned CodePoint = 0;
>> +  for (unsigned i = 0, e = Name.size(); i != e; ++i) {
>> +    CodePoint *= 16;
>> +    const char C = Name[i];
>> +    assert(isHTMLHexCharacterReferenceCharacter(C));
>> +    CodePoint += llvm::hexDigitValue(C);
>> +  }
>> +  return CodePoint;
>> +}
>> +
>> +StringRef Lexer::helperResolveHTMLHexCharacterReference(unsigned CodePoint) const {
>> +  char *Resolved = Allocator.Allocate<char>(UNI_MAX_UTF8_BYTES_PER_CODE_POINT);
>> +  char *ResolvedPtr = Resolved;
>> +  if (ConvertCodePointToUTF8(CodePoint, ResolvedPtr))
>> +    return StringRef(Resolved, ResolvedPtr - Resolved);
>> +  else
>> +    return StringRef();
>> +}
>> +
>> +StringRef Lexer::resolveHTMLHexCharacterReference(StringRef Name) const {
>> +  unsigned CodePoint = getCodePoint(Name);
>> +  return helperResolveHTMLHexCharacterReference(CodePoint);
>> +}
>> +
>>  StringRef Lexer::resolveHTMLNamedCharacterReference(StringRef Name) const {
>>    return llvm::StringSwitch<StringRef>(Name)
>>        .Case("amp", "&")
>> @@ -41,8 +66,154 @@ StringRef Lexer::resolveHTMLNamedCharact
>>        .Case("gt", ">")
>>        .Case("quot", "\"")
>>        .Case("apos", "\'")
>> +      .Case("minus", "-")
>> +      .Case("sim", "~")
>
> Sorry, but this is wrong: sim is U+223C, minus is U+2212.

Old code of mine. Not needed here. Will remove shortly.

>
>>        .Default("");
>>  }
>> +
>> +StringRef Lexer::HTMLDoxygenCharacterReference(StringRef Name) const {
>> +  return llvm::StringSwitch<StringRef>(Name)
>> +      .Case("copy", helperResolveHTMLHexCharacterReference(0x000A9))
>> +      .Case("trade", helperResolveHTMLHexCharacterReference(0x02122))
>> +      .Case("reg", helperResolveHTMLHexCharacterReference(0x000AE))
>
> ...
>
> Is this based on the subset described in
> http://www.stack.nl/~dimitri/doxygen/manual/htmlcmds.html ?

Yes.

>
> I think we can do better than this:
>
> (1) linear search is not great;
> (2) allocation is not great either.
>
> This needs some tablegen magic -- will try to hack up something tomorrow.

Great. Thanks.

- Fariborz

>
> Dmitri
>
> --
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <[email protected]>*/

_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
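
For illustration, the two objections raised in the review (linear search, and an allocation on every lookup) can both be avoided with a sorted static table whose entries already hold the UTF-8 bytes. The sketch below is only that: it is not the tablegen-generated code the thread anticipates, it uses the standard library instead of LLVM's StringRef, and the names NamedCharRef, NamedCharRefs, and lookupNamedCharRef are hypothetical. The "sim" and "minus" entries use the code points given in the review (U+223C and U+2212).

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string_view>

// Hypothetical table entry; not part of Clang's CommentLexer.
struct NamedCharRef {
  std::string_view Name; // reference name without '&' and ';'
  std::string_view UTF8; // pre-encoded UTF-8 bytes, stored as a literal
};

// Assumption: the table is kept sorted by Name so binary search is valid.
constexpr NamedCharRef NamedCharRefs[] = {
    {"copy", "\xC2\xA9"},      // U+00A9 COPYRIGHT SIGN
    {"minus", "\xE2\x88\x92"}, // U+2212 MINUS SIGN
    {"reg", "\xC2\xAE"},       // U+00AE REGISTERED SIGN
    {"sim", "\xE2\x88\xBC"},   // U+223C TILDE OPERATOR
    {"trade", "\xE2\x84\xA2"}, // U+2122 TRADE MARK SIGN
};

// Returns the UTF-8 encoding for Name, or an empty view if Name is unknown.
// The lookup is O(log n) and performs no allocation.
std::string_view lookupNamedCharRef(std::string_view Name) {
  const NamedCharRef *It = std::lower_bound(
      std::begin(NamedCharRefs), std::end(NamedCharRefs), Name,
      [](const NamedCharRef &Entry, std::string_view Key) {
        return Entry.Name < Key;
      });
  if (It != std::end(NamedCharRefs) && It->Name == Name)
    return It->UTF8;
  return {};
}
```

Because the byte sequences are string literals with static storage, the returned view stays valid for the life of the program. By contrast, in the committed StringSwitch version the argument of every .Case is evaluated on each call, so helperResolveHTMLHexCharacterReference allocates for each entry whether or not it matches.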
