Thanks for the response, Adrian. I got much further today.

> Yes. Ragel makes no assumptions about how the programmer wishes to
> allocate memory for input buffers. Avoiding such assumptions precludes
> automatic capture of matched items.
>
> Your choices are to copy characters into a buffer byte by byte, or to
> retain pointers. The latter approach requires more care if it is
> expected that interesting items span input buffers.

Great. That's essentially what I've been doing now.

    key = '"' @key (any - '"' )* @key_append '"';
    value = '"' @value (any - '"' )* @value_append '"';
    assignment = whitespace* key whitespace* "=" whitespace* value whitespace* @assignment;

One thing that still seems problematic is escaped quotes, though.

    "this here \"test\" is a"

I'm wondering what the approach is to express this. I was thinking of
something along the lines of

    key = '"' @key (any - ([^\\] '"') )* @key_append '"';

...but that obviously doesn't work as hoped. Any pointers here?

>> 2. I've had a look at the C grammar but did not really understand how
>> the comment rules worked. I tried with that approach but I could not
>> capture and access the comment text.
>
> See Chapter Four of the manual.

Cool, I came up with something very similar. But now I have changed it to

    comment_c = "/*" @comment ((any @comment_append)* - (any* "*/" any*)) "*/";
    comment_cpp = "//" @comment (any - "\n")* @comment_append "\n";

Thanks for the pointer. It just seems that my @comment_append action is
not positioned correctly. I am still getting a trailing "*" for
"comment_c". Not sure I understand why.

>> 4. What about unicode support? I've read that UTF8 should be possible.
>> What about UTF16?
>
> Yes, parsing UTF16 is possible. Ragel is only concerned with processing
> arrays of fixed size characters. These can be 1, 2, 4, etc bytes wide.
> The rest is up to you.

Sounds like converting UTF16 -> UTF8 and then using the proper byte
sequences might be a little easier.
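
For what it's worth, this is roughly how I picture the UTF16 -> UTF8 conversion step before the bytes are handed to the machine. It is only a sketch: the function name utf16_to_utf8 and its error convention are my own invention, nothing here comes from Ragel, and it assumes the UTF-16 code units are already in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Ragel): convert in_len UTF-16 code units
 * to UTF-8. Returns the number of bytes written to out, or (size_t)-1 on an
 * unpaired surrogate or when out_cap is too small. */
size_t utf16_to_utf8(const uint16_t *in, size_t in_len,
                     unsigned char *out, size_t out_cap)
{
    size_t o = 0;
    for (size_t i = 0; i < in_len; i++) {
        uint32_t c = in[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            /* High surrogate: combine with the following low surrogate. */
            if (i + 1 >= in_len) return (size_t)-1;
            uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) return (size_t)-1;
            c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
            i++;
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            /* Low surrogate without a preceding high surrogate. */
            return (size_t)-1;
        }

        /* Number of UTF-8 bytes this code point needs. */
        size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (out_cap - o < need) return (size_t)-1;

        if (need == 1) {
            out[o++] = (unsigned char)c;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (c >> 18));
            out[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

The output buffer could then be scanned with ordinary byte-wide rules, which is why this route looks simpler to me than a 16-bit alphabet.
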

I found the character sequence definitions here:

http://git.wincent.com/wikitext.git?a=blob;f=ext/wikitext_ragel.rl

    action non_printable_ascii
    {
        c = *p & 0x7f;
    }

    action two_byte_utf8_sequence
    {
        c = ((uint32_t)(*(p - 1)) & 0x1f) << 6 |
            (*p & 0x3f);
    }

    action three_byte_utf8_sequence
    {
        c = ((uint32_t)(*(p - 2)) & 0x0f) << 12 |
            ((uint32_t)(*(p - 1)) & 0x3f) << 6 |
            (*p & 0x3f);
    }

    action four_byte_utf8_sequence
    {
        c = ((uint32_t)(*(p - 3)) & 0x07) << 18 |
            ((uint32_t)(*(p - 2)) & 0x3f) << 12 |
            ((uint32_t)(*(p - 1)) & 0x3f) << 6 |
            (*p & 0x3f);
    }

    (0x01..0x1f | 0x7f)                             @non_printable_ascii |
    (0xc2..0xdf 0x80..0xbf)                         @two_byte_utf8_sequence |
    (0xe0..0xef 0x80..0xbf 0x80..0xbf)              @three_byte_utf8_sequence |
    (0xf0..0xf4 0x80..0xbf 0x80..0xbf 0x80..0xbf)   @four_byte_utf8_sequence

I'm still trying to figure out how to use those, though :)
Is there any other example available somewhere?

cheers
--
Torsten
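
P.S. To check my reading of those actions, I wrote the same bit arithmetic out as a plain C function, outside of Ragel. Treat it as a sketch under my own assumptions: utf8_decode is a made-up name, it works from the first byte forward rather than backwards from p, and like the ranges above it does not reject every ill-formed sequence (overlong three-byte forms and encoded surrogates still get through).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (len bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is truncated or has an invalid lead/continuation byte.
 * Lead-byte ranges and masks mirror the Ragel actions quoted above. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    if (len == 0) return 0;

    unsigned char b0 = s[0];
    size_t n;      /* total length of the sequence */
    uint32_t c;    /* code point being assembled */

    if (b0 < 0x80) {
        *cp = b0;                       /* plain ASCII */
        return 1;
    } else if (b0 >= 0xC2 && b0 <= 0xDF) {
        n = 2; c = b0 & 0x1F;           /* 110xxxxx */
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        n = 3; c = b0 & 0x0F;           /* 1110xxxx */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        n = 4; c = b0 & 0x07;           /* 11110xxx */
    } else {
        return 0;                       /* stray continuation or invalid lead */
    }

    if (len < n) return 0;

    /* Each continuation byte (0x80..0xBF) contributes its low six bits. */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }

    *cp = c;
    return n;
}
```

If I understand the quoted grammar correctly, each @..._sequence action fires on the last byte of a sequence, which is why it reaches back with p - 1, p - 2 and p - 3 instead of looping forward as this function does.
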
