On Aug 14, 2007, at 9:45 AM, Michel Fortin wrote:

[...] Your interpretation of the syntax would require that:

    (mine)   ` `````````` `
    (yours)  ``````````` `````````` ```````````

Well, showing that my interpretation of Gruber’s writings leads to a lot of redundant back-ticks (in a fictional case) is not really showing that my interpretation is wrong ;)

But based on the code for Markdown.pl it would seem that the standard has an additional requirement, not made explicit in the syntax document (namely that back-ticks must not follow the end-token).

Personally, as I have said before, I find the back-tick rules confusing (when you want to include a back-tick in the code), and we might be better off just defining some simpler rules.

My proposal (from the thoughts on a formal grammar) was to have `normal raw` and ``double-quoted raw`` where the latter would support escape codes (at least \`).

But there are other options. Having escape codes in raw text could, though, prove to be generally useful.
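As a rough illustration of that proposal, here is a minimal Python sketch. The function name `render_code_spans` and the exact regular expression are assumptions made for this example, not part of Markdown.pl or of any agreed syntax, and HTML-escaping of the span's content is left out to keep it short.

```python
import re

# Hypothetical rules, following the proposal above:
#   `...`    raw text, no escapes, cannot contain a back-tick
#   ``...``  raw text in which a backslash escapes the next character
CODE_SPAN = re.compile(r"``((?:\\.|[^`\\])+)``|`([^`]+)`")


def render_code_spans(text):
    """Replace both kinds of code span with <code> elements."""
    def replace(match):
        if match.group(1) is not None:
            # Double-quoted form: drop the backslash of each escape.
            body = re.sub(r"\\(.)", r"\1", match.group(1))
        else:
            # Single-quoted form: content is taken verbatim.
            body = match.group(2)
        return "<code>" + body + "</code>"
    return CODE_SPAN.sub(replace, text)


print(render_code_spans("use `ls -l` here"))
# use <code>ls -l</code> here
print(render_code_spans(r"a ``tick \` inside`` b"))
# a <code>tick ` inside</code> b
```

With only two delimiter forms, the end-token can be found without counting back-ticks or looking at what follows it.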

[...]
(There's also a check for a backslash at the start, although I just realised that this needs work as it doesn't give a correct result for an escaped literal backslash like this: \\`code`.)

And this is *exactly* why I think the current parser is so flawed: you can’t look at things in isolation. *Everything* is dependent on what precedes it, not just the previous character, but every single character that comes before the current one. (Granted, it seems that in practice, i.e. in the standard inferred from how the parser actually works, things depend only on the characters preceding them _in the same paragraph_ -- but it seems to me that this is really just a side-effect of how the parser is written, and not always desired. For example, embedded HTML does not lend itself well to the “split the document into paragraphs” approach.)
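The backslash case makes this concrete: whether a back-tick is escaped cannot be decided from the single character before it, since two backslashes followed by a back-tick are an escaped backslash and then a live back-tick. A minimal Python sketch, assuming backslash escapes pair up from left to right (the helper name `is_escaped` is hypothetical):

```python
def is_escaped(text, pos):
    """Return True if text[pos] is preceded by an odd number of backslashes."""
    count = 0
    i = pos - 1
    # Walk backwards over the whole run of backslashes, not just one character.
    while i >= 0 and text[i] == "\\":
        count += 1
        i -= 1
    return count % 2 == 1


print(is_escaped(r"\`code`", 1))   # True: the back-tick is escaped
print(is_escaped(r"\\`code`", 2))  # False: the backslash is escaped, the back-tick is live
```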

Anyway, if we agree that everything is dependent on everything that precedes it, I think we can slowly start to agree that *also* having things depend on what follows is problematic, i.e. we turn parsing into the Chinese game of pick-up sticks. The way this is presently (mostly) solved is by doing iterative scans, where each iteration handles a given “token”. So rather than having the placement of the token in the document define the outcome (i.e. the closer it is to the start of the document, the higher its precedence), the outcome is based on the order of the iterative scans (i.e. the first token “seen” by the parser, which might be blind to `**` the first time it scans the document). Take this example:

    This **is `raw** text`

Here we “naively” (i.e. as a regular parser would) see the bold start-token first, and it is paired, but since Markdown scans for raw text before bold text, it ends up as:

    <p>This **is <code>raw** text</code></p>

If we actually addressed this edge case in the standard, would we really define the above to be the expected behavior? And if so, how do we even document the general rule used here?
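The scan-order effect can be reproduced with a few lines of Python. This is a deliberately simplified sketch, not the Markdown.pl code: the two regular expressions and the function name `render_by_pass_order` are assumptions, and only single back-tick code spans and `**` bold are handled.

```python
import re

CODE = re.compile(r"`([^`]+)`")
BOLD = re.compile(r"\*\*(.+?)\*\*")


def render_by_pass_order(text):
    """Iterative scans: code spans first, then bold in whatever text is left."""
    # re.split with one capture group puts the code-span bodies at odd indexes.
    parts = CODE.split(text)
    out = []
    for index, part in enumerate(parts):
        if index % 2:
            # Already claimed by the code pass; later passes never see it.
            out.append("<code>" + part + "</code>")
        else:
            out.append(BOLD.sub(r"<strong>\1</strong>", part))
    return "<p>" + "".join(out) + "</p>"


print(render_by_pass_order("This **is `raw** text`"))
# <p>This **is <code>raw** text</code></p>
```

The bold pass only ever sees `This **is `, so the earlier start-token loses simply because its pass runs second.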

The “syntax” quickly becomes the implementation, because we would have to define it like “first the document is broken into embedded-HTML parts and non-embedded-HTML parts, the embedded-HTML parts are found using this heuristic: …, the non-embedded-HTML parts are then broken into paragraphs (where a paragraph is defined using …), for each paragraph we first scan for one or more back-ticks and see if there is an equal number in the same paragraph, if so, that part is made raw, and that part is no longer worked on, and for the text to the left and right of the raw text we do …” etc.

Such a specification a) can lead to a lot of misunderstandings (already in the above I neglected to mention how escaping ` will not cause a code-span, although Markdown 1.0.1 does turn \`this\` into <code>, but it seems the regexp you use does not), and b) requires the parser to be written in a certain way which is rather non-standard, so parser tools cannot help with this.

A more formal approach would be something like semi-EBNF:

    markdown: html | block-element

    html: '<' ID attribute* '>' html* '</' ID '>'
        | '<' ID attribute* '/>'

    block-element: heading | list | blockquote | raw | inline

    heading: '#'+ inline | inline '\n' ('-'|'='){3,} '\n'

    inline: (ESCAPE | bold | italic | code | link | PARA-TEXT)+

    bold: '**' inline '**' | '__' inline '__'

    code: s-q-code | d-q-code

    s-q-code: '`' CODE+ '`'
    d-q-code: '``' (CODE | ESCAPE)+ '``'

    ID:        [A-Za-z][A-Za-z0-9]*
    CODE:      [^`]
    ESCAPE:    \.
    PARA-TEXT: [^\n] | \n[^\n]
    …

The above was written in Mail and is not meant to be exact, just to give a rough idea of what I am talking about, as I am not sure that is entirely clear to you.
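To show how directly such rules map onto code, here is a small recursive-descent sketch in Python covering only the `inline`, `bold` and `code` rules; `italic` and `link` are left out because the grammar above does not define them. The function names are hypothetical, and treating an unpaired start-token as literal text is an assumption of this sketch rather than something the grammar settles.

```python
def parse_code(text, pos):
    """code: s-q-code | d-q-code. Returns (html, new_pos)."""
    if text.startswith("``", pos):                      # d-q-code
        i, body = pos + 2, []
        while i < len(text) and text[i] != "`":
            if text[i] == "\\" and i + 1 < len(text):  # ESCAPE: \.
                body.append(text[i + 1])
                i += 2
            else:                                       # CODE: [^`]
                body.append(text[i])
                i += 1
        if body and text.startswith("``", i):
            return "<code>" + "".join(body) + "</code>", i + 2
        return "``", pos + 2                            # unpaired: literal
    end = text.find("`", pos + 1)                       # s-q-code
    if end > pos + 1:
        return "<code>" + text[pos + 1:end] + "</code>", end + 1
    return "`", pos + 1                                 # unpaired: literal


def parse_inline(text, pos=0, closer=None):
    """inline: (ESCAPE | bold | code | PARA-TEXT)+. Returns (html, new_pos)."""
    out = []
    while pos < len(text):
        if closer and text.startswith(closer, pos):
            break                                       # let the caller consume it
        if text[pos] == "\\" and pos + 1 < len(text):  # ESCAPE: \.
            out.append(text[pos + 1])
            pos += 2
        elif text.startswith(("**", "__"), pos):        # bold
            mark = text[pos:pos + 2]
            body, end = parse_inline(text, pos + 2, closer=mark)
            if body and text.startswith(mark, end):
                out.append("<strong>" + body + "</strong>")
                pos = end + 2
            else:                                       # unpaired: literal
                out.append(mark)
                pos += 2
        elif text[pos] == "`":                          # code
            html, pos = parse_code(text, pos)
            out.append(html)
        else:                                           # PARA-TEXT
            out.append(text[pos])
            pos += 1
    return "".join(out), pos


print(parse_inline("a **b `c` d** e")[0])
# a <strong>b <code>c</code> d</strong> e
```

Each function corresponds to one rule, and the input is read once from left to right, so the code can be checked against the grammar rule by rule.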

And sure, we can’t get all the way with EBNF, but maybe we can get 95% of the way, and that would be a tremendous win.

As I noted in my initial letter (last year, about thoughts on a formal grammar), we would (unfortunately) have to break with current behavior for (undocumented) edge cases, like the raw example above, since with the above specification it is the first token seen that decides which style to switch to. We can still require that the token be paired, e.g.:

    This is **not bold.

Would not have `**` start bold. But personally, I do not favor that direction, mainly because it easily leads to parsing problems, but also because I am not sure it really is desired.

Take e.g. a paragraph like:

    You can set the SVN_EDITOR variable.

Now someone figures it would be good to append `(similar to CVS_EDITOR)`. This now makes the full paragraph transform in an undesired way: the two sentences transform fine on their own, but when they follow each other they do not (<em> is introduced in the resulting HTML).
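The effect is easy to reproduce. The Python sketch below uses a simplified stand-in for a paired `_`/`*` emphasis rule; the regular expression and the name `emphasize` are assumptions for illustration, not a copy of any particular parser.

```python
import re

# Hypothetical emphasis rule: a `_` or `*` followed by non-space, the
# shortest possible body, then the same character preceded by non-space.
EMPHASIS = re.compile(r"(\*|_)(?=\S)(.+?)(?<=\S)\1", re.S)


def emphasize(text):
    """Wrap every paired emphasis run in <em>."""
    return EMPHASIS.sub(r"<em>\2</em>", text)


first = "You can set the SVN_EDITOR variable."
second = "(similar to CVS_EDITOR)"

print(emphasize(first))                 # unchanged: only one underscore
print(emphasize(second))                # unchanged: only one underscore
print(emphasize(first + " " + second))
# You can set the SVN<em>EDITOR variable. (similar to CVS</em>EDITOR)
```

Neither underscore is a start-token on its own; each becomes one only because of text that may be added much later.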

I know that it is stated somewhere that Markdown should be all about the person and implementation complexity is irrelevant. The problem is that implementation complexity has led us to the current situation where we have parsers doing different things (and syntax highlighting not always being accurate) and we have lots of broken edge-cases and IMO unintuitive behavior -- so we got the implementation complexity, but I don’t think we have something which is “better” than if this had followed more formal rules.

[...] That said, it's certainly the very edge of an edge case. If we're to define a formal syntax, let's not start there.

As should hopefully be clear from the above, I am *not* talking about documenting every single edge case, I am talking about defining the syntax using more traditional means of defining syntaxes.

Anyway, enough dead horse beating for now. Hopefully I’ll find time to do a mostly complete parser based on EBNF for the current Markdown syntax, and then I can bring up the topic again, listing the compromises necessary for it to work, and the advantages/disadvantages may be more apparent then.
