Re: Backtick Hickup

Allan Odgaard Sun, 19 Aug 2007 07:46:50 -0700

On Aug 14, 2007, at 9:45 AM, Michel Fortin wrote:

[...] Your interpretation of the syntax would require that:


    (mine)   ` `````````` `
    (your's) ``````````` `````````` ```````````

Well, showing that my interpretation of Gruber’s writings leads to a lot of redundant back-ticks (in a fictional case) is not really showing that my interpretation is wrong ;)

But based on the code for Markdown.pl it would seem that the standard has an additional requirement, not made explicit in the syntax document (namely that back-ticks must not follow the end-token).
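A rough sketch of that extra requirement, assuming a code-span pattern shaped like the one in Markdown.pl (the pattern is paraphrased, not copied, and `code_spans` is a hypothetical helper used only for illustration):

```python
import re

# Assumed pattern: an opening run of back-ticks, lazy content, then the same
# run again. The closing run may be neither preceded nor followed by a
# further back-tick; the trailing (?!`) is the undocumented requirement.
CODE_SPAN = re.compile(r"(`+)(.+?)(?<!`)\1(?!`)", re.DOTALL)


def code_spans(text):
    """Return the content of every code span the pattern finds in text."""
    return [m.group(2) for m in CODE_SPAN.finditer(text)]


print(code_spans("a `b` c"))
print(code_spans("`` a`b ``"))
print(code_spans("`a``"))
```

With this pattern the last input yields no code span at all, because the only candidate end-token is followed by another back-tick.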

Personally, as I have said before, the back-tick rules are confusing (when you want to include a back-tick in the code) and we might be better off by just defining some simpler rules.

My proposal (from the thoughts on a formal grammar) was to have `normal raw` and ``double-quoted raw`` where the latter would support escape codes (at least \`).

But there are other options. Having escape-codes in raw though could prove to be generally useful.

[...]
(There's also a check for a backslash at the start, although I just realised that this needs work as it doesn't give a correct result for an escaped literal backslash like this: \\`code`.)

And this is *exactly* why I think the current parser is so flawed, because you can’t look at things in isolation -- *everything* is dependent on what precedes it, not just the previous character, but every single character that comes before the current one (granted, it seems that in practice, i.e. the standard inferred from how the parser actually works, things are dependent only on characters preceding things _in the same paragraph_ -- but it seems to me that this is really just a side-effect of how the parser is written, and not always desired. For example embedded HTML does not lend itself well to the “split the document into paragraphs” approach).

Anyway, if we agree that everything is dependent on everything that precedes it, I think we can slowly start to agree that *also* having things depend on what follows is problematic. I.e. we turn parsing into the Chinese game of pickup sticks -- the way this is presently (mostly) solved is by doing iterative scans, where each iteration is handling a given “token”, so rather than have the placement of the token in the document define the outcome (i.e. the closer it is to the start of the document, the higher its precedence), it is based on the order of the iterative scans (i.e. the first token “seen” by the parser, where it might be blind to `**` the first time it scans the document). Take this example:


    This **is `raw** text`

Here we “naively” (i.e. regular parser) see the bold start-token first, and it is paired, but since Markdown scans for raw text before bold text, it ends up as:


    <p>This **is <code>raw** text</code></p>

If we actually addressed this edge case in the standard, would we really define the above to be the expected behavior? And if so, how do we even document the general rule used here?
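The scan order behind that output can be sketched roughly as follows. This is a toy two-pass converter, not Markdown.pl itself: the patterns are simplified approximations, and `multipass` and the NUL-delimited placeholders are hypothetical names chosen for the illustration.

```python
import re


def multipass(text):
    """Convert code spans in a first pass, then bold in a second pass."""
    spans = []

    def hide(match):
        # Park the code content so the bold pass cannot see inside it.
        spans.append(match.group(2))
        return "\x00%d\x00" % (len(spans) - 1)

    # Pass 1: code spans (assumed pattern, see the comment above).
    text = re.sub(r"(`+)(.+?)(?<!`)\1(?!`)", hide, text)
    # Pass 2: bold. Any `**` that was inside a code span is already hidden.
    text = re.sub(r"\*\*(?=\S)(.+?)(?<=\S)\*\*", r"<strong>\1</strong>", text)
    # Restore the parked code spans.
    return re.sub(
        r"\x00(\d+)\x00",
        lambda m: "<code>%s</code>" % spans[int(m.group(1))],
        text,
    )


print(multipass("This **is `raw** text`"))
print(multipass("**bold** and `code`"))
```

By the time the bold pass runs, the second `**` has been swallowed by the code span, so the first `**` has nothing to pair with and stays literal.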

The “syntax” quickly becomes the implementation, because we would have to define it like “first the document is broken into embedded-HTML parts and non-embedded HTML parts, the HTML embedded parts are found using this heuristic: …, the non-embedded HTML parts are then broken into paragraphs (where a paragraph is defined using …), for each paragraph we first scan for one or more back-ticks and see if there is an equal number in the same paragraph, if so, that part is made raw, and that part is no longer worked on, and for the text to the left and right of the raw text we do …” etc.

Such a specification a) can lead to a lot of misunderstandings (already in the above I neglected to mention how escaping ` will not cause a code-span, although Markdown 1.0.1 does turn \`this\` into <code>, but it seems the regexp you use does not), and b) requires the parser to be written in a certain way which is rather non-standard, so parser tools cannot help in this.


A more formal approach would be something like semi-EBNF:

markdown: html | block-element

html: '<' ID attribute* '>' html* '</' ID '>'
    | '<' ID attribute* '/>'

block-element: heading | list | blockquote | raw | inline

heading: '#'+ inline | inline '\n' ('-'|'='){3,} '\n'

inline: (ESCAPE | bold | italic | code | link | PARA-TEXT)+

bold: '**' inline '**' | '__' inline '__'

code: s-q-code | d-q-code

s-q-code: '`' CODE+ '`'
d-q-code: '``' (CODE | ESCAPE)+ '``'

ID:        [A-Za-z][A-Za-z0-9]*
CODE:      [^`]
ESCAPE:    \.
PARA-TEXT: [^\n] | \n[^\n]
…

The above is written in Mail, and not meant to be exact, just give a rough idea of what I am talking about, as I am not sure that is entirely clear to you.

And sure, we can’t get all the way with EBNF, but maybe we can get 95% of the way, and that would be a tremendous win.
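To make the inline part of the sketch grammar a little more concrete, here is a minimal single-pass renderer covering only the `bold`, `s-q-code`, `d-q-code` and `ESCAPE` rules. It is an illustration under several assumptions, not a proposed implementation: `render` is a hypothetical name, an opening token with no partner is left as literal text, a bold span ends at the next occurrence of its token (escapes are not considered when searching for it), and the code content is not HTML-escaped.

```python
import re

S_Q_CODE = re.compile(r"`([^`]+)`")            # s-q-code: '`' CODE+ '`'
D_Q_CODE = re.compile(r"``((?:\\.|[^`])+)``")  # d-q-code: '``' (CODE | ESCAPE)+ '``'


def render(s):
    """Render one line of inline text; the first token seen wins."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):    # ESCAPE: \.
            out.append(s[i + 1])
            i += 2
            continue
        m = D_Q_CODE.match(s, i) or S_Q_CODE.match(s, i)
        if m:
            body = m.group(1)
            if m.re is D_Q_CODE:               # only d-q-code honours escapes
                body = re.sub(r"\\(.)", r"\1", body)
            out.append("<code>" + body + "</code>")
            i = m.end()
            continue
        tok = s[i:i + 2]
        if tok in ("**", "__"):                # bold: '**' inline '**'
            end = s.find(tok, i + 2)
            if end > i + 2:                    # paired, with non-empty content
                out.append("<strong>" + render(s[i + 2:end]) + "</strong>")
                i = end + 2
                continue
        out.append(s[i])                       # PARA-TEXT
        i += 1
    return "".join(out)


print(render("This **is `raw** text`"))
print(render("``a\\`b``"))
```

Because the input is consumed strictly left to right, the `**` in the first example is seen before the back-tick and claims the span, which differs from the multi-pass result.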

As I noted in my initial letter (last year about thoughts on a formal grammar) we would (unfortunately) have to break with current behavior for (undocumented) edge-cases, like the raw above, since with the above specification, it is the first token seen, that decides which style to switch to -- we can still make requirements that it needs to be paired, e.g.:


    This is **not bold.

Would not have `**` start bold. But personally, I am not favoring that direction, mainly though because it easily leads to problems parsing, but also because I am not sure it really is desired.


Take e.g. a paragraph like:

    You can set the SVN_EDITOR variable.

Now someone figures it would be good to append `(similar to CVS_EDITOR)`. This now makes the full paragraph transform in an undesired way, even though the two sentences on their own transform fine, but when they follow each other, they do not (<em> is introduced in the resulting HTML).
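The effect can be reproduced with a simplified emphasis rule. The pattern below is an approximation of the kind of regexp such parsers use, and `emphasize` is a hypothetical helper, so treat this as a sketch rather than the exact behavior of any one implementation:

```python
import re

# Assumed rule: `*` or `_`, immediately followed by non-space, lazy content,
# then the same character immediately preceded by non-space.
EM = re.compile(r"(\*|_)(?=\S)(.+?)(?<=\S)\1", re.DOTALL)


def emphasize(text):
    """Wrap every span matched by the assumed emphasis rule in <em>."""
    return EM.sub(r"<em>\2</em>", text)


first = "You can set the SVN_EDITOR variable"
second = "(similar to CVS_EDITOR)"

print(emphasize(first + "."))            # single underscore: unchanged
print(emphasize(second))                 # single underscore: unchanged
print(emphasize(first + " " + second + "."))
```

Each fragment contains only one underscore and passes through untouched; once they share a paragraph, the two underscores pair up across the sentence boundary.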

I know that it is stated somewhere that Markdown should be all about the person and implementation complexity is irrelevant. The problem is that implementation complexity has led us to the current situation where we have parsers doing different things (and syntax highlight not always being accurate) and we have lots of broken edge-cases and IMO unintuitive behavior -- so we got the implementation complexity, but I don’t think we have something which is “better” than had this followed more formal rules.

[...] That said, it's certainly the very edge of an edge case. If we're to define a formal syntax, let's not start there.

As should hopefully be clear from the above, I am *not* talking about documenting every single edge case, I am talking about defining the syntax using more traditional means of defining syntaxes.

Anyway, enough dead horse beating for now. Hopefully I’ll find time to do a mostly complete parser based on EBNF for the current Markdown syntax, and then I can bring up the topic again, listing the compromises necessary for it to work, and the advantages/disadvantages may be more apparent then.


_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss
