On Aug 27, 2007, at 10:35 PM, Michel Fortin wrote:

Personally, as I have said before, the back-tick rules are confusing (when you want to include a back-tick in the code), and we might be better off just defining some simpler rules.
I don't find them confusing, but perhaps it's only because I'm used to it. Which aspect of it do you find confusing?

Maybe ‘intuitive’ would have been a better choice of word. But this thread started because somebody did not understand how to embed back-ticks in back-tick-quoted strings -- personally I didn’t understand it either until I looked at the implementation.
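
For reference, the rule in question, as Markdown.pl implements it: a code span may be delimited by a run of more than one back-tick, and leading/trailing whitespace just inside the delimiters is stripped, so

    ``here is a literal back-tick: ` ``

turns into

    <code>here is a literal back-tick: `</code>

-- documented behaviour, but apparently not something people guess.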

[...]
I think I prefer the current behaviour. I can't really see when having to escape the content of a code span would be useful. Perhaps you had something in mind when proposing that?

Yes, when you need special characters: you can’t use entities inside `…`, so ``…`` would allow you to write e.g. \u2620 for a Unicode character or similar. That said, with everybody using UTF-8 these days (knock on wood), escape codes for special characters are less useful than in the past.
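
To make the proposal concrete -- this is sketched behaviour, not what any current implementation does -- single back-ticks would stay verbatim, and double back-ticks would additionally interpret escapes:

    `\u2620`    →  <code>\u2620</code>  (verbatim, as today)
    ``\u2620``  →  <code>☠</code>       (escape expanded to U+2620)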

[...]
I have some difficulty figuring out what you mean by "embedded HTML does not lend itself well to the 'split the document into paragraphs'".

Markdown currently distinguishes block-level HTML elements from span-level HTML elements: the former create blocks which are left alone by Markdown (and left outside paragraphs), while the latter get wrapped into paragraphs (as valid HTML expects them to be) along with Markdown-formatted text.

Yes, we are dependent on Markdown finding the HTML before it does the paragraph splitting, so that it doesn’t insert <p> into my HTML -- yet the present heuristic for finding HTML is easily confused (talking about Markdown.pl here); for me it actually got worse when John switched to the Perl library thing.
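
For instance, Markdown.pl only treats HTML as a block if the opening tag starts at the left margin and is separated from the surrounding text by blank lines; something like

    some paragraph text
    <div>a table, say</div>

fails that test, so the <div> is treated as span-level content and ends up wrapped in <p> tags.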

In fact, I presently have my own preprocessor for my Markdown pages (on my site, which sometimes need to embed tables and stuff) to take out the HTML before giving it to Markdown -- although this is also because Markdown does not know about <% scripting %> or <?php tags ?>, and since there is no grammar in which I can just educate it about them, I need to handle that myself in a pre-parse step.
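
The trick itself is simple enough to sketch -- this is not my actual preprocessor, just the placeholder idea in Python, with made-up names:

    import re
    import hashlib

    # Chunks Markdown must never touch: <% ... %> and <?php ... ?>.
    SCRIPT_RE = re.compile(r'<%.*?%>|<\?php.*?\?>', re.DOTALL)

    def take_out(text):
        """Swap script chunks for opaque alphanumeric placeholders."""
        chunks = {}
        def stash(match):
            key = 'SCRIPT' + hashlib.md5(match.group(0).encode('utf-8')).hexdigest()
            chunks[key] = match.group(0)
            return key
        return SCRIPT_RE.sub(stash, text), chunks

    def put_back(html, chunks):
        """Restore the chunks after Markdown has produced its HTML."""
        for key, chunk in chunks.items():
            html = html.replace(key, chunk)
        return html

The placeholders are plain alphanumeric tokens precisely so that Markdown passes them through untouched.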

Anyway, if we agree that everything is dependent on everything that precedes it, I think we can slowly start to agree that *also* having things depend on what follows is problematic.
Well, I think you mean problematic for writing a parser, in which case I disagree.

No, I mean problematic as in: what the hell should we do? You and I disagree about how to interpret the same line of Markdown exactly because it depends on the angle you view it from (read: which token you think is most important), i.e. it is totally subjective…

The “syntax” quickly becomes the implementation [...]
Well, look at how the WHATWG is defining HTML right now: it's exactly that. They describe how the parser works (in English), and everything that matches its behaviour is conforming...

Yes, and do you know *why* they are doing that?

It is because the initial browsers had nothing resembling a real parser; they (seriously!) did things like:

   /* per-tag state flipping, no parse tree (strcmp returns 0 on a match) */
   if (strcmp(tag, "<b>") == 0)
      bold = true;
   else if (strcmp(tag, "</b>") == 0)
      bold = false;
   …

Even though there was an official specification for how to parse HTML (well, SGML), no browser actually did it that way. Authors wrote lots of totally broken pages, browsers interpreted them differently, and browsers didn’t even interpret valid HTML correctly. For example, SGML has the rule that when you close a context, all missing close tags become implicit -- and I haven’t seen a single browser actually do that, even though it is a quite nice feature, since you can leave out lots of close tags. But since the browsers did not have a recursive-descent parser or similar, they had no clue what the current context was, which is likely why they didn’t do it -- that, and the fact that they probably never read the SGML specification.
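
To spell out the SGML rule I mean: in

    <table><tr><td>cell</table>

the </table> closes the table context and implicitly supplies the missing </td> and </tr> -- trivial for a parser that tracks the stack of open elements, and hopeless for tag-by-tag code like the above.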

So W3C said fuck this, let’s totally scrap SGML -- it was too complex for browser implementors to wrap their heads around (understandable!) -- and do a “simple” subset (XML, which turned out to be not that simple in the end, once they retrofitted namespaces and all sorts of crap into it), making XHTML the new thing, totally strict! But no-one cared about XHTML, and no browser really supported it, because we have billions of HTML pages out there; we can’t just drop them.

So given this rather broken situation, the WhatWG decided to try to figure out in which ways all the browsers were broken, document that to get them in sync, and make that the official spec, so that we can move on with (expanding) the HTML specification without cutting backwards compatibility. Browser vendors don’t want existing pages to break, because that makes them lose users; so if W3C adds features to HTML which require the browser to have a strict parser to really work, browser vendors may not implement them, for backwards-compatibility reasons -- or something like that…

You really think Markdown should take the same route? ;)

which brings out an interesting side topic: how should HTML be parsed (or even specified) within Markdown? :-)

I would say strict (for which a grammar is pretty simple)! There is no reason Markdown should conform to the looser WhatWG definition: strict HTML is a subset of WhatWG’s definition, and they made the superset only to be compatible with existing bad pages, which Markdown does not need to support.
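
To make “pretty simple” concrete, here is a rough (and deliberately incomplete) sketch of what the tag rules could look like, in EBNF-ish notation -- the names and details are mine, not from any existing spec:

    element       ::= start_tag content end_tag | empty_element
    start_tag     ::= '<' name attribute* '>'
    end_tag       ::= '</' name '>'
    empty_element ::= '<' name attribute* '/>'
    attribute     ::= name '=' quoted_value

No error recovery, no implied tags: anything that does not match is simply not HTML as far as Markdown is concerned.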

[...]
I think the better solution to that problem would be to disallow emphasis that starts in the middle of one word and ends within another. As for the underscore-emphasis problem, I'd suggest doing just as PHP Markdown Extra does (one of its *documented* features): only allow it on word boundaries, not in the middle of a word. I've yet to get a complaint about that change in behaviour, and I know some people switched to Extra just because of that.

I would prefer that interpretation as well -- I have even requested it in the past, since it is the #1 mistake I see from people who post comments on my blog (they do not escape underscores in snake_case_words or surround them with back-ticks). I can’t find the thread, but most thought it was useful; then again, it is not uncommon for people to argue for a behavior that they never actually use in practice.
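
For the record, the difference on such input:

    my_variable_name

    Markdown.pl:        my<em>variable</em>name
    PHP Markdown Extra: my_variable_name   (underscores inside a word are left alone)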

It sounds like I should switch to Markdown Extra for my blog comments…

[...] I am *not* talking about documenting every single edge case; I am talking about defining the syntax using more traditional means of defining syntaxes.
But how can you write a formal grammar without having to think of the edge cases? Are you suggesting we should ignore edge cases when defining the syntax? And if so, what qualifies as an "edge case"?

I don’t think you have worked with parser generators and grammars.

Basically the implementation is generated from the grammar (when it is possible to specify the grammar fully) -- the grammar can be tricky to get right, but there are no edge cases lurking in the corners in the way there are for a hand-written multi-pass regular-expression parser.

That is, compare it to a mathematical equation: people can’t arrive at different solutions to the same equation unless they are misreading it.

This is why a formal grammar is so powerful: it really specifies *everything* -- the only question is whether it specifies things the way we want. E.g. in the short example I gave, I did not support self-closing HTML tags in paragraph text, so that is simply not supported, no argument there -- the argument is thus whether we should add them to the grammar, not how to interpret the grammar.
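
E.g., in the same EBNF-ish notation as the sketch above, supporting them would just mean adding an alternative:

    inline_html      ::= open_tag | close_tag | self_closing_tag
    self_closing_tag ::= '<' name attribute* '/>'

and every implementation generated from the grammar would then support it identically.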

That said, a grammar can be ambiguous or otherwise ‘invalid’, so to speak -- but if it is specified e.g. as an ANTLR grammar, ANTLR will tell us which rules cause which problems.


