On Aug 14, 2007, at 9:41 AM, Michel Fortin wrote:

[...]
I agree that the syntax needs to be defined more clearly.

I am glad that we are finally reaching agreement on this. You may not recall, but a year ago you asked me: “is it so much important that these border cases be consistent across all implementations?” [1]

[1]: http://six.pairlist.net/pipermail/markdown-discuss/2006-July/ 000146.html

I think the syntax page should be updated when we find an ambiguity.

That would be nice, yes -- but IMO we need to take a step back and really define the syntax in a more formal way, cause just clarifying a lot of border cases is tedious and complex. Doing something closer to a real grammar would not leave us with all these ambiguities in the first place, as stated, this is also why I brought up the incremental parser, because this works based on a state machine, a state machine has a clear transition from state to state, based on the input, not the present ad-hoc parser.

But I'm not the one in charge of that page. I'd suggest checking the testsuites announced on this list: most decisions regarding edge cases have been "documented" there as regression tests. If some behaviour is part of the test suite, you can be pretty much certain that it's not a parser quirk.

I have not looked at these, that is, I did look at Gruber’s original test suite, and it basically just tested a lot of simple cases. This is IMO *not* the way to go about things, i.e. defining a grammar based on a lot of test cases.

Take e.g. this letter from last year http://six.pairlist.net/ pipermail/markdown-discuss/2006-August/000151.html -- here I talk about the problems which arise from mixing nested block elements and the lazy-mode rules. I think this should be clearly defined, not just defined via test cases, because we need clear rules, not recipes about how to handle a few select cases.

[...]
Syntax highlighting isn't the same thing as "parsing" Markdown, not in my mind. It's more like "tokenizing" Markdown [...]

But building a parse-tree is pretty easy if you have already tokenized Markdown correctly. Anyway, TM does build the parse-tree as well. This is slightly beyond the point though, I was just saying TM does take the “incremental approach”, and it works quite well for the actual documents out there.

[...]
Ok, back to performance.

Just to be clear, my motivation here is *not* performance. My motivation is getting Markdown interpreted the same in different contexts, which it presently isn’t always, i.e. to get a clearly defined syntax, so I can ensure that the highlight in TM follows the standard to the point (and thus the syntax highlight you get follows the look of the post to your blog, the HTML constructed from Markdown.pl, or the local preview of the Markdown done with redcloth).

How many time do you start a new Perl process when building the manual?
[...]
Is the manual available in Markdown format somewhere? I'd like to do some benchmarking.

http://six.pairlist.net/pipermail/markdown-discuss/2006-August/ 000152.html

I'm totally not convinced that creating a byte-by-byte parser in Perl or PHP is going to be very useful.

The key here is really having clearly defined state transitions.

Using regular expressions is much faster than executing equivalent PHP code in my experience [...] I'd be surprised if it [PHP Markdown / Markdown.pl] ever reach the speed of TextMate's dedicated parsing state machine.

TM has a language grammar declaration where each token is defined by a regexp -- it could be a lot faster if a dedicated parser was written, but my goal here was flexibility, not speed.

I am *not* touting TM’s parser as fast, I am trying to convince you that the current way things are done, is pretty bad, and bad for many reasons, the (lack of) formalness with which the grammar is defined, the (lack of) simplicty in the code (and thus extensibility of the language grammar), and also (lack of) performance (by how the current implementation effectively does not support nested constructs, and thus have to fake it by doing iterative manglings of subsets of the document, to treat that as a nested” environment, complicated a lot by how it is documented to support embedded HTML (untouched by the Markdown parser, but in practice some edge cases are not handled correctly here)).

[...] If you wish to create a better definition of the language, I'll be glad to help by answering questions I can answer, exemplifying edge cases and their desirable outputs, etc.

We pretty much went over that last year, and I thought I had made the point by now, that what I am after is defining the syntax, not the edge-cases -- I can read how Markdown.pl deals with them myself (although it deals with several by producing invalid code).

If you want the syntax changed so that it better fit your parser (and possibly other incremental parsers) then I can provide my point of view, but I'm not the one who takes the final decision.

Unfortunately Gruber is dead silent when it comes to this.

It may come off as self-serving to approach things from the traditional incremental-parser (formal grammar / BNF) POV, but it is because I really think this would be best for bringing all implementations of the Markdown parser in sync, give better performance, not have as many broken edge-cases as now, and have the tools provide accurate syntax highlight.

Already there are several forks of Markdown (i.e. where stuff is added to the syntax), so I don’t think the best approach (for me) would be to start yet another fork -- Markdown should be one standard, not a dozen different ones, and that is why I am so keen on having a clearly defined standard.

[...]
There's a tricky case here however: [foo][bar] isn't a link in Markdown unless "bar" is defined somewhere; if it isn't defined, it's plain text. That may seem like an edge case right now, but when/if Markdown gets the [shortcut link] syntax (as added to the current betas of 1.0.2), this may become a more interesting problem for syntax highlighting as any bracketed text will then become a potential link depending on whether or not it has been defined elsewhere in the document.

Yes, and personally I would say whenever you do [foo][bar] you get a link, regardless of whether or not bar is a defined reference -- if bar is not a defined reference, you could default to make it reference the URL ‘#’ -- this makes parsing *much* easier (here I am thinking about the case where you do: ‘*dum [foo*][bar]’ or ‘[*foo] [bar] dum*’. The 3 reasons for choosing this rule is that 1) partial documents are tokenized the same as full document (consider that my references may be from an external file, yet some stuff may still work on the “partial” document (i.e. the one w/o the actual bibliography, such as a local preview and the syntax highlight), 2) no-one would likely make use of the “feature” that [foo][bar] is the raw text [foo][bar] when bar is undefined (this is equivalent to saying that <p>foo</b></p> should keep </b> as literal text, since no <b> was found), and 3) it really is easier for the user to relate to “the pattern [something][something] is a link”.

_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss

Reply via email to