Le 2007-08-19 à 1:07, Allan Odgaard a écrit :

But I'm not the one in charge of that page. I'd suggest checking the test suites announced on this list: most decisions regarding edge cases have been "documented" there as regression tests. If some behaviour is part of the test suite, you can be pretty much certain that it's not a parser quirk.

I have not looked at these, that is, I did look at Gruber’s original test suite, and it basically just tested a lot of simple cases. This is IMO *not* the way to go about things, i.e. defining a grammar based on a lot of test cases.

You're complaining about the lack of precision in the syntax definition (a valid complaint). I can't really address that complaint (the document is not under my control), but I'm trying to help by pointing out that some test cases (not all, obviously) include some clues not found in the documentation. Obviously, and as you say, this isn't a replacement for more precise documentation.

Now that I think about that, there's probably a couple of things which are only defined in the version history too (especially Markdown.pl 1.0.1's history).

Take e.g. this letter from last year http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000151.html -- here I talk about the problems which arise from mixing nested block elements and the lazy-mode rules. I think this should be clearly defined, not just defined via test cases, because we need clear rules, not recipes for how to handle a few select cases.

One thing is certain: any output which is invalid HTML is a bug. Beyond that, some things are unintended (bugs) and some things are as intended (but not always documented). Some things are documented and easy to find (in the syntax description), others are "documented" but buried deeper in the version history or in test cases.

So, again, we agree that the documentation is suboptimal.


[...]
Ok, back to performance.

Just to be clear, my motivation here is *not* performance. My motivation is getting Markdown interpreted the same in different contexts, which it presently isn’t always, i.e. to get a clearly defined syntax, so I can ensure that the highlight in TM follows the standard to the point (and thus the syntax highlight you get follows the look of the post to your blog, the HTML constructed from Markdown.pl, or the local preview of the Markdown done with redcloth).

Okay. Then on that goal I'm with you. The less divergence there is between implementations the better.


How many times do you start a new Perl process when building the manual?
[...]
Is the manual available in Markdown format somewhere? I'd like to do some benchmarking.

http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000152.html

Great, thank you. Posting some results in a new thread right now...


I'm totally not convinced that creating a byte-by-byte parser in Perl or PHP is going to be very useful.

The key here is really having clearly defined state transitions.

I'm not sure what you mean by that in relation to what I wrote above.


I am *not* touting TM’s parser as fast. I am trying to convince you that the current way things are done is pretty bad, for many reasons: the (lack of) formality with which the grammar is defined, the (lack of) simplicity in the code (and thus extensibility of the language grammar), and also the (lack of) performance. The current implementation effectively does not support nested constructs, and thus has to fake them by iteratively mangling subsets of the document to treat each as a “nested” environment; this is complicated a lot by the documented support for embedded HTML (which is supposed to pass through the Markdown parser untouched, but in practice some edge cases are not handled correctly).

There are many complaints about different things here. About the syntax, you complain that it is badly defined (I agree).

You then talk about the lack of simplicity in the code, which I assume applies to Markdown.pl (or PHP Markdown), not the syntax; or perhaps you mean that the syntax makes it impossible to write simple code to parse it? I'm not sure I understand what you mean here.

Then you talk about the lack of extensibility of the language grammar (I'm not sure what you mean by that -- is there a language grammar for Markdown anyway?). Then you go on to the lack of performance (are you calling this a syntax issue, a parser issue, or both?).

Finally, you say the current implementation (I assume you're talking about Markdown.pl, perhaps PHP Markdown) does not "effectively" support nested constructs (which constructs? what does "effectively" mean here?) but "supports" them somewhat by recursively reparsing parts of the document. Very true, but how is that a problem for you?

I assume the latter is a problem for you if you take every quirk and bug and try to reproduce them with an incremental parser: it gets needlessly complicated. I don't think that's the way to go if you want to produce an incremental parser.
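For readers following along: the "recursively reparsing parts of the document" approach mentioned above can be sketched roughly like this. This is a toy illustration in Python, *not* Markdown.pl's actual code; the function names and the regexes are my own assumptions, and a real converter would of course handle paragraphs, lists, and much more:

```python
import re

# Toy sketch of the "reparse a subset of the document" approach:
# blockquote content is extracted, one level of '>' markers is
# stripped, and the whole converter is run again recursively on
# the inner text. Hypothetical names; not Markdown.pl's code.
def convert(text):
    text = do_blockquotes(text)
    # ...paragraphs, lists, etc. would follow here in a real converter...
    return text

def do_blockquotes(text):
    def repl(m):
        # Strip one quote level, then recurse into the whole converter.
        inner = re.sub(r'(?m)^> ?', '', m.group(0))
        return '<blockquote>\n%s\n</blockquote>' % convert(inner)
    # A run of consecutive lines starting with '>' is one blockquote.
    return re.sub(r'(?m)(^>.*\n?)+', repl, text)
```

Nesting falls out of the recursion (a `> > foo` line becomes a blockquote inside a blockquote), which is exactly the "fake it by reparsing subsets" strategy being discussed: it works, but each construct pays the cost of rescanning its content from scratch.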


[...] If you wish to create a better definition of the language, I'll be glad to help by answering questions I can answer, exemplifying edge cases and their desirable outputs, etc.

We pretty much went over that last year, and I thought I had made the point by now, that what I am after is defining the syntax, not the edge-cases -- I can read how Markdown.pl deals with them myself (although it deals with several by producing invalid code).

Yeah, but let me explain better what I meant by this (today, and last year too)...

Basically, I'm not going to start a formal grammar for Markdown from scratch on my own. I'd be glad to help though.

You seem to have already done a good part of the job by writing TM's parser. While not perfect, I think a formal grammar based on it (or perhaps something else, such as Pandoc) could be a great starting point.

Once we have this, it'll be easier for me and others to comment on it, and to spot any differences with the current Markdown.pl. Some differences will be errors or unintended side effects on Markdown.pl's part which the formal syntax should ignore; others will be the intended output and will need to be "ported" to the grammar. These two things are not always easy to distinguish, and there I can help, since I know Markdown.pl's innards pretty well (they are mostly the same as PHP Markdown's).

So by this process, I believe we can evolve the formal syntax to a point where it handles things pretty well. It can't be *the* formal definition without John's approval, but it could certainly serve as a better reference for other implementors than Markdown.pl will ever be.

If you want the syntax changed so that it better fits your parser (and possibly other incremental parsers), then I can provide my point of view, but I'm not the one who makes the final decision.

Unfortunately Gruber is dead silent when it comes to this.

Some things are certainly going to stay ambiguous without some insight from John, but there's still a lot that can be done without it.

It may come off as self-serving to approach things from the traditional incremental-parser (formal grammar / BNF) point of view, but I really think this would be best for bringing all implementations of the Markdown parser in sync, giving better performance, avoiding as many broken edge cases as we have now, and letting tools provide accurate syntax highlighting.

I don't really want to see the syntax changed all over only to make it easier to implement as an incremental parser. I don't think such a parser would be usable (read: fast enough) in PHP anyway. Well, perhaps it could be, but not in the traditional sense of an incremental parser; the concept would probably need to be stretched a lot to fit with regular expressions.

Already there are several forks of Markdown (i.e. where stuff is added to the syntax), so I don’t think the best approach (for me) would be to start yet another fork -- Markdown should be one standard, not a dozen different ones, and that is why I am so keen on having a clearly defined standard.

If you don't add features and don't do things otherwise than the documentation says, you don't have to call it a fork. That the syntax is unclear about a couple of things doesn't imply that an attempt at clarifying it is forking. Better to call it one of the multiple possible interpretations of the syntax as currently defined. And if that straightened-up syntax is good enough, it could by itself become a de facto reference for other implementors.

Yes, and personally I would say that whenever you do [foo][bar] you get a link, regardless of whether or not bar is a defined reference -- if bar is not a defined reference, you could default to making it reference the URL ‘#’. This makes parsing *much* easier (here I am thinking about cases like ‘*dum [foo*][bar]’ or ‘[*foo][bar] dum*’). The three reasons for choosing this rule are: 1) partial documents are tokenized the same as full documents (consider that my references may come from an external file, yet some things may still work on the “partial” document, i.e. the one without the actual bibliography, such as a local preview and the syntax highlight); 2) no one would likely make use of the “feature” that [foo][bar] is the raw text [foo][bar] when bar is undefined (this is equivalent to saying that <p>foo</b></p> should keep </b> as literal text, since no <b> was found); and 3) it really is easier for the user to relate to “the pattern [something][something] is a link”.

Hum, I disagree strongly here that creating links to nowhere (#) is the solution to undefined reference links. This is bad usability for authors, who will need to test every link in the resulting page to make sure it points where it should, and for readers, who will click a link expecting to get somewhere but getting nowhere. Leaving it as text makes it clear to everyone that there is no link there (whatever the author's intent) and makes authors more likely to find their error by visually inspecting the browser rendering of the output.

A much better compromise in my opinion would be to just treat these brackets specially and not allow emphasis to span them in the cases above. I'm not entirely sure that's the ideal thing to do, but I don't really expect anyone to do emphasis like that consciously (except as a test case), so it's probably a good enough solution.
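For concreteness, the two policies being debated can be sketched side by side. This is a minimal, hypothetical illustration in Python (the regex and function names are my own, and much simpler than any real Markdown implementation):

```python
import re

# Matches [text][ref]; an empty second bracket means the label is the ref.
# A deliberately naive pattern for illustration only.
REF_LINK = re.compile(r'\[([^\]]+)\]\[([^\]]*)\]')

def render_ref_links(text, definitions, undefined_to_hash=False):
    """Replace [text][ref] with <a> tags using the `definitions` dict.

    When a reference is undefined:
      - undefined_to_hash=True  -> link to "#" (the proposal above)
      - undefined_to_hash=False -> leave the raw text (Markdown.pl's behaviour)
    """
    def repl(m):
        label, ref = m.group(1), m.group(2) or m.group(1)
        url = definitions.get(ref.lower())
        if url is None:
            if undefined_to_hash:
                return '<a href="#">%s</a>' % label
            return m.group(0)  # keep the literal [text][ref]
        return '<a href="%s">%s</a>' % (url, label)
    return REF_LINK.sub(repl, text)
```

Note how the `undefined_to_hash=True` policy lets the tokenization stay identical whether or not the definitions are available (the point about partial documents), while `False` reproduces the leave-it-as-text behaviour argued for just above.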


Michel Fortin
[EMAIL PROTECTED]
http://www.michelf.com/


_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss