Le 2007-08-19 à 1:07, Allan Odgaard a écrit :

But I'm not the one in charge of that page. I'd suggest checking the test suites announced on this list: most decisions regarding edge cases have been "documented" there as regression tests. If some behaviour is part of the test suite, you can be pretty much certain that it's not a parser quirk.

I have not looked at these, that is, I did look at Gruber’s original test suite, and it basically just tested a lot of simple cases. This is IMO *not* the way to go about things, i.e. defining a grammar based on a lot of test cases.

You're complaining about the lack of precision in the syntax definition (a valid complaint). I can't really address that complaint (the document is not under my control), but I'm trying to help by pointing out that some test cases (not all, obviously) include some clues not found in the documentation. Obviously, and as you say, this isn't a replacement for more precise documentation.

Now that I think about that, there's probably a couple of things which are only defined in the version history too (especially Markdown.pl 1.0.1's history).

Take e.g. this letter from last year http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000151.html -- here I talk about the problems which arise from mixing nested block elements and the lazy-mode rules. I think this should be clearly defined, not just defined via test cases, because we need clear rules, not recipes for how to handle a few select cases.

One thing is certain: any output which is invalid HTML is a bug. Beyond that, some things are unintended (bugs) and some things are as intended (but not always documented). Some things are documented and easy to find (in the syntax description), others are "documented" but buried deeper in the version history or in test cases.

So, again, we agree that the documentation is suboptimal.


[...]
Ok, back to performance.

Just to be clear, my motivation here is *not* performance. My motivation is getting Markdown interpreted the same in different contexts, which it presently isn’t always, i.e. to get a clearly defined syntax, so I can ensure that the highlight in TM follows the standard to the point (and thus the syntax highlight you get follows the look of the post to your blog, the HTML constructed from Markdown.pl, or the local preview of the Markdown done with redcloth).

Okay. Then on that goal I'm with you. The less divergence there is between implementations the better.


How many times do you start a new Perl process when building the manual?
[...]
Is the manual available in Markdown format somewhere? I'd like to do some benchmarking.

http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000152.html

Great, thank you. Posting some results in a new thread right now...


I'm totally not convinced that creating a byte-by-byte parser in Perl or PHP is going to be very useful.

The key here is really having clearly defined state transitions.

I'm not sure what you mean by that in relation to what I wrote above.


I am *not* touting TM’s parser as fast. I am trying to convince you that the current way things are done is pretty bad, for many reasons: the (lack of) formality with which the grammar is defined, the (lack of) simplicity in the code (and thus extensibility of the language grammar), and also the (lack of) performance. The current implementation effectively does not support nested constructs, and thus has to fake them by iteratively mangling subsets of the document to treat each as a “nested” environment; this is complicated a lot by the documented support for embedded HTML (which is supposed to pass through the Markdown parser untouched, but in practice some edge cases are not handled correctly).

There are many complaints about different things here. About the syntax, you complain that it is badly defined (I agree).

You then talk about the lack of simplicity in the code, which I assume applies to Markdown.pl (or PHP Markdown), not the syntax; or perhaps you mean that the syntax makes it impossible to write simple code to parse it? I'm not sure I understand what you mean here.

Then you talk about the lack of extensibility of the language grammar (I'm not sure what you mean by that -- is there a language grammar for Markdown anyway?). Then you go on to the lack of performance (are you calling this a syntax issue, a parser issue, or both?).

Finally, you say the current implementation (I assume you're talking about Markdown.pl, perhaps PHP Markdown) does not "effectively" support nested constructs (which constructs? what does "effectively" mean here?) but "supports" them somewhat by recursively reparsing parts of the document. Very true, but how is that a problem for you?

I assume the latter is a problem for you if you take every quirk and bug and try to reproduce them with an incremental parser: it gets needlessly complicated. I don't think that's the way to go if you want to produce an incremental parser.
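For readers following along: the "recursively reparsing parts of the document" approach mentioned above can be sketched roughly like this. This is a toy illustration in Python, *not* Markdown.pl's actual code; the function names and the regexes are my own assumptions, and a real converter would of course handle paragraphs, lists, and much more:

```python
import re

# Toy sketch of the "reparse a subset of the document" approach:
# blockquote content is extracted, one level of '>' markers is
# stripped, and the whole converter is run again recursively on
# the inner text. Hypothetical names; not Markdown.pl's code.
def convert(text):
    text = do_blockquotes(text)
    # ...paragraphs, lists, etc. would follow here in a real converter...
    return text

def do_blockquotes(text):
    def repl(m):
        # Strip one quote level, then recurse into the whole converter.
        inner = re.sub(r'(?m)^> ?', '', m.group(0))
        return '<blockquote>\n%s\n</blockquote>' % convert(inner)
    # A run of consecutive lines starting with '>' is one blockquote.
    return re.sub(r'(?m)(^>.*\n?)+', repl, text)
```

Nesting falls out of the recursion (a `> > foo` line becomes a blockquote inside a blockquote), which is exactly the "fake it by reparsing subsets" strategy being discussed: it works, but each construct pays the cost of rescanning its content from scratch.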


[...] If you wish to create a better definition of the language, I'll be glad to help by answering questions I can answer, exemplifying edge cases and their desirable outputs, etc.

We pretty much went over that last year, and I thought I had made the point by now, that what I am after is defining the syntax, not the edge-cases -- I can read how Markdown.pl deals with them myself (although it deals with several by producing invalid code).

Yeah, but let me explain better what I meant by this (today, and last year too)...

Basically, I'm not going to start a formal grammar for Markdown from scratch on my own. I'd be glad to help though.

You seem to have already done a good part of the job by writing TM's parser. While not perfect, I think a formal grammar based on it (or perhaps something else, such as Pandoc) could be a great starting point.

Once we have this, it'll be easier for me and others to comment on it, and to spot any differences with the current Markdown.pl. Some differences will be errors or unintended side effects on Markdown.pl's part which the formal syntax should ignore; others will be the intended output and will need to be "ported" to the grammar. These two things are not always easy to distinguish, and there I can help, since I know Markdown.pl's innards pretty well (they are mostly the same as PHP Markdown's).

So by this process, I believe we can evolve the formal syntax to a point where it handles things pretty well. It can't be *the* formal definition without John's approval, but it could certainly serve as a better reference for other implementors than Markdown.pl will ever be.

If you want the syntax changed so that it better fits your parser (and possibly other incremental parsers), then I can provide my point of view, but I'm not the one who makes the final decision.

Unfortunately Gruber is dead silent when it comes to this.

Some things are certainly going to stay ambiguous without some insight from John, but there's still a lot that can be done without it.

It may come off as self-serving to approach things from the traditional incremental-parser (formal grammar / BNF) point of view, but I really think this would be best for bringing all implementations of the Markdown parser in sync, giving better performance, avoiding as many broken edge cases as we have now, and letting tools provide accurate syntax highlighting.

I don't really want to see the syntax changed all over only to make it easier to implement as an incremental parser. I don't think such a parser would be usable (read: fast enough) in PHP anyway. Well, perhaps it could be, but not in the traditional sense of an incremental parser; the concept would probably need to be stretched a lot to fit with regular expressions.

Already there are several forks of Markdown (i.e. where stuff is added to the syntax), so I don’t think the best approach (for me) would be to start yet another fork -- Markdown should be one standard, not a dozen different ones, and that is why I am so keen on having a clearly defined standard.

If you don't add features and don't do things otherwise than the documentation says, you don't have to call it a fork. That the syntax is unclear about a couple of things doesn't imply that an attempt at clarifying it is forking. Better to call it one of the multiple possible interpretations of the syntax as currently defined. And if that straightened-up syntax is good enough, it could by itself become a de facto reference for other implementors.

Yes, and personally I would say that whenever you do [foo][bar] you get a link, regardless of whether or not bar is a defined reference -- if bar is not a defined reference, you could default to making it reference the URL ‘#’. This makes parsing *much* easier (here I am thinking about cases like ‘*dum [foo*][bar]’ or ‘[*foo][bar] dum*’). The three reasons for choosing this rule are: 1) partial documents are tokenized the same as full documents (consider that my references may come from an external file, yet some things may still work on the “partial” document, i.e. the one without the actual bibliography, such as a local preview and the syntax highlight); 2) no one would likely make use of the “feature” that [foo][bar] is the raw text [foo][bar] when bar is undefined (this is equivalent to saying that <p>foo</b></p> should keep </b> as literal text, since no <b> was found); and 3) it really is easier for the user to relate to “the pattern [something][something] is a link”.

Hum, I disagree strongly here that creating links to nowhere (#) is the solution to undefined reference links. This is bad usability for authors, who will need to test every link in the resulting page to make sure it points where it should, and for readers, who will click a link expecting to get somewhere but getting nowhere. Leaving it as text makes it clear to everyone that there is no link there (whatever the author's intent) and makes authors more likely to find their error by visually inspecting the browser rendering of the output.

A much better compromise in my opinion would be to just treat these brackets specially and not allow emphasis to span them in the cases above. I'm not entirely sure that's the ideal thing to do, but I don't really expect anyone to do emphasis like that consciously (except as a test case), so it's probably a good enough solution.
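For concreteness, the two policies being debated can be sketched side by side. This is a minimal, hypothetical illustration in Python (the regex and function names are my own, and much simpler than any real Markdown implementation):

```python
import re

# Matches [text][ref]; an empty second bracket means the label is the ref.
# A deliberately naive pattern for illustration only.
REF_LINK = re.compile(r'\[([^\]]+)\]\[([^\]]*)\]')

def render_ref_links(text, definitions, undefined_to_hash=False):
    """Replace [text][ref] with <a> tags using the `definitions` dict.

    When a reference is undefined:
      - undefined_to_hash=True  -> link to "#" (the proposal above)
      - undefined_to_hash=False -> leave the raw text (Markdown.pl's behaviour)
    """
    def repl(m):
        label, ref = m.group(1), m.group(2) or m.group(1)
        url = definitions.get(ref.lower())
        if url is None:
            if undefined_to_hash:
                return '<a href="#">%s</a>' % label
            return m.group(0)  # keep the literal [text][ref]
        return '<a href="%s">%s</a>' % (url, label)
    return REF_LINK.sub(repl, text)
```

Note how the `undefined_to_hash=True` policy lets the tokenization stay identical whether or not the definitions are available (the point about partial documents), while `False` reproduces the leave-it-as-text behaviour argued for just above.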


Michel Fortin
[EMAIL PROTECTED]
http://www.michelf.com/


_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss