Re: Incremental parser (was: Backtick Hickup)

Allan Odgaard Sat, 18 Aug 2007 22:07:55 -0700

On Aug 14, 2007, at 9:41 AM, Michel Fortin wrote:

[...]
I agree that the syntax needs to be defined more clearly.

I am glad that we are finally reaching agreement on this. You may not recall, but a year ago you asked me: “is it so much important that these border cases be consistent across all implementations?” [1]

[1]: http://six.pairlist.net/pipermail/markdown-discuss/2006-July/000146.html

I think the syntax page should be updated when we find an ambiguity.

That would be nice, yes -- but IMO we need to take a step back and really define the syntax in a more formal way, because just clarifying a lot of border cases is tedious and complex. Doing something closer to a real grammar would not leave us with all these ambiguities in the first place. As stated, this is also why I brought up the incremental parser: it is based on a state machine, and a state machine, unlike the present ad-hoc parser, has a clear transition from state to state based on the input.
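To illustrate what "a clear transition from state to state, based on the input" can look like, here is a minimal sketch in Python. It is not TextMate's parser or any existing Markdown implementation: the two states, the `TRANSITIONS` table and the `tokenize` function are hypothetical names, and the rule is deliberately simplified (single backticks only, no escapes, an unterminated span stays in the code state).

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()
    CODE = auto()


# Hypothetical transition table:
# (current state, "is this character a backtick?") -> next state.
TRANSITIONS = {
    (State.TEXT, True): State.CODE,
    (State.TEXT, False): State.TEXT,
    (State.CODE, True): State.TEXT,
    (State.CODE, False): State.CODE,
}


def tokenize(source):
    """Split source into (state, text) runs, one character at a time."""
    tokens = []
    state = State.TEXT
    buffer = []
    for ch in source:
        next_state = TRANSITIONS[(state, ch == "`")]
        if next_state is not state:
            # Leaving a state: emit what was collected and drop the delimiter.
            if buffer:
                tokens.append((state, "".join(buffer)))
            buffer = []
        else:
            buffer.append(ch)
        state = next_state
    if buffer:
        tokens.append((state, "".join(buffer)))
    return tokens
```

With this sketch, `tokenize("a `b` c")` yields a text run `"a "`, a code run `"b"` and a text run `" c"`. Every (state, input) pair has exactly one entry in the table, so there is no input for which the behaviour is left undefined, which is the property argued for above.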

But I'm not the one in charge of that page. I'd suggest checking the test suites announced on this list: most decisions regarding edge cases have been "documented" there as regression tests. If some behaviour is part of the test suite, you can be pretty much certain that it's not a parser quirk.

I have not looked at these; that is, I did look at Gruber’s original test suite, and it basically just tested a lot of simple cases. This is IMO *not* the way to go about things, i.e. defining a grammar based on a lot of test cases.

Take e.g. this letter from last year http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000151.html -- here I talk about the problems which arise from mixing nested block elements and the lazy-mode rules. I think this should be clearly defined, not just defined via test cases, because we need clear rules, not recipes about how to handle a few select cases.

[...]
Syntax highlighting isn't the same thing as "parsing" Markdown, not in my mind. It's more like "tokenizing" Markdown [...]

But building a parse-tree is pretty easy if you have already tokenized Markdown correctly. Anyway, TM does build the parse-tree as well. This is slightly beside the point though; I was just saying TM does take the “incremental approach”, and it works quite well for the actual documents out there.

[...]
Ok, back to performance.

Just to be clear, my motivation here is *not* performance. My motivation is getting Markdown interpreted the same in different contexts, which it presently isn’t always, i.e. to get a clearly defined syntax, so I can ensure that the highlight in TM follows the standard to the letter (and thus the syntax highlight you get matches the look of the post on your blog, the HTML constructed from Markdown.pl, or the local preview of the Markdown done with redcloth).

How many times do you start a new Perl process when building the manual?
[...]
Is the manual available in Markdown format somewhere? I'd like to do some benchmarking.

http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000152.html

I'm totally not convinced that creating a byte-by-byte parser in Perl or PHP is going to be very useful.


The key here is really having clearly defined state transitions.

Using regular expressions is much faster than executing equivalent PHP code in my experience [...] I'd be surprised if it [PHP Markdown / Markdown.pl] ever reaches the speed of TextMate's dedicated parsing state machine.

TM has a language grammar declaration where each token is defined by a regexp -- it could be a lot faster if a dedicated parser was written, but my goal here was flexibility, not speed.
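As a rough sketch of the general idea of "each token is defined by a regexp", the Python below declares tokens as a table of named patterns and tries them in order at the current position. This is not TextMate's grammar format: the `RULES` table, the rule names and the patterns are invented for illustration and cover only a tiny, simplified subset of inline Markdown.

```python
import re

# Hypothetical rule table: (token name, pattern), tried in order at each position.
RULES = [
    ("code", re.compile(r"`[^`\n]+`")),
    ("strong", re.compile(r"\*\*[^*\n]+\*\*")),
    ("emphasis", re.compile(r"\*[^*\n]+\*")),
    # Fallback: a run of ordinary characters, or a single stray delimiter.
    ("text", re.compile(r"[^`*]+|[`*]")),
]


def tokenize(source):
    """Return a list of (token name, matched text) pairs covering all of source."""
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in RULES:
            match = pattern.match(source, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
    return tokens
```

The fallback rule always consumes at least one character, so the loop cannot stall. Adding a construct means adding a row to the table rather than another pass over the document, which is the flexibility trade-off mentioned above: simple to extend, but slower than a hand-written parser would be.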

I am *not* touting TM’s parser as fast. I am trying to convince you that the current way things are done is pretty bad, and bad for many reasons: the (lack of) formalness with which the grammar is defined, the (lack of) simplicity in the code (and thus extensibility of the language grammar), and also the (lack of) performance. The current implementation effectively does not support nested constructs, and thus has to fake it by doing iterative manglings of subsets of the document to treat each as a “nested” environment, which is complicated a lot by how it is documented to support embedded HTML (untouched by the Markdown parser, though in practice some edge cases are not handled correctly here).

[...] If you wish to create a better definition of the language, I'll be glad to help by answering questions I can answer, exemplifying edge cases and their desirable outputs, etc.

We pretty much went over that last year, and I thought I had made the point by now, that what I am after is defining the syntax, not the edge-cases -- I can read how Markdown.pl deals with them myself (although it deals with several by producing invalid code).

If you want the syntax changed so that it better fits your parser (and possibly other incremental parsers) then I can provide my point of view, but I'm not the one who takes the final decision.


Unfortunately Gruber is dead silent when it comes to this.

It may come off as self-serving to approach things from the traditional incremental-parser (formal grammar / BNF) POV, but it is because I really think this would be best for bringing all implementations of the Markdown parser in sync, giving better performance, not having as many broken edge-cases as now, and having the tools provide accurate syntax highlight.

Already there are several forks of Markdown (i.e. where stuff is added to the syntax), so I don’t think the best approach (for me) would be to start yet another fork -- Markdown should be one standard, not a dozen different ones, and that is why I am so keen on having a clearly defined standard.

[...]
There's a tricky case here however: [foo][bar] isn't a link in Markdown unless "bar" is defined somewhere; if it isn't defined, it's plain text. That may seem like an edge case right now, but when/if Markdown gets the [shortcut link] syntax (as added to the current betas of 1.0.2), this may become a more interesting problem for syntax highlighting as any bracketed text will then become a potential link depending on whether or not it has been defined elsewhere in the document.

Yes, and personally I would say whenever you do [foo][bar] you get a link, regardless of whether or not bar is a defined reference -- if bar is not a defined reference, you could default to making it reference the URL ‘#’ -- this makes parsing *much* easier (here I am thinking about the case where you do: ‘*dum [foo*][bar]’ or ‘[*foo][bar] dum*’). The 3 reasons for choosing this rule are that 1) partial documents are tokenized the same as full documents (consider that my references may be from an external file, yet some stuff may still work on the “partial” document, i.e. the one w/o the actual bibliography, such as a local preview and the syntax highlight), 2) no-one would likely make use of the “feature” that [foo][bar] is the raw text [foo][bar] when bar is undefined (this is equivalent to saying that <em>foo should keep <em> as literal text, since no </em> was found), and 3) it really is easier for the user to relate to “the pattern [something][something] is a link”.
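A minimal Python sketch of the rule proposed above, where [foo][bar] always becomes a link and an undefined reference falls back to the URL ‘#’. This is not how Markdown.pl behaves; the `LINK` pattern and the `render_links` function are hypothetical, and the pattern ignores nested brackets, escapes and the rest of the inline syntax.

```python
import html
import re

# Hypothetical pattern: [label][reference], with no nested brackets.
LINK = re.compile(r"\[([^\]\[]+)\]\[([^\]\[]+)\]")


def render_links(text, references=None):
    """Turn every [label][ref] into an <a> element.

    references maps reference names to URLs. A name that is missing from
    the map links to '#' instead of leaving the brackets as raw text.
    """
    references = references or {}

    def replace(match):
        label, ref = match.group(1), match.group(2)
        url = references.get(ref, "#")  # proposed default for undefined refs
        return '<a href="{}">{}</a>'.format(html.escape(url, quote=True), label)

    return LINK.sub(replace, text)
```

Under this rule, whether a span is a link no longer depends on definitions found elsewhere: `render_links("see [foo][bar]")` gives `see <a href="#">foo</a>`, and passing `{"bar": "http://example.com/"}` as `references` changes only the `href`. A partial document and the full document therefore produce the same token structure.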


_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss
