Inspired by Jan’s excellent idea of posting what we each plan to work on, I 
thought I’d chip in with my intentions:

- Complete development of a generic parser library based on Parsing Expression 
Grammars [1,2], which will serve as a basis for parsing non-XML based file 
formats like Markdown, AsciiDoc, reStructuredText, and RTF. I’ve been dabbling 
with this on and off for about a year now, and have recently rewritten it 
completely. I also foresee potential in extending this into a high-level 
programming language for expressing transformations, similar to XSLT or 
Stratego/XT [3], but that’s something for a little further down the track.

I’ll put this code in a separate, experimental branch once it’s in a vaguely 
reasonable state - Real Soon Now (TM).

- Implement parsers for XML and HTML. Theoretically this could be done with the 
PEG-based parser above, but it will be quicker and easier to do “manually”, as 
neither is very complicated. This will allow us to remove the external 
dependencies on libxml2, iconv, and htmltidy. I’ll likely do this first, given 
that it’s the easiest.

Note that given these dependencies will shortly be going away, I recommend 
against trying to isolate them in platform, as doing so will likely be more 
effort than writing the parsers themselves due to the dependencies on data 
structures used in core (specifically the DOM classes), which aren’t accessible 
from platform.

- Document more of the code base. This will include coding conventions - how 
things like error handling, memory management, and string 
representation/manipulation are carried out by the library. It will also cover 
the core classes and parts of the existing Word filter.

For those of you interested in formal language theory and parsing techniques, I 
recommend reading [4], which describes some of the history and recent 
developments, such as packrat parsing, that make for simpler and more practical 
implementations of parsers for a more general range of languages than those 
handled by the LL/LR grammars of old. Flex and Bison users in particular should 
find this a relieving read :)
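
To give a rough feel for the packrat idea, here is a toy sketch in Python. It 
is not the library described above, and the grammar, the rule names, and the 
evaluate() function are all made up for illustration. Each grammar rule becomes 
a function from an input position to either a result or a failure, and every 
(rule, position) result is memoised so that backtracking never repeats work.

```python
from functools import lru_cache

# Hypothetical grammar, in PEG notation (illustration only):
#   Sum     <- Product ('+' Product)*
#   Product <- Atom ('*' Atom)*
#   Atom    <- [0-9]+ / '(' Sum ')'


def evaluate(text):
    """Parse and evaluate `text`; return an int, or None if it does not match.

    Each rule takes a position and returns (value, next_position) on
    success or None on failure.
    """

    @lru_cache(maxsize=None)  # memoise per position: the "packrat" part
    def atom(pos):
        # First alternative: one or more digits.
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        if end > pos:
            return int(text[pos:end]), end
        # Ordered choice: the second alternative is tried only if the
        # first one failed.
        if pos < len(text) and text[pos] == "(":
            inner = sum_(pos + 1)
            if inner is not None:
                value, nxt = inner
                if nxt < len(text) and text[nxt] == ")":
                    return value, nxt + 1
        return None

    @lru_cache(maxsize=None)
    def product(pos):
        first = atom(pos)
        if first is None:
            return None
        value, pos = first
        # Greedy repetition: ('*' Atom)*
        while pos < len(text) and text[pos] == "*":
            rest = atom(pos + 1)
            if rest is None:
                break
            value, pos = value * rest[0], rest[1]
        return value, pos

    @lru_cache(maxsize=None)
    def sum_(pos):
        first = product(pos)
        if first is None:
            return None
        value, pos = first
        # Greedy repetition: ('+' Product)*
        while pos < len(text) and text[pos] == "+":
            rest = product(pos + 1)
            if rest is None:
                break
            value, pos = value + rest[0], rest[1]
        return value, pos

    result = sum_(0)
    # Succeed only if the whole input was consumed.
    if result is None or result[1] != len(text):
        return None
    return result[0]


print(evaluate("2+3*4"))
print(evaluate("(2+3)*4"))
```

There is no separate tokeniser and no ambiguity to resolve here: the first 
alternative that matches wins, which is a large part of what makes PEGs 
straightforward to implement by hand.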

[1] Bryan Ford: Parsing expression grammars: a recognition-based syntactic 
foundation. POPL 2004: 111-122. http://bford.info/pub/lang/peg.pdf

[2] Bryan Ford: Packrat parsing: simple, powerful, lazy, linear time, 
functional pearl. ICFP 2002: 36-47. 
http://bford.info/pub/lang/packrat-icfp02.pdf

[3] http://strategoxt.org

[4] Lennart C. L. Kats, Eelco Visser, Guido Wachsmuth: Pure and declarative 
syntax definition: paradise lost and regained. OOPSLA 2010: 918-932. 
http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2010-019.pdf

--
Dr Peter M. Kelly

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
