For the past year or so I’ve been playing around with the idea of developing a programming language specifically for expressing translations between different formal languages (in the programming language sense, not natural languages).
A document format used for word processing or other office-style applications can be considered a formal language, in that it is possible to specify its structure in some formal notation such as BNF. ODF and OOXML both fit this criterion: they use XML, and their tree structure is defined by the RelaxNG schemas that are part of the specifications - there are copies of these in the ‘schemas’ directory in the Corinthia repository. The zip packaging complicates this a little, but it would be feasible to express that formally as well, in terms of what files are required in a package, how they relate to one another, and what schema each must conform to.

I believe that ODF, OOXML, and HTML are not the only file formats we should consider supporting in Corinthia - there are numerous others, including:

- Markdown [1] / MultiMarkdown [2] / CommonMark [3]
- reStructuredText [4] (popular in the Python world)
- AsciiDoc [5]
- RTF [6]
- LaTeX [7]
- Various wiki markup languages

All of these languages are difficult to parse manually, and really need to be tackled with a parser generator or other language processing toolkit. Let’s face it, C sucks for what we’re doing: it’s too low level, and we constantly have to take care of manual memory management, awkward string handling, and protecting against crashes that so often arise from NULL pointer dereferences, memory corruption, use-after-free and other bugs. The Objective-C version of the code was better, but still not ideal, and not cross-platform.

One programming language I’ve been especially inspired by in some past work is Stratego/XT [8]. It includes the following features:

- A grammar specification language (SDF), which can be used as input to parser generators that automatically construct abstract syntax trees, expressed in the ATerm format (similar to Lisp’s S-expressions).
- The Stratego language, which allows rewrite rules to be specified via pattern matching, in a notation similar to that used for the denotational semantics of programming languages. The language also lets you express strategies for how rewrite rules should be applied, such as pre-order or post-order traversal of the syntax tree, repeated attempts at rule application, and so forth; these strategies can be composed in arbitrary ways (see the sketch further below). The language is also a powerful functional programming language in its own right, and has many other features for analysing syntax trees to determine things like variable scopes.
- A pretty-printing language for concisely describing how an abstract syntax tree can be turned back into a text file in a given language, for later compilation or editing.

Sadly, Stratego/XT itself seems to have fallen into a state of disrepair, and many of the links on the website are broken (but can probably be dug up with a bit of work). Their publications page (http://strategoxt.org/Stratego/StrategoPublications) has plenty of good info though. They’ve now moved on to what is called the Spoofax Language Workbench [9] which, even more sadly, depends on both Java and Eclipse. I’m sure it’s possible to get the core language operating standalone, but I haven’t looked into it. Anyway, Stratego fundamentally changed the way I thought about the process of writing compilers, and opened my eyes to a whole world of possibilities in program transformation.
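To make the rewrite-rules-plus-strategies idea above a bit more concrete, here is a minimal sketch in Python - my own illustration, not Stratego syntax and not code from any existing implementation - of ATerm-like constructor terms, a single pattern-matching rewrite rule, and two composable strategies (a repeat combinator and a bottom-up traversal). The term names and the rule itself are invented purely for illustration.

from collections import namedtuple

# An ATerm-like constructor term: a constructor name plus a tuple of children.
# Leaves are plain Python values such as strings; namedtuple keeps terms immutable.
Term = namedtuple("Term", ["name", "args"])

def T(name, *args):
    return Term(name, args)

# A rewrite rule is a function that returns a new term when its pattern
# matches, or None when it does not (the analogue of a Stratego rule failing).
def collapse_emph(term):
    # Emph(Emph(x)) -> Emph(x): nested emphasis is redundant.
    if isinstance(term, Term) and term.name == "Emph" and term.args:
        child = term.args[0]
        if isinstance(child, Term) and child.name == "Emph":
            return T("Emph", *child.args)
    return None

def repeat(rule):
    # Strategy combinator: apply 'rule' at one node until it stops matching.
    def strategy(term):
        changed = False
        while True:
            result = rule(term)
            if result is None:
                return term if changed else None
            term, changed = result, True
    return strategy

def bottomup(rule, term):
    # Strategy: rebuild the children first (post-order), then try 'rule' here.
    if isinstance(term, Term):
        term = Term(term.name, tuple(bottomup(rule, a) for a in term.args))
    result = rule(term)
    return term if result is None else result

doc = T("Para", T("Emph", T("Emph", T("Emph", T("Text", "hello")))))
print(bottomup(repeat(collapse_emph), doc))
# Prints a term equivalent to Para(Emph(Text("hello"))): the redundant Emph
# wrappers are gone, and the rule never mentions Para, Text, or the traversal.

The point is just that the rule says nothing about how the tree is walked; the traversal is chosen separately, and strategies compose (repeat inside bottomup here), which is the property described above.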
I've found it useful in several different contexts, such as instrumenting code for performance measurement, and detecting plagiarism in student code submissions for CS assignments.

Another important development in the last decade or so has been Parsing Expression Grammars (PEGs) and packrat parsing [10]. The two key papers to read on these are [11] and [12], although Bryan Ford and others have published many more on the technique. PEGs are recognition-based grammars that combine lexical and syntactic analysis in a single formalism (unlike the separation seen in flex & bison). They are readily composable, and because alternatives are tried in order, they avoid many of the ambiguity problems inherent in context-free grammars, such as the “dangling else” problem. A good example of a PEG-based parser implementation is Grako [13], which is written in Python.
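To give a feel for what packrat parsing looks like in practice, here is a toy sketch in Python of a memoising recursive-descent parser for a tiny arithmetic PEG - again my own illustration, unrelated to Grako or to the prototypes mentioned below. The memo table keyed on (rule, position) is what keeps backtracking over ordered choices linear-time.

import functools
import re

# Grammar in PEG notation:
#   Expr   <- Term ("+" Term)*
#   Term   <- Factor ("*" Factor)*
#   Factor <- Number / "(" Expr ")"
#   Number <- [0-9]+
#
# Each rule takes a position and returns (value, new_position) on success,
# or None on failure.  Ordered choice means alternatives are tried left to right.

def memo(rule):
    # Packrat memoisation: cache each (rule, position) result so that
    # backtracking never re-parses the same stretch of input.
    @functools.wraps(rule)
    def wrapper(self, pos):
        key = (rule.__name__, pos)
        if key not in self.cache:
            self.cache[key] = rule(self, pos)
        return self.cache[key]
    return wrapper

class Parser:
    def __init__(self, text):
        self.text = text
        self.cache = {}

    @memo
    def expr(self, pos):
        result = self.term(pos)
        if result is None:
            return None
        value, pos = result
        while pos < len(self.text) and self.text[pos] == "+":
            nxt = self.term(pos + 1)
            if nxt is None:
                break
            rhs, pos = nxt
            value += rhs
        return value, pos

    @memo
    def term(self, pos):
        result = self.factor(pos)
        if result is None:
            return None
        value, pos = result
        while pos < len(self.text) and self.text[pos] == "*":
            nxt = self.factor(pos + 1)
            if nxt is None:
                break
            rhs, pos = nxt
            value *= rhs
        return value, pos

    @memo
    def factor(self, pos):
        # Ordered choice: Number / "(" Expr ")"
        m = re.match(r"[0-9]+", self.text[pos:])
        if m:
            return int(m.group()), pos + m.end()
        if pos < len(self.text) and self.text[pos] == "(":
            inner = self.expr(pos + 1)
            if inner and inner[1] < len(self.text) and self.text[inner[1]] == ")":
                return inner[0], inner[1] + 1
        return None

print(Parser("2*(3+4)").expr(0))   # -> (14, 7)

A real grammar would of course generate this kind of code (or interpret the grammar directly) rather than have it written by hand, but the structure - one function per rule, ordered choice, and the memo table - is the essence of the technique.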
Having seen these various things, and thinking about how to make the process of writing document format translators easier, I’m of the view that it would be immensely advantageous for us to have these concepts combined into a single language/runtime library. In principle, this could be developed independently of Corinthia, but I think it would be better done as part of the project, as it’s something we have a very direct application for - namely writing filters for the above-mentioned file formats, and possibly porting the OOXML and ODF filters across to it as well. It’s also been several years since I’ve had a chance to sink my teeth into some real programming language research, and I’m keen to get back to my roots.

I’ve done about seven or eight prototypes so far, in both C and more recently Python - so far addressing only the parsing and syntax tree construction aspects, without any thought yet as to the language itself. I’m currently spending a lot of time with the famous “Dragon Book” [14], reviewing the chapters on parsing and syntax-directed translation, and implementing the algorithms described in the text to solidify my understanding of how they work. I can share the code if you’re interested, though right now it’s all completely undocumented and nowhere near ready for use by others, and you’ll probably find either the Dragon Book itself or other online tutorials more useful. I should emphasise that it’s all throwaway code, and I intend to do a “proper” implementation from scratch within the Corinthia repository.

I’ve pretty much figured out what needs to be done for the parsing and syntax tree construction, and the next step is coming up with a suitable runtime representation that can be used both directly from C code and from the “Flat” language (whatever form it eventually takes), with potential bindings to other languages like Python and JavaScript. What I envisage here is an approach in which all such data structures are immutable, structured as ATerms (which Stratego uses, and which are like S-expressions), and have transformations expressed in a purely functional style, with Stratego-style pattern matching.

Our current runtime representation of data obtained from source files is XML, since this is what the currently-supported formats are based on. We could either stick with this, translating to and from the ATerm-style syntax where appropriate, or (and I would argue this is better) use a common type system for all transformations. The advantage of the latter is that all transformations could be subject to type checking, which means the Flat compiler could statically guarantee that the output produced by, say, an HTML to ODF translation fully conforms to the ODF schema, in the same way that a C or Java compiler does type checking. This would allow us to have formally-verified translations between file formats.

The transformations would not necessarily have to use code generation; we could alternatively implement them via an interpreter. Both could be supported, with back-ends that output either C code or bytecode understood by the interpreter. In either case, there will need to be runtime support consisting of a garbage collector and likely an exception handling mechanism; a debugger and a Python-style REPL would also be handy for development purposes.

As an example of the type of grammar I’m thinking of, see the following one for MultiMarkdown (but imagine it without the semantic actions, which are written in C):

https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg

I still have a lot of reading to do on many topics, and I’ve been meaning to start discussing this here on the list for a long time but have yet to get around to it. Sorry for dumping so much stuff at once (and I realise it’s pretty heavy), but at least now we can start to get the discussion going about what we want in a transformation language, and how we can make the development of new filters easier.

[1] http://daringfireball.net/projects/markdown/syntax
[2] http://fletcherpenney.net/multimarkdown/
[3] http://commonmark.org
[4] http://docutils.sourceforge.net/rst.html
[5] http://www.methods.co.nz/asciidoc/
[6] http://www.microsoft.com/en-US/Download/details.aspx?id=10725
[7] http://www.latex-project.org
[8] http://strategoxt.org
[9] http://strategoxt.org/Spoofax/WebHome
[10] http://bford.info/packrat/
[11] http://bford.info/pub/lang/packrat-icfp02.pdf
[12] http://bford.info/pub/lang/peg.pdf
[13] https://pypi.python.org/pypi/grako/3.6.1
[14] http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811

—
Dr Peter M. Kelly
pmke...@apache.org

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)