For the past year or so I’ve been playing around with the idea of developing a 
programming language specifically for expressing translations between different 
formal languages (in the programming language sense, not natural languages).

A document format used for word processing/other office-style applications can 
be considered a formal language, where it is possible to specify its structure 
in some formal notation, such as BNF. ODF and OOXML both meet this criterion, as 
they use XML and their tree structure is defined by the RelaxNG schemas that 
are part of the specifications - there are copies of these in the ‘schemas’ 
directory in the Corinthia repository. The zip-packaging complicates this a 
little, but it would be feasible to find a way to express this formally in 
terms of what files are required in a package, how they relate to one another, 
and what schema each must conform to.
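As a rough illustration of what I mean (the names and structure here are invented, and the checker is deliberately minimal), such package-level constraints could be written down as plain data that a tool checks a zip file against:

```python
# Hypothetical sketch of expressing an ODF package's structural requirements
# as data; a real spec would also say how the parts relate to one another and
# which RelaxNG schema each must conform to (stubbed here as a mapping).
import zipfile

ODF_PACKAGE = {
    "required": ["mimetype", "META-INF/manifest.xml", "content.xml"],
    "optional": ["styles.xml", "meta.xml", "settings.xml"],
    "schemas": {"content.xml": "OpenDocument-schema.rng"},  # validation not shown
}

def missing_parts(names, spec):
    """Given the set of file names in a package, list missing required parts."""
    return [f for f in spec["required"] if f not in names]

def check_package(path, spec):
    """Check a zip package on disk against the spec."""
    with zipfile.ZipFile(path) as zf:
        return missing_parts(set(zf.namelist()), spec)
```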

I believe that ODF, OOXML, and HTML are not the only file formats we should 
consider supporting in Corinthia - there are numerous others, including:

- Markdown [1] / MultiMarkdown [2] / CommonMark [3]
- reStructuredText [4] (popular in the Python world)
- AsciiDoc [5]
- RTF [6]
- LaTeX [7]
- Various Wiki markup languages

All of these languages are difficult to parse by hand, and are best tackled 
with a parser generator or other language-processing toolkit.

Let’s face it, C sucks for what we’re doing, as it’s too low level and we 
constantly have to deal with manual memory management, awkward string 
handling, and guarding against crashes that so often arise from NULL pointer 
dereferences, memory corruption, use-after-free and other bugs. The 
Objective-C version of the code was better, but still not ideal, and not 
cross-platform.

One programming language I’ve been especially inspired by in some past work 
I’ve done is Stratego/XT [8]. It includes the following features:

- A grammar specification language (SDF), which can be fed to parser 
generators that automatically construct abstract syntax trees, expressed in 
the ATerm format (which is similar to Lisp’s S-expressions).

- The Stratego language, which allows specification of rewrite rules based on 
pattern matching, in a notation similar to that used to express the 
denotational semantics of programming languages. The language also includes the 
ability to express strategies for how rewrite rules should be applied, such as 
pre-order or post-order traversal of the syntax tree, repeated attempts at rule 
application, and so forth; these strategies can be composed in arbitrary ways. 
The language is also a powerful functional programming language in its own 
right, and has many other features for analysis of syntax trees to determine 
things like variable scopes.

- A pretty-printing language for concisely describing how an abstract syntax 
tree can be turned back into a text file in a given language, for later 
compilation or editing.
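To give a concrete flavour of the rewrite-rules-plus-strategies idea (in Python rather than Stratego syntax, so take it purely as a sketch of the concept): terms are nested tuples, a rule either rewrites a term or returns None, and a strategy decides where and in what order rules get applied:

```python
# Minimal sketch of Stratego-style rewriting: terms are nested tuples such as
# ("Plus", ("Int", 1), ("Int", 2)); a rule rewrites a term or returns None.

def eval_plus(term):
    """Rewrite rule: Plus(Int(a), Int(b)) -> Int(a+b)."""
    if (isinstance(term, tuple) and len(term) == 3 and term[0] == "Plus"
            and isinstance(term[1], tuple) and term[1][0] == "Int"
            and isinstance(term[2], tuple) and term[2][0] == "Int"):
        return ("Int", term[1][1] + term[2][1])
    return None

def bottomup(rule, term):
    """Strategy: try the rule at every node, children first (post-order)."""
    if isinstance(term, tuple):
        term = (term[0],) + tuple(bottomup(rule, c) for c in term[1:])
    rewritten = rule(term)
    return term if rewritten is None else rewritten
```

Applying eval_plus bottom-up evaluates a whole expression tree in one pass; swapping in a top-down or repeat-until-fixpoint strategy would require no change to the rule itself, which is the key payoff of separating rules from strategies.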

Sadly, Stratego/XT itself seems to have fallen into a state of disrepair, and 
many of the links on the website are broken (but can probably be dug up with a 
bit of work). Their publications page 
(http://strategoxt.org/Stratego/StrategoPublications) has plenty of good info 
though. They’ve now moved on to the Spoofax Language Workbench [9] 
which, even more sadly, depends on both Java and Eclipse. I’m sure it’s 
possible to get the core language operating standalone, but I haven’t looked 
into it.

Anyway, Stratego fundamentally changed the way I thought about the process of 
writing compilers, and opened my eyes to a whole world of possibilities in 
program transformation. I've found it useful in several different contexts, 
such as instrumenting code for performance measurement, and detecting 
plagiarism in student code submissions for CS assignments.

Another important development in the last decade or so has been Parsing 
Expression Grammars (PEGs), and packrat parsing [10]. The two key papers you 
want to read on these are [11] and [12], although Bryan Ford and others have 
published many other papers on the technique. PEGs are recognition-based, 
scannerless grammars that combine lexical and syntactic analysis in a single 
formalism (unlike the lexer/parser split of flex & bison). They are readily 
composable, and avoid many of the ambiguity 
problems inherent with context-free grammars, such as the “dangling else” 
problem. A good example of a PEG-based parser implementation is Grako [13], 
which is implemented in Python.
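For anyone who hasn’t met packrat parsing before, the essence fits in a few lines of Python: recursive descent where every (rule, position) result is memoised, giving linear-time parsing even with backtracking. A toy grammar, Sum <- Num ('+' Num)*, as a sketch:

```python
from functools import lru_cache

# Toy packrat parser for the PEG:  Sum <- Num ('+' Num)* ;  Num <- [0-9]+
# Each rule returns (value, next_position) on success or None on failure.

def parse_sum(text):
    @lru_cache(maxsize=None)
    def num(pos):
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return (int(text[pos:end]), end) if end > pos else None

    @lru_cache(maxsize=None)
    def total(pos):
        r = num(pos)
        if r is None:
            return None
        value, pos = r
        while pos < len(text) and text[pos] == "+":
            r = num(pos + 1)
            if r is None:
                return None   # PEG semantics: '+' must be followed by a Num
            v, pos = r
            value += v
        return (value, pos)

    r = total(0)
    return r[0] if r is not None and r[1] == len(text) else None
```

The lru_cache decorators are what make this “packrat”: without memoisation, the same rule can be re-tried at the same input position exponentially often under backtracking.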

Having seen these various things, and thinking about how to make the process of 
writing document format translators easier, I’m of the view that it would be 
immensely advantageous for us to have these concepts combined into a single 
language/runtime library. In principle, this could be developed independently 
of Corinthia, but I think would be better done as part of the project as it’s 
something which we have a very direct application for, namely writing filters 
for the above-mentioned file formats, and possibly porting the existing OOXML 
and ODF filters across as well. It’s also been several years since I’ve had a 
chance to sink my teeth into some real programming language research, and I’m 
keen to get back to my roots.

I’ve done about seven or eight prototypes so far, first in C and more recently 
in Python, addressing only the parsing and syntax tree construction aspects, 
without any thought yet as to the language itself. I’m currently 
spending a lot of time with the famous “Dragon Book” [14], reviewing the chapters on 
parsing and syntax-directed translation, and implementing the algorithms 
described in the text to solidify my understanding of how they work. I can 
share the code if you’re interested, though right now it’s all completely 
undocumented and nowhere near ready for use by others, and you’ll probably find 
either the Dragon Book itself or other online tutorials to be more useful. I 
should emphasise that it’s all throwaway code, and I intend to do a “proper” 
implementation from scratch within the Corinthia repository.

I’ve pretty much figured out what needs to be done for the parsing and syntax 
tree construction, and the next step is coming up with a suitable runtime 
representation that can be used both directly from C code and from the “Flat” 
language (whatever form it eventually takes), and potentially bindings to other 
languages like Python and JavaScript. What I envisage here is an approach in 
which all such data structures are immutable, structured as ATerms (as in 
Stratego), and have transformations expressed in a purely functional style, 
with Stratego-style pattern matching.
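To make “immutable, ATerm-structured” concrete: in a prototype, constructor applications can simply be nested tuples, which Python makes immutable and hashable (so subterms can be shared), plus a small printer producing ATerm-style text. The details here are my sketch, not a settled design:

```python
# Sketch of an ATerm-like immutable representation: constructor applications
# as (name, args...) tuples, rendered in ATerm's f(a,b) notation.

def aterm_to_str(term):
    """Render a nested-tuple term as ATerm-style text, e.g. Plus(Int(1),Int(2))."""
    if isinstance(term, tuple):
        name, args = term[0], term[1:]
        if not args:
            return name                      # nullary constructor, e.g. Nil
        return "%s(%s)" % (name, ",".join(aterm_to_str(a) for a in args))
    if isinstance(term, str):
        return '"%s"' % term                 # string literal
    return str(term)                         # integer literal
```

Because tuples are immutable, a transformation never mutates its input; it builds a new tree, reusing unchanged subterms by reference, which fits the purely functional style described above.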

Our current runtime representation of data obtained from source files is XML, 
since this is what the currently-supported formats are based on. We could 
either stick with this, translating to and from the ATerm-style syntax where 
appropriate, or (and I would argue this is better) use a common type system 
for all transformations. The advantage of the latter is that all 
transformations could be subject to type checking, which means the Flat 
compiler would statically guarantee that the output produced by e.g. an HTML 
to ODF translation fully conforms to the ODF schema, the same way that a C/Java 
compiler does type checking. This would allow us to have formally-verified 
translations between file formats.
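A toy version of what “the compiler guarantees schema conformance” could look like mechanically: each constructor declares the sort it produces and the sorts of its children, and a checker verifies a whole tree against those declarations. (Real RelaxNG content models are far richer than this, and all the names below are invented.)

```python
# Toy illustration of schema conformance as type checking: each constructor
# declares its result sort and its children's sorts.

SCHEMA = {
    # constructor: (result sort, child sorts)
    "Doc":  ("doc",  ["para"]),    # Doc takes any number of para children
    "Para": ("para", ["inline"]),
    "Text": ("inline", []),        # leaf: carries a string payload instead
}

def conforms(term, sort):
    """Check that a nested-tuple term produces the given sort."""
    name, children = term[0], term[1:]
    result, child_sorts = SCHEMA[name]
    if result != sort:
        return False
    if not child_sorts:            # leaf constructor: payload, not sub-terms
        return all(isinstance(c, str) for c in children)
    return all(conforms(c, child_sorts[0]) for c in children)
```

Here a Para inside a Doc checks out, while a Text placed directly under Doc is rejected because Text produces sort "inline", not "para" - the same kind of guarantee, applied to whole translations, is what the common type system would buy us.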

The transformations would not necessarily have to use code generation; we could 
alternatively implement them via an interpreter. Both could be supported, with 
back-ends that emit either C code or bytecode understood by the 
interpreter. In either case, there will need to be runtime support consisting 
of a garbage collector and likely an exception handling mechanism; a debugger 
and a Python-style REPL interface would also be handy for development purposes.

As an example of the type of grammar I’m thinking of, see the following one for 
MultiMarkdown (but imagine it without the semantic actions, which are written 
in C):

https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg

I still have a lot of reading to do on many topics, and I’ve been meaning to 
start discussing it here on the list for a long time but have yet to get around 
to it. Sorry for dumping so much stuff at once (and I realise it’s pretty 
heavy), but at least now we can start to get the discussion going about what we 
want in a transformation language, and how we can make the development of new 
filters easier.

[1] http://daringfireball.net/projects/markdown/syntax
[2] http://fletcherpenney.net/multimarkdown/
[3] http://commonmark.org
[4] http://docutils.sourceforge.net/rst.html
[5] http://www.methods.co.nz/asciidoc/
[6] http://www.microsoft.com/en-US/Download/details.aspx?id=10725
[7] http://www.latex-project.org
[8] http://strategoxt.org
[9] http://strategoxt.org/Spoofax/WebHome
[10] http://bford.info/packrat/
[11] http://bford.info/pub/lang/packrat-icfp02.pdf
[12] http://bford.info/pub/lang/peg.pdf
[13] https://pypi.python.org/pypi/grako/3.6.1
[14] http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811

—
Dr Peter M. Kelly
pmke...@apache.org

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
