Thanks Peter, On Tue, Jun 16, 2015 at 2:16 AM, Peter Kelly <pmke...@apache.org> wrote:
> For the past year or so I’ve been playing around with the idea of > developing a programming language specifically for expressing translations > between different formal languages (in the programming language sense, not > natural languages). > > A document format used for word processing/other office-style applications > can be considered a formal language, where it is possible to specify its > structure in some formal notation, such as BNF. ODF and OOXML both fit this > criteria, as they use XML and their tree structure is defined by the > RelaxNG schemas that are part of the specifications - there are copies of > these in the ‘schemas’ directory in the Corinthia repository. The > zip-packaging complicates this a little, but it would be feasible to find a > way to express this formally in terms of what files are required in a > package, how they relate to one another, and what schema each must conform > to. > > I believe that ODF, OOXML, and HTML are not the only file formats we > should consider supporting in Corinthia - there are numerous others, > including: > > - Markdown [1] / MultiMarkdown [2] / CommonMark [3] > - reStruturedText [4] (popular in the Python world) > - AsciiDoc [5] > - RTF [6] > - LaTeX [7] > - Various Wiki markup languages > > All of these languages are difficult to parse manually, and really need to > be tackled through the use of a parser generator or other language > processing toolkit. > > Let’s face it, C sucks for what we’re doing, as it’s too low level and we > constantly have to take care of manual memory management, awkward string > handling, and protecting against crashes that can so often arise due to > NULL pointer dereferences, memory corruption, use-after-free and other > bugs. The Objective C version of the code was better, but still not ideal, > and not cross platform. > > One programming language I’ve been especially inspired by in some past > work I’ve done is Stratego/XT [8]. It includes the following features: > > - A grammar specification language (SDF), which can be used as input to > parser generators, which will automatically construct abstract syntax > trees, expressed in the ATerm format (which is similar to Lisp’s > S-expressions). > > - The Stratego language, which allows specification of rewrite rules based > on pattern matching, in a notation similar to that used to express the > denotational semantics of programming languages. The language also includes > the ability to express strategies for how rewrite rules should be applied, > such as pre-order or post-order traversal of the syntax tree, repeated > attempts at rule application, and so forth; these strategies can be > composed in arbitrary ways. The language is also a powerful functional > programming language in it’s own right, and has many other features for > analysis of syntax trees to determine things like variable scopes. > > - A pretty-printing language for concisely describing how an abstract > syntax tree can be turned back into a text file in a given language, for > later compilation or editing. > > Sadly, Stratego/XT itself seems to have fallen into a state of disrepair, > and many of the links on the website are broken (but can probably be dug up > with a bit of work). Their publications page ( > http://strategoxt.org/Stratego/StrategoPublications < > http://strategoxt.org/Stratego/StrategoPublications>) has plenty of good > info though. They’ve now moved on to what is called the Spoofax Language > Workbench which, even more sadly, depends on both Java and Eclipse. I’m > sure it’s possible to get the core language operating standalone, but I > haven’t looked into it. > > Anyway, Stratego fundamentally changed the way I thought about the process > of writing compilers, and opened my eyes to a whole world of possibilities > in program transformation. I've found it useful in several different > contexts, such as instrumenting code for performance measurement, and > detecting plagiarism in student code submissions for CS assignments. > > Another important development in the last decade or so has been Parsing > Expression Grammars (PEGs), and packrat parsing [10]. The two key papers > you want to read on these are [11] and [12], although Bryan Ford and others > have published many other papers on the technique. PEGs are context-aware > grammars that combine both lexical and syntactic analysis (unlike the > separation seen in flex & bison). They are readily composable, and avoid > many of the ambiguity problems inherent with context-free grammars, such as > the “dangling else” problem. A good example of a PEG-based parser > implementation is Grako [13], which is implemented in Python. > > Having seen these various things, and thinking about how to make the > process of writing document format translators easier, I’m of the view that > it would be immensely advantageous for us to have these concepts combined > into a single language/runtime library. In principle, this could be > developed independently of Corinthia, but I think would be better done as > part of the project as it’s something which we have a very direct > application for, namely writing filters for the above-mentioned file > formats, and possibly porting the OOXML filter and ODF filter across to as > well. It’s also been several years since I’ve had a chance to sink my teeth > into some real programming language research, and I’m keen to get back to > my roots. > > I’ve done about seven or eight prototypes so far, in both C and more > recently Python - so far addressing only the parsing and syntax tree > construction aspects, without any thought yet as to the language itself. > I’m currently spending a lot of time with the famous “Dragon Book”, > reviewing the chapters on parsing and syntax-directed translation, and > implementing the algorithms described in the text to solidify my > understanding of how they work. I can share the code if you’re interested, > though right now it’s all completely undocumented and nowhere near ready > for use by others, and you’ll probably find either the dragon book itself > or other online tutorials to be more useful. I should emphasis that it’s > all throwaway code, and I intend on doing a “proper” implementation from > scratch within the Corinthia repository. > > I’ve pretty much figured out what needs to be done for the parsing and > syntax tree construction, and the next step is coming up with a suitable > runtime representation, that can be used both directly from C code, and > from the “Flat” language (whatever form it eventually takes), and > potentially bindings to other languages like Python and JavaScript. What I > envisage here is an approach in which all such data structures are > immutable, structured using ATerms (which Stratego uses, and are like > S-expressions), and have transformations expressed in a purely functional > style, with Stratego-style pattern matching. > > Our current runtime representation of data obtained from source files is > XML, since this is what the currently-supported formats are based on. We > could either stick with this, translating to and from the ATerm-style > syntax where appropriate, or (and I would argue this is better), to use a > common type system for all transformations. The advantage of the latter is > that all transformations could be subject to type checking, which means the > Flat compiler would statically guarantee that the output produced by e.g. a > HTML to ODF translation fully conforms to the ODF schema, the same way that > a C/Java compiler does type checking. This would allow us to have > formally-verified translations between file formats. > > The transformations would not necessarily have to use code generation; we > could alternatively implement them via an interpreter. Both could be > supported, with back-ends that output in C code, or bytecode that is > understood by the interpreter. In either case, there will need to be > runtime support consisting of a garbage collector and likely an exception > handling mechanism; a debugger and a Python-style REPL interface would also > be handy for development purposes. > > As an example of the type of grammar I’m thinking of, see the following > one for MultiMarkdown (but imagine it without the semantic actions, which > are written in C): > > https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg < > https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg> > > I still have a lot of reading to do on many topics, and I’ve been meaning > to start discussing it here on the list for a long time but have yet to get > around to it. Sorry for dumping so much stuff at once (and I realise it’s > pretty heavy), but at least now we can start to get the discussion going > about what we want in a transformation language, and how we can make the > development of new filters easier. > Head exploded now :-) Whilst a lot of the detail went over my head I have been working with ANTLR for the past few years and found 11 and 12 interesting. But the heavy maths leaves me high and dry. Notionally though I do think we should/could express the mapping between elements in such a way that the translations could be generated, without missing cases. Or at least unknowingly missing cases. My work in my day job with the compiler I work on was the feeder for the tool I developed to track how much of a schema is used and how. Which seems to me to be very similar problem. Instead of just tracking we map or generate a new representation. > > [1] http://daringfireball.net/projects/markdown/syntax < > http://daringfireball.net/projects/markdown/syntax> > [2] http://fletcherpenney.net/multimarkdown/ < > http://fletcherpenney.net/multimarkdown/> > [3] http://commonmark.org <http://commonmark.org/> > [4] http://docutils.sourceforge.net/rst.html < > http://docutils.sourceforge.net/rst.html> > [5] http://www.methods.co.nz/asciidoc/ <http://www.methods.co.nz/asciidoc/ > > > [6] http://www.microsoft.com/en-US/Download/details.aspx?id=10725 < > http://www.microsoft.com/en-US/Download/details.aspx?id=10725> > [7] http://www.latex-project.org <http://www.latex-project.org/> > [8] http://strategoxt.org <http://strategoxt.org/> > [9] http://strategoxt.org/Spoofax/WebHome < > http://strategoxt.org/Spoofax/WebHome> > [10] http://bford.info/packrat/ <http://bford.info/packrat/> > [11] http://bford.info/pub/lang/packrat-icfp02.pdf < > http://bford.info/pub/lang/packrat-icfp02.pdf> > [12] http://bford.info/pub/lang/peg.pdf < > http://bford.info/pub/lang/peg.pdf> > [13] https://pypi.python.org/pypi/grako/3.6.1 < > https://pypi.python.org/pypi/grako/3.6.1> > [14] > http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811 > < > http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811 > > > > — > Dr Peter M. Kelly > pmke...@apache.org > > PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key> > (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966) > > -- Cheers, Ian C