Hi I like the idea about formal language translation, it is commonly used between compilers (I used the principle years ago writing a basic->C cross compiler).
However I have a feeling that the exceptions (maybe not the right word) in OOXML and ODF makes a formal language converter next to impossible. F.x. converting tables from ODF to OOXML would be a piece of cake, but going from OOXML to ODF you first need to detect if it is a table at all or just styled paragraphs. I do not want to sound negative, but I do see theoritical problems because the formats can be used in so many different ways (unlike the compiler case, where the compiler itself is the ultimate judge). rgds jan i On Tuesday, June 16, 2015, Ian C <i...@amham.net> wrote: > Thanks Peter, > > On Tue, Jun 16, 2015 at 2:16 AM, Peter Kelly <pmke...@apache.org > <javascript:;>> wrote: > > > For the past year or so I’ve been playing around with the idea of > > developing a programming language specifically for expressing > translations > > between different formal languages (in the programming language sense, > not > > natural languages). > > > > A document format used for word processing/other office-style > applications > > can be considered a formal language, where it is possible to specify its > > structure in some formal notation, such as BNF. ODF and OOXML both fit > this > > criteria, as they use XML and their tree structure is defined by the > > RelaxNG schemas that are part of the specifications - there are copies of > > these in the ‘schemas’ directory in the Corinthia repository. The > > zip-packaging complicates this a little, but it would be feasible to > find a > > way to express this formally in terms of what files are required in a > > package, how they relate to one another, and what schema each must > conform > > to. > > > > I believe that ODF, OOXML, and HTML are not the only file formats we > > should consider supporting in Corinthia - there are numerous others, > > including: > > > > - Markdown [1] / MultiMarkdown [2] / CommonMark [3] > > - reStruturedText [4] (popular in the Python world) > > - AsciiDoc [5] > > - RTF [6] > > - LaTeX [7] > > - Various Wiki markup languages > > > > All of these languages are difficult to parse manually, and really need > to > > be tackled through the use of a parser generator or other language > > processing toolkit. > > > > Let’s face it, C sucks for what we’re doing, as it’s too low level and we > > constantly have to take care of manual memory management, awkward string > > handling, and protecting against crashes that can so often arise due to > > NULL pointer dereferences, memory corruption, use-after-free and other > > bugs. The Objective C version of the code was better, but still not > ideal, > > and not cross platform. > > > > One programming language I’ve been especially inspired by in some past > > work I’ve done is Stratego/XT [8]. It includes the following features: > > > > - A grammar specification language (SDF), which can be used as input to > > parser generators, which will automatically construct abstract syntax > > trees, expressed in the ATerm format (which is similar to Lisp’s > > S-expressions). > > > > - The Stratego language, which allows specification of rewrite rules > based > > on pattern matching, in a notation similar to that used to express the > > denotational semantics of programming languages. The language also > includes > > the ability to express strategies for how rewrite rules should be > applied, > > such as pre-order or post-order traversal of the syntax tree, repeated > > attempts at rule application, and so forth; these strategies can be > > composed in arbitrary ways. The language is also a powerful functional > > programming language in it’s own right, and has many other features for > > analysis of syntax trees to determine things like variable scopes. > > > > - A pretty-printing language for concisely describing how an abstract > > syntax tree can be turned back into a text file in a given language, for > > later compilation or editing. > > > > Sadly, Stratego/XT itself seems to have fallen into a state of disrepair, > > and many of the links on the website are broken (but can probably be dug > up > > with a bit of work). Their publications page ( > > http://strategoxt.org/Stratego/StrategoPublications < > > http://strategoxt.org/Stratego/StrategoPublications>) has plenty of good > > info though. They’ve now moved on to what is called the Spoofax Language > > Workbench which, even more sadly, depends on both Java and Eclipse. I’m > > sure it’s possible to get the core language operating standalone, but I > > haven’t looked into it. > > > > Anyway, Stratego fundamentally changed the way I thought about the > process > > of writing compilers, and opened my eyes to a whole world of > possibilities > > in program transformation. I've found it useful in several different > > contexts, such as instrumenting code for performance measurement, and > > detecting plagiarism in student code submissions for CS assignments. > > > > Another important development in the last decade or so has been Parsing > > Expression Grammars (PEGs), and packrat parsing [10]. The two key papers > > you want to read on these are [11] and [12], although Bryan Ford and > others > > have published many other papers on the technique. PEGs are context-aware > > grammars that combine both lexical and syntactic analysis (unlike the > > separation seen in flex & bison). They are readily composable, and avoid > > many of the ambiguity problems inherent with context-free grammars, such > as > > the “dangling else” problem. A good example of a PEG-based parser > > implementation is Grako [13], which is implemented in Python. > > > > Having seen these various things, and thinking about how to make the > > process of writing document format translators easier, I’m of the view > that > > it would be immensely advantageous for us to have these concepts combined > > into a single language/runtime library. In principle, this could be > > developed independently of Corinthia, but I think would be better done as > > part of the project as it’s something which we have a very direct > > application for, namely writing filters for the above-mentioned file > > formats, and possibly porting the OOXML filter and ODF filter across to > as > > well. It’s also been several years since I’ve had a chance to sink my > teeth > > into some real programming language research, and I’m keen to get back to > > my roots. > > > > I’ve done about seven or eight prototypes so far, in both C and more > > recently Python - so far addressing only the parsing and syntax tree > > construction aspects, without any thought yet as to the language itself. > > I’m currently spending a lot of time with the famous “Dragon Book”, > > reviewing the chapters on parsing and syntax-directed translation, and > > implementing the algorithms described in the text to solidify my > > understanding of how they work. I can share the code if you’re > interested, > > though right now it’s all completely undocumented and nowhere near ready > > for use by others, and you’ll probably find either the dragon book itself > > or other online tutorials to be more useful. I should emphasis that it’s > > all throwaway code, and I intend on doing a “proper” implementation from > > scratch within the Corinthia repository. > > > > I’ve pretty much figured out what needs to be done for the parsing and > > syntax tree construction, and the next step is coming up with a suitable > > runtime representation, that can be used both directly from C code, and > > from the “Flat” language (whatever form it eventually takes), and > > potentially bindings to other languages like Python and JavaScript. What > I > > envisage here is an approach in which all such data structures are > > immutable, structured using ATerms (which Stratego uses, and are like > > S-expressions), and have transformations expressed in a purely functional > > style, with Stratego-style pattern matching. > > > > Our current runtime representation of data obtained from source files is > > XML, since this is what the currently-supported formats are based on. We > > could either stick with this, translating to and from the ATerm-style > > syntax where appropriate, or (and I would argue this is better), to use a > > common type system for all transformations. The advantage of the latter > is > > that all transformations could be subject to type checking, which means > the > > Flat compiler would statically guarantee that the output produced by > e.g. a > > HTML to ODF translation fully conforms to the ODF schema, the same way > that > > a C/Java compiler does type checking. This would allow us to have > > formally-verified translations between file formats. > > > > The transformations would not necessarily have to use code generation; we > > could alternatively implement them via an interpreter. Both could be > > supported, with back-ends that output in C code, or bytecode that is > > understood by the interpreter. In either case, there will need to be > > runtime support consisting of a garbage collector and likely an exception > > handling mechanism; a debugger and a Python-style REPL interface would > also > > be handy for development purposes. > > > > As an example of the type of grammar I’m thinking of, see the following > > one for MultiMarkdown (but imagine it without the semantic actions, which > > are written in C): > > > > https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg < > > https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg> > > > > I still have a lot of reading to do on many topics, and I’ve been meaning > > to start discussing it here on the list for a long time but have yet to > get > > around to it. Sorry for dumping so much stuff at once (and I realise it’s > > pretty heavy), but at least now we can start to get the discussion going > > about what we want in a transformation language, and how we can make the > > development of new filters easier. > > > > Head exploded now :-) Whilst a lot of the detail went over my head I have > been working with ANTLR for the past few years and found 11 and 12 > interesting. But the heavy maths leaves me high and dry. Notionally though > I do think we should/could express the mapping between elements in such a > way that the translations could be generated, without missing cases. Or at > least unknowingly missing cases. > > My work in my day job with the compiler I work on was the feeder for the > tool I developed to track how much of a schema is used and how. Which seems > to me to be very similar problem. Instead of just tracking we map or > generate a new representation. > > > > > [1] http://daringfireball.net/projects/markdown/syntax < > > http://daringfireball.net/projects/markdown/syntax> > > [2] http://fletcherpenney.net/multimarkdown/ < > > http://fletcherpenney.net/multimarkdown/> > > [3] http://commonmark.org <http://commonmark.org/> > > [4] http://docutils.sourceforge.net/rst.html < > > http://docutils.sourceforge.net/rst.html> > > [5] http://www.methods.co.nz/asciidoc/ < > http://www.methods.co.nz/asciidoc/ > > > > > [6] http://www.microsoft.com/en-US/Download/details.aspx?id=10725 < > > http://www.microsoft.com/en-US/Download/details.aspx?id=10725> > > [7] http://www.latex-project.org <http://www.latex-project.org/> > > [8] http://strategoxt.org <http://strategoxt.org/> > > [9] http://strategoxt.org/Spoofax/WebHome < > > http://strategoxt.org/Spoofax/WebHome> > > [10] http://bford.info/packrat/ <http://bford.info/packrat/> > > [11] http://bford.info/pub/lang/packrat-icfp02.pdf < > > http://bford.info/pub/lang/packrat-icfp02.pdf> > > [12] http://bford.info/pub/lang/peg.pdf < > > http://bford.info/pub/lang/peg.pdf> > > [13] https://pypi.python.org/pypi/grako/3.6.1 < > > https://pypi.python.org/pypi/grako/3.6.1> > > [14] > > > http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811 > > < > > > http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811 > > > > > > > — > > Dr Peter M. Kelly > > pmke...@apache.org <javascript:;> > > > > PGP key: http://www.kellypmk.net/pgp-key < > http://www.kellypmk.net/pgp-key> > > (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966) > > > > > -- > Cheers, > > Ian C > -- Sent from My iPad, sorry for any misspellings.