tuxji opened a new pull request #422:
URL: https://github.com/apache/incubator-daffodil/pull/422
This pull request adds a new runtime2 backend to Daffodil. This pull
request is only a baby step which can handle only 32-bit big-endian
binary integers, but it implements enough functionality to serve as a
start for discussion and collaboration. This pull request's
integration branch is a work in progress and will be rebased over time
to keep up with the master branch so you will have to run the
following commands every time you want to pull new changes into your
checkout of the integration branch:
git pull --rebase
git diff ORIG_HEAD..HEAD
Please use TestRuntime2.dfdl.xsd and TestRuntime2.tdml as a guide of
how to write a small DFDL schema containing some xs:int elements and
run daffodil on a corresponding TDML file to parse and unparse some
binary test data using the runtime2 backend:
```bash
$ daffodil test daffodil-runtime2/src/test/resources/org/apache/daffodil/runtime2/TestRuntime2.tdml
Creating DFDL Test Suite for daffodil-runtime2/src/test/resources/org/apache/daffodil/runtime2/TestRuntime2.tdml
[Pass] parse_int32
[Pass] unparse_int32
Total: 2, Pass: 2, Fail: 0, Not Found: 0
```
The runtime2 backend will generate C code from your DFDL schema in a
temporary directory (/tmp/NNNNNNNNNNNN with NNN... all numbers), build
a C-based executable, run the TDML tests through the executable, and
check the results. You will be able to take the C code and use it as
a parsing/unparsing library in an embedded device with limited memory
and power. Our goal is to implement the smallest possible conforming
subset of DFDL as described in the "Runtime 2 Design" table of
<https://cwiki.apache.org/confluence/display/DAFFODIL/WIP%3A+Daffodil+Runtime+2>.
Note that you will need a C11 or C18 compiler ("cc") and a tiny XML
library called Mini-XML ("mxml.h" and "libmxml.a") to build Daffodil
on this branch and run the TDML tests. Many systems have both of
these available as installable binary packages with names like "gcc",
"libmxml-dev", etc. If there is enough demand, we can bundle the
Mini-XML library sources with Daffodil to make it easier to build the
runtime2 backend on systems that don't have an installable
"libmxml-dev" package. We also have considered making runtime2 (and a
possible future runtime3 targeting programmable hardware logic)
optional parts that can be distributed separately and plugged into
Daffodil on demand (for modularity, not for legal purposes; we want
all Daffodil code to be covered by the Apache 2 license).
DAFFODIL-2202
______________________________________________________________________
Questions & loose ends:
1. I just found out that the os-lib library author has not
published Scala 2.11 builds since March 2019. I hadn't known that
until I enabled Daffodil's GitHub Actions CI workflow in my fork and
saw the compilation problems. I've opened a GitHub issue asking the
os-lib author if he would consider publishing Scala 2.11 builds again
for a period of time until Daffodil stops supporting Scala 2.11, but
if he says no, should I replace all my calls of the os-lib library or
make Daffodil start supporting Scala 2.13 and stop supporting 2.11
like the os-lib author has done?
2. Search for "TODO" in the changelog below to find some more loose
ends and questions:
a. In daffodil-core's SequenceCombinator.scala, generateCode needs
to support generating code for each child in a sequence with multiple
children.
b. In daffodil-core's
runtime2/generators/BinaryNumberParserGenerator.scala, generateCode
needs to generate code that parses and unparses binary numbers as
securely as possible. We may need to develop some C functions that
are more secure than what the current code uses.
c. In daffodil-core's runtime2/generators/ParserGenerator.scala,
defineQNameInit doesn't handle multiple xmlns=ns declarations yet.
d. Do we need a "runtime" tunable as well as a "tdmlImplementation"
tunable within Daffodil? The "tdmlImplementation" tunable allows TDML
tests to use runtime2 instead of runtime1. The "runtime" tunable
could allow "daffodil parse ..." and "daffodil unparse ..." commands
to use runtime2 instead of runtime1 too, but that seems redundant
because the runtime2 backend dynamically generates and builds an
executable from C code and calls the C-based executable with similar
"daffodil parse ..." and "daffodil unparse ..." command lines as well.
I think it makes more sense to add a new "daffodil generate ..."
command which will generate the C code and executable in your own
directory so you can do whatever you want with the C code or
executable afterwards.
e. Should we define a pluggable CodeGeneratorState interface for
runtime2 to implement in daffodil-runtime1's DFDLParserUnparser.scala?
We may want to wait for a future runtime3 in order to make the common
interface clearer.
f. In daffodil-runtime2's stack.{c,h}, we may need to use
heap-allocated storage rather than statically allocated storage even
though we use the stack only to run TDML tests.
3. The generated_code.{c,h} files checked in here were originally
mockups made in Emacs & Visual Studio Code for design and debugging
purposes. Depending on whether your DFDL schema is the same or
different, runtime2 will generate a pair of generated_code.{c,h} files
in a temporary directory with the same or different contents than
these mockup files. We are checking in these files only to make it
easier to continue changing and debugging the mockup code in Visual
Studio Code.
______________________________________________________________________
ChangeLog:
In .github/workflows/main.yml, install MSYS2 environment to give us a
C compiler on Windows. Install mxml library on both Linux and Windows
so we can build/link our C code with it.
In build.sbt, build daffodil-runtime2 module with CcPlugin configured
to compile the runtime2 C source files into a "libruntime2.a" static
library.
In daffodil-cli/bin.NOTICE, fix an attribution notice.
In daffodil-cli/build.sbt, configure the Universal plugin to include
the runtime2 C header files and "libruntime2.a" library in an
installed daffodil so that an installed daffodil can use the runtime2
backend. See also the code in GeneratedCodeCompiler which looks for
the runtime2 C header and "libruntime2.a" library in either an
installed daffodil location or the daffodil source tree depending on
where and how the code is executed. We are using a "simplest design
that can work" approach until future requirements become clearer.
In daffodil-core's Compiler.scala, add a ProcessorFactory
generateCode method to call generateCode on a root document and
return a fully populated CodeGeneratorState object containing
generated C code.
In daffodil-core's ElementDeclGrammarMixin.scala, add a
RootGrammarMixin generateCode method to call generateCode on its
document element.
In daffodil-core's Grammar.scala, add a SeqComp generateCode method to
call generateCode on its children.
In daffodil-core's GrammarTerm.scala, give abstract class Gram a
GramRuntime2Mixin trait as well as a GramRuntime1Mixin trait.
In daffodil-core's Production.scala, add a Prod generateCode method to
call generateCode on its gram object.
In daffodil-core's ElementCombinator.scala, add an ElementCombinator
generateCode method to call generateCode on its subComb object, add
empty CaptureContentLengthStart, CaptureContentLengthEnd,
CaptureValueLengthStart, and CaptureValueLengthEnd generateCode
methods, and add an ElementParseAndUnspecifiedLength generateCode
method to instantiate and call generateCode on an
ElementParserGenerator.
In daffodil-core's PrimitivesBinaryNumber.scala, add a
BinaryIntegerKnownLength generateCode method to call generateCode on
its generator object.
In daffodil-core's SequenceChild.scala, add a
ScalarOrderedSequenceChild generateCode method to call generateCode on
its term's termContentBody object.
In daffodil-core's SequenceCombinator.scala, add an OrderedSequence
generateCode method to call generateCode on a single child of a
sequence (TODO: need to support generating code for each child in a
sequence with multiple children).
In daffodil-core's SpecifiedLength.scala, add a
SpecifiedLengthImplicit generateCode method to call generateCode on
its eGram object.
In daffodil-core's runtime2/GeneratedCodeCompiler.scala, implement a
GeneratedCodeCompiler compile method to find the runtime2 C header
files and "libruntime2.a" library, write the generated C code to a
temporary directory, run the C compiler, capture any compilation
diagnostics, and add them to its ProcessorFactory's diagnostics. Also
implement a GeneratedCodeCompiler dataProcessor method to return a
Runtime2DataProcessor object with the path of the executable that was
just compiled.
In daffodil-core's runtime2/GramRuntime2Mixin.scala, implement a
GramRuntime2Mixin trait with a generateCode method which throws an SDE
if a subclass doesn't implement the generateCode method.
In daffodil-core's runtime2/Runtime2DataProcessor.scala, implement a
Runtime2DataProcessor class which extends/implements
DFDL.DataProcessorBase while adding its own new parse and unparse
methods. The parse method writes the input file to a temporary
directory, runs the executable in that directory telling it to parse the
input file and write an output file, creates a ParseResult object with
the path of the output file, adds any runtime errors to the
ParseResult's diagnostics, and returns the ParseResult object. The
unparse method also writes the input file to a temporary directory,
runs the executable in that directory telling it to unparse the input
file and write an output file, creates an UnparseResult object with
the path of the output file, adds any runtime errors to the
UnparseResult's diagnostics, and returns the UnparseResult object.
Implement a Runtime2DataLocation object with all zero fields since we
can't track the executable's read position anyway. Implement both
ParseResult and UnparseResult classes using that Runtime2DataLocation
object. Make ParseResult load the output file and return its XML
data. Make UnparseResult save and return the unparse output file's
length as its finalBitPos0b field for roundtrip processing by TDML
tests.
In daffodil-core's
runtime2/generators/BinaryNumberParserGenerator.scala, implement a
BinaryIntegerKnownLengthParserGenerator class with a generateCode
method that generates the C code needed to initialize, parse, and
unparse 32-bit integer fields. Initialize the field to the memory bit
pattern 0xCDCDCDCD since I'd already had to fix a bug that was leaving
fields uninitialized; this distinctive bit pattern should make such
bugs more obvious. TODO: Make the generated C code as secure as
possible using Language-Theoretic Security functions if possible.
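As a rough illustration of what initializing, parsing, and unparsing a 32-bit big-endian field involves in C, here is a minimal sketch. The function names, the byte-buffer interface, and the 0/-1 return codes are assumptions made for this example; they are not taken from the generated code, which may read from a stream instead. The bounds check before each access is one small step toward the "as secure as possible" goal in the TODO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Fill the field with 0xCD bytes so a field that was never parsed
// shows up as the distinctive bit pattern 0xCDCDCDCD.
void
init_int32(int32_t *field)
{
    assert(field != NULL);
    memset(field, 0xCD, sizeof(*field));
}

// Read 4 big-endian bytes starting at *pos (hypothetical interface).
// Returns 0 on success or -1 if fewer than 4 bytes remain.
int
parse_be_int32(const uint8_t *buf, size_t len, size_t *pos, int32_t *field)
{
    assert(buf != NULL && pos != NULL && field != NULL);
    if (*pos > len || len - *pos < 4)
    {
        return -1;
    }
    uint32_t raw = ((uint32_t)buf[*pos] << 24) | ((uint32_t)buf[*pos + 1] << 16) |
                   ((uint32_t)buf[*pos + 2] << 8) | (uint32_t)buf[*pos + 3];
    memcpy(field, &raw, sizeof(raw)); // copy the bits instead of casting to signed
    *pos += 4;
    return 0;
}

// Write the field as 4 big-endian bytes starting at *pos.
// Returns 0 on success or -1 if fewer than 4 bytes of space remain.
int
unparse_be_int32(int32_t field, uint8_t *buf, size_t len, size_t *pos)
{
    assert(buf != NULL && pos != NULL);
    if (*pos > len || len - *pos < 4)
    {
        return -1;
    }
    uint32_t raw = (uint32_t)field; // signed-to-unsigned conversion is well defined
    buf[*pos] = (uint8_t)(raw >> 24);
    buf[*pos + 1] = (uint8_t)(raw >> 16);
    buf[*pos + 2] = (uint8_t)(raw >> 8);
    buf[*pos + 3] = (uint8_t)raw;
    *pos += 4;
    return 0;
}
```

Assembling the value from individual bytes keeps the sketch independent of the host's endianness, which matters when the same generated code is compiled for different embedded targets.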
In daffodil-core's runtime2/generators/ElementParserGenerator.scala,
implement an ElementParserGenerator class with a generateCode method
that makes the CodeGeneratorState calls needed for both complex and
simple elements.
In daffodil-core's runtime2/generators/ParserGenerator.scala,
implement a ParserGenerator trait with a generateCode method and
implement a CodeGeneratorState class with many methods to generate and
accumulate strings of generated C code. Implement a ComplexCGState
class to accumulate strings of generated C code for nested elements
inside complex elements. TODO: In defineQNameInit, we try to optimize
away a single extraneous xmlns=ns declaration in a child element when
its parent has the same xmlns=ns declaration, but our approach doesn't
handle multiple xmlns=ns declarations and has not been tested on corner
cases yet.
In daffodil-core's runtime2/TestGeneratedCodeCompiler.scala, write
methods to test GeneratedCodeCompiler's compile method and
Runtime2DataProcessor's parse and unparse methods.
In daffodil-core's tdml.xsd, add "daffodil-runtime2" as a new TDML
implementation enumeration as well as "daffodil" and "ibm".
In daffodil-propgen's dafext.xsd, add "tdmlImplementation" as a new
tunable with default value "daffodil". See also the code in
daffodil-tdml-lib's TDMLRunner.scala which instantiates three
different TDMLDFDLProcessorFactory implementations depending on the
tdmlImplementation tunable's value ("daffodil", "daffodil-runtime2",
or "ibm"). Also add "runtime" as a new tunable with default value
"runtime1" and allowed value "runtime2". TODO: Need to define usage
for this "runtime" tunable and implement its usage in the rest of
daffodil. Does it make sense to use a tunable set to "runtime2"
instead of "runtime1" when running "daffodil parse ..." or "daffodil
unparse ..." from the command line? Dynamically generating, building,
and running a C-based executable in runtime2 may not speed up these
commands very much compared to runtime1's Scala code. Perhaps we
should add a new "daffodil generate ..." subcommand which will
generate C code from a given DFDL schema so you can use that C code to
build your own application.
In daffodil-runtime1's DFDLParserUnparser.scala, split the original
DataProcessor trait into a DataProcessorBase trait without the
WithDiagnostics trait or parse/unparse methods and a DataProcessor
trait extending DataProcessorBase and adding the WithDiagnostics trait
along with parse/unparse methods. The reason is to allow
Runtime2DataProcessor to extend DataProcessorBase and add its own
parse/unparse methods with different parameters and return types
without having to implement WithDiagnostics. Also add a
CodeGeneratorState trait with no methods which will be extended by
runtime2's CodeGeneratorState class in case we need to modularize
runtime2 for pluggability. TODO: Should we make runtime2 pluggable?
In daffodil-runtime2's .clang-format, define the C coding style to be
used when formatting the runtime2 C files. We are using the Barr
Group's Embedded C style recommendations:
- braces on their own lines, BSD/Allman style
- indent 4 spaces (no tab characters)
- align decl names on first char
- put function definition names in first column
Note we also run include-what-you-use (iwyu) on the runtime2 C files
to make sure each file has all the #includes for everything it uses
while removing any extraneous #includes.
In daffodil-runtime2's .vscode/launch.json and tasks.json, tell Visual
Studio Code how to compile and debug the runtime2 C files (used only
to make development/editing of these files easier).
In daffodil-runtime2's common_runtime.{c,h}, implement a walkInfoset
method to walk a runtime2 infoset while calling VisitEventHandler
methods, and define runtime2 common types and structs such as
NamedQName, TypeCode, ElementRuntimeData, InfosetBase, PState, UState,
and VisitEventHandler.
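To make the walk-plus-handler idea concrete, here is a simplified sketch of a recursive infoset walk that calls back into a table of function pointers. The struct layouts and the names InfosetNode, VisitHandler, walk_infoset, and SumHandler are invented for this example; the actual definitions in common_runtime.h may differ considerably.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct InfosetNode InfosetNode;
typedef struct VisitHandler VisitHandler;

// Hypothetical, simplified node: complex elements have children,
// simple elements carry a single 32-bit value.
struct InfosetNode
{
    const char         *name;
    size_t              numChildren;
    const InfosetNode **children; // NULL for simple elements
    int32_t             value;    // used only by simple elements
};

// Callbacks invoked during the walk (hypothetical signatures).
struct VisitHandler
{
    void (*visitStartComplex)(VisitHandler *self, const InfosetNode *node);
    void (*visitEndComplex)(VisitHandler *self, const InfosetNode *node);
    void (*visitInt32Elem)(VisitHandler *self, const InfosetNode *node);
};

// Depth-first walk: start event, children in order, end event.
void
walk_infoset(VisitHandler *handler, const InfosetNode *node)
{
    assert(handler != NULL && node != NULL);
    if (node->children == NULL)
    {
        handler->visitInt32Elem(handler, node);
        return;
    }
    handler->visitStartComplex(handler, node);
    for (size_t i = 0; i < node->numChildren; i++)
    {
        walk_infoset(handler, node->children[i]);
    }
    handler->visitEndComplex(handler, node);
}

// Example handler that sums simple values and tracks nesting depth.
typedef struct SumHandler
{
    VisitHandler base; // first member, so SumHandler* converts to VisitHandler*
    int64_t      sum;
    int          depth;
    int          maxDepth;
} SumHandler;

void
sum_start(VisitHandler *self, const InfosetNode *node)
{
    SumHandler *h = (SumHandler *)self;
    (void)node;
    h->depth++;
    if (h->depth > h->maxDepth)
    {
        h->maxDepth = h->depth;
    }
}

void
sum_end(VisitHandler *self, const InfosetNode *node)
{
    SumHandler *h = (SumHandler *)self;
    (void)node;
    h->depth--;
}

void
sum_int32(VisitHandler *self, const InfosetNode *node)
{
    SumHandler *h = (SumHandler *)self;
    h->sum += node->value;
}
```

The same walk can drive different handlers, which is presumably why an XML reader and an XML writer can both be expressed as sets of callbacks over one traversal.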
In daffodil-runtime2's daffodil_argp.{c,h}, implement all the code
needed to support the runtime2 executable's "daffodil parse" and
"daffodil unparse" command line interface arguments (following
daffodil's Scala CLI syntax as closely as possible).
In daffodil-runtime2's daffodil_main.c, implement the runtime2
executable's main method which doesn't need to know anything about the
generated C code except how to initialize it by calling a
rootInfoset() method. The only C files which need to be generated by
runtime2 are the two files "generated_code.h" and "generated_code.c".
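One way such decoupling can work is sketched below: the hand-written driver sees only a base struct with a function pointer, and the generated code supplies the concrete element behind rootInfoset(). Everything here apart from the name rootInfoset is an assumption made for illustration, including its return type, the Infoset and C1 structs, and the run_parse helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Infoset Infoset;

// All the driver needs to know: a root object with a parse entry point.
struct Infoset
{
    int (*parseSelf)(Infoset *self, const uint8_t *data, size_t len);
};

// ---- Stand-in for generated code (hypothetical) ----

typedef struct C1
{
    Infoset base; // first member, so C1* and Infoset* are interchangeable
    int32_t e1;
} C1;

static int
c1_parseSelf(Infoset *self, const uint8_t *data, size_t len)
{
    C1 *c1 = (C1 *)self;
    if (data == NULL || len < 4)
    {
        return -1;
    }
    c1->e1 = (int32_t)(((uint32_t)data[0] << 24) | ((uint32_t)data[1] << 16) |
                       ((uint32_t)data[2] << 8) | (uint32_t)data[3]);
    return 0;
}

Infoset *
rootInfoset(void)
{
    static C1 instance = {{c1_parseSelf}, 0};
    return &instance.base;
}

// ---- Stand-in for the hand-written driver (hypothetical) ----

// Parse a buffer without knowing which element type is the root.
int
run_parse(const uint8_t *data, size_t len)
{
    Infoset *root = rootInfoset();
    assert(root != NULL && root->parseSelf != NULL);
    return root->parseSelf(root, data, len);
}
```

With this shape, regenerating the code for a different schema changes only what rootInfoset() returns; the driver compiles unchanged.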
In daffodil-runtime2's generated_code.{c,h}, please note the files
checked in here were originally mockups made in Visual Studio Code for
design and debugging purposes. Depending on whether your DFDL schema
is the same or different, runtime2 will generate a pair of
generated_code.{c,h} files in a temporary directory with the same or
different contents than these mockup files. We are checking in these
files only to make it easier to continue changing and debugging the
generated code in Visual Studio Code.
In daffodil-runtime2's stack.{c,h}, implement a stack used by
xml_writer.c to build an XML document while traversing the in-memory
infoset. Use statically allocated storage, not heap allocated
storage, as an exercise in case we might need to use stack.c in
another part of the runtime2 C code on an embedded device with limited
memory. However, the Mini-XML library requires heap allocated storage
anyway so switch stack.c to heap allocated storage later (TODO) if it
turns out we use mxml and stack only for TDML tests.
In daffodil-runtime2's xml_reader.{c,h}, implement VisitEventHandler
methods to walk a runtime2 infoset and use XML data from an input
file to initialize the in-memory infoset.
In daffodil-runtime2's xml_writer.{c,h}, implement VisitEventHandler
methods to walk a runtime2 infoset, push nested XML nodes on a stack,
and write the complete XML data to an output file when the walk is
complete. The stack has a statically defined maximum depth of 100
nested nodes right now which probably will be changed later.
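A fixed-capacity stack of this kind can be sketched in a few lines of C. The names NodeStack, stack_push, and stack_pop and the boolean return convention are assumptions for this example rather than the contents of stack.{c,h}; only the limit of 100 comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STACK_MAX_DEPTH 100

// All storage lives inside the struct, so no malloc/free is needed.
typedef struct NodeStack
{
    const void *items[STACK_MAX_DEPTH];
    size_t      depth; // number of items currently on the stack
} NodeStack;

void
stack_init(NodeStack *s)
{
    assert(s != NULL);
    s->depth = 0;
}

bool
stack_is_empty(const NodeStack *s)
{
    assert(s != NULL);
    return s->depth == 0;
}

bool
stack_is_full(const NodeStack *s)
{
    assert(s != NULL);
    return s->depth == STACK_MAX_DEPTH;
}

// Push an item; returns false instead of overflowing when the stack is full.
bool
stack_push(NodeStack *s, const void *item)
{
    if (stack_is_full(s))
    {
        return false;
    }
    s->items[s->depth++] = item;
    return true;
}

// Pop the top item into *item; returns false when the stack is empty.
bool
stack_pop(NodeStack *s, const void **item)
{
    assert(item != NULL);
    if (stack_is_empty(s))
    {
        return false;
    }
    *item = s->items[--s->depth];
    return true;
}
```

Reporting overflow through a return value lets the caller turn a document nested too deeply into a diagnostic rather than memory corruption; a heap-backed version could keep the same interface and grow the array instead.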
In daffodil-runtime2's TestRuntime2.dfdl.xsd, define an example DFDL
schema to be used for tests. Right now the schema has only one
top-level complex element containing three xs:int simple elements to
be parsed and unparsed.
In daffodil-runtime2's TestRuntime2.tdml, define a suite of TDML tests
with both config-runtime1 and config-runtime2 configurations and a
defaultConfig that selects one of them so you can run the TDML tests
with either runtime1 or runtime2. Right now we have only two tests,
"parse_int32" and "unparse_int32", with corresponding
"parse_int32" and "unparse_int32" files to be parsed/unparsed.
In daffodil-runtime2's TestRuntime2.scala, define a TDML runner to run
the runtime2 suite of TDML tests from the "sbt test" command line.
In daffodil-tdml-lib's TDMLRunner.scala, extend the TestCase
tdmlDFDLProcessorFactory method to allow three different
TDMLDFDLProcessorFactory implementations to be used depending on the
corresponding value of the tdmlImplementation tunable ("daffodil",
"daffodil-runtime2", or "ibm"). Fix a typo in UnparseTestCase's
roundtrip error message.
In daffodil-tdml-processor's Runtime2TDMLDFDLProcessor.scala,
implement a TDMLDFDLProcessorFactory class with implementationName
"daffodil-runtime2" and a getProcessor method which runs
GeneratedCodeCompiler's compile method and returns a
Runtime2TDMLDFDLProcessor ready to run the executable. Implement a
Runtime2TDMLDFDLProcessor class with parse and unparse methods which
run Runtime2DataProcessor's parse and unparse methods and return
Runtime2TDMLParseResult and Runtime2TDMLUnparseResult objects.
Implement Runtime2TDMLParseResult and Runtime2TDMLUnparseResult as
wrapper classes around runtime2's ParseResult and UnparseResult
classes.
In project/Dependencies.scala, add a com.lihaoyi %% os-lib dependency
to let GeneratedCodeCompiler and Runtime2DataProcessor create the
files they need to write or read and call the os commands they need to
compile the C code and run the executable.
In project/Rat.scala, fix a typo.
In project/plugins.sbt, make sbt use com.github.tnakamot % sbt-cc as
one of its plugins.