[GitHub] [incubator-daffodil] tuxji opened a new pull request #422: WIP: Add runtime2 backend with C code generator

GitBox Mon, 28 Sep 2020 14:12:17 -0700


tuxji opened a new pull request #422:
URL: https://github.com/apache/incubator-daffodil/pull/422



   This pull request adds a new runtime2 backend to Daffodil.  This pull
   request is only a baby step which can handle only 32-bit big-endian
   binary integers, but it implements enough functionality to serve as a
   start for discussion and collaboration.  This pull request's
   integration branch is a work in progress and will be rebased over time
   to keep up with the master branch so you will have to run the
   following commands every time you want to pull new changes into your
   checkout of the integration branch:
   
       git pull --rebase
       git diff ORIG_HEAD..HEAD
   
   Please use TestRuntime2.dfdl.xsd and TestRuntime2.tdml as a guide of
   how to write a small DFDL schema containing some xs:int elements and
   run daffodil on a corresponding TDML file to parse and unparse some
   binary test data using the runtime2 backend:
   
   ```bash
   $ daffodil test daffodil-runtime2/src/test/resources/org/apache/daffodil/runtime2/TestRuntime2.tdml
   Creating DFDL Test Suite for daffodil-runtime2/src/test/resources/org/apache/daffodil/runtime2/TestRuntime2.tdml
   [Pass] parse_int32
   [Pass] unparse_int32
   
   Total: 2, Pass: 2, Fail: 0, Not Found: 0
   ```
   
   The runtime2 backend will generate C code from your DFDL schema in a
   temporary directory (/tmp/NNNNNNNNNNNN with NNN... all numbers), build
   a C-based executable, run the TDML tests through the executable, and
   check the results.  You will be able to take the C code and use it as
   a parsing/unparsing library in an embedded device with limited memory
   and power.  Our goal is to implement the smallest possible conforming
   subset of DFDL as described in the "Runtime 2 Design" table of
   <https://cwiki.apache.org/confluence/display/DAFFODIL/WIP%3A+Daffodil+Runtime+2>.
   
   Note that you will need a C11 or C18 compiler ("cc") and a tiny XML
   library called Mini-XML ("mxml.h" and "libmxml.a") to build Daffodil
   on this branch and run the TDML tests.  Many systems have both of
   these available as installable binary packages with names like "gcc",
   "libmxml-dev", etc.  If there is enough demand, we can bundle the
   Mini-XML library sources with Daffodil to make it easier to build the
   runtime2 backend on systems that don't have an installable
   "libmxml-dev" package.  We also have considered making runtime2 (and a
   possible future runtime3 targeting programmable hardware logic)
   optional parts that can be distributed separately and plugged into
   Daffodil on demand (for modularity, not for legal purposes; we want
   all Daffodil code to be covered by the Apache 2 license).
   
   DAFFODIL-2202
   ______________________________________________________________________
   
   Questions & loose ends:
   
   1. I just found out that the os-lib library author stopped
   publishing Scala 2.11 builds in March 2019.  I hadn't known that
   until I enabled Daffodil's GitHub Actions CI workflow in my fork and
   saw the compilation problems.  I've opened a GitHub issue asking the
   os-lib author if he would consider publishing Scala 2.11 builds again
   for a period of time until Daffodil stops supporting Scala 2.11, but
   if he says no, should I replace all my calls of the os-lib library or
   make Daffodil start supporting Scala 2.13 and stop supporting 2.11
   like the os-lib author has done?
   
   2. Search for "TODO" in the changelog below to find some more loose
   ends and questions:
   
      a. In daffodil-core's SequenceCombinator.scala, generateCode needs
   to support generating code for each child in a sequence with multiple
   children.
   
      b. In daffodil-core's
   runtime2/generators/BinaryNumberParserGenerator.scala, generateCode
   needs to generate code that parses and unparses binary numbers as
   securely as possible.  We may need to develop some C functions that
   are more secure than what the current code uses.
   
      c. In daffodil-core's runtime2/generators/ParserGenerator.scala,
   defineQNameInit doesn't handle multiple xmlns=ns declarations yet.
   
      d. Do we need a "runtime" tunable as well as a "tdmlImplementation"
   tunable within Daffodil?  The "tdmlImplementation" tunable allows TDML
   tests to use runtime2 instead of runtime1.  The "runtime" tunable
   could allow "daffodil parse ..." and "daffodil unparse ..." commands
   to use runtime2 instead of runtime1 too, but that seems redundant
   because the runtime2 backend dynamically generates and builds an
   executable from C code and calls the C-based executable with similar
   "daffodil parse ..." and "daffodil unparse ..." command lines as well.
   I think it makes more sense to add a new "daffodil generate ..."
   command which will generate the C code and executable in your own
   directory so you can do whatever you want with the C code or
   executable afterwards.
   
      e. Should we define a pluggable CodeGeneratorState interface for
   runtime2 to implement in daffodil-runtime1's DFDLParserUnparser.scala?
   We may want to wait for a future runtime3 in order to make the common
   interface clearer.
   
      f. In daffodil-runtime2's stack.{c,h}, we may need to use
   heap-allocated storage rather than statically allocated storage even
   though we use the stack only to run TDML tests.
   
   3. The generated_code.{c,h} files checked in here were originally
   mockups made in Emacs & Visual Studio Code for design and debugging
   purposes.  At run time, runtime2 generates its own pair of
   generated_code.{c,h} files in a temporary directory, and their
   contents may or may not match these mockup files depending on your
   DFDL schema.  We are checking in these files only to make it easier
   to continue changing and debugging the mockup code in Visual Studio
   Code.
   
   ______________________________________________________________________
   
   ChangeLog:
   
   In .github/workflows/main.yml, install MSYS2 environment to give us a
   C compiler on Windows.  Install mxml library on both Linux and Windows
   so we can build/link our C code with it.
   
   In build.sbt, build daffodil-runtime2 module with CcPlugin configured
   to compile the runtime2 C source files into a "libruntime2.a" static
   library.
   
   In daffodil-cli/bin.NOTICE, fix an attribution notice.
   
   In daffodil-cli/build.sbt, configure the Universal plugin to include
   the runtime2 C header files and "libruntime2.a" library in an
   installed daffodil so that an installed daffodil can use the runtime2
   backend.  See also the code in GeneratedCodeCompiler which looks for
   the runtime2 C header and "libruntime2.a" library in either an
   installed daffodil location or the daffodil source tree depending on
   where and how the code is executed.  We are using a "simplest design
   that can work" approach until future requirements become clearer.
   
   In daffodil-core's Compiler.scala, add a ProcessorFactory
   generateCode method to call generateCode on a root document and
   return a fully populated CodeGeneratorState object containing
   generated C code.
   
   In daffodil-core's ElementDeclGrammarMixin.scala, add a
   RootGrammarMixin generateCode method to call generateCode on its
   document element.
   
   In daffodil-core's Grammar.scala, add a SeqComp generateCode method to
   call generateCode on its children.
   
   In daffodil-core's GrammarTerm.scala, give abstract class Gram a
   GramRuntime2Mixin trait as well as a GramRuntime1Mixin trait.
   
   In daffodil-core's Production.scala, add a Prod generateCode method to
   call generateCode on its gram object.
   
   In daffodil-core's ElementCombinator.scala, add an ElementCombinator
   generateCode method to call generateCode on its subComb object, add
   empty CaptureContentLengthStart, CaptureContentLengthEnd,
   CaptureValueLengthStart, and CaptureValueLengthEnd generateCode
   methods, and add an ElementParseAndUnspecifiedLength generateCode
   method to instantiate and call generateCode on an
   ElementParserGenerator.
   
   In daffodil-core's PrimitivesBinaryNumber.scala, add a
   BinaryIntegerKnownLength generateCode method to call generateCode on
   its generator object.
   
   In daffodil-core's SequenceChild.scala, add a
   ScalarOrderedSequenceChild generateCode method to call generateCode on
   its term's termContentBody object.
   
   In daffodil-core's SequenceCombinator.scala, add an OrderedSequence
   generateCode method to call generateCode on a single child of a
   sequence (TODO: need to support generating code for each child in a
   sequence with multiple children).
   
   In daffodil-core's SpecifiedLength.scala, add a
   SpecifiedLengthImplicit generateCode method to call generateCode on
   its eGram object.
   
   In daffodil-core's runtime2/GeneratedCodeCompiler.scala, implement a
   GeneratedCodeCompiler compile method to find the runtime2 C header
   files and "libruntime2.a" library, write the generated C code to a
   temporary directory, run the C compiler, capture any compilation
   diagnostics, and add them to its ProcessorFactory's diagnostics.  Also
   implement a GeneratedCodeCompiler dataProcessor method to return a
   Runtime2DataProcessor object with the path of the executable that was
   just compiled.
   
   In daffodil-core's runtime2/GramRuntime2Mixin.scala, implement a
   GramRuntime2Mixin trait with a generateCode method which throws an SDE
   if a subclass doesn't implement the generateCode method.
   
   In daffodil-core's runtime2/Runtime2DataProcessor.scala, implement a
   Runtime2DataProcessor class which extends/implements
   DFDL.DataProcessorBase while adding its own new parse and unparse
   methods.  The parse method writes the input file to a temporary
   directory, runs the executable in that directory telling it to parse the
   input file and write an output file, creates a ParseResult object with
   the path of the output file, adds any runtime errors to the
   ParseResult's diagnostics, and returns the ParseResult object.  The
   unparse method also writes the input file to a temporary directory,
   runs the executable in that directory telling it to unparse the input
   file and write an output file, creates an UnparseResult object with
   the path of the output file, adds any runtime errors to the
   UnparseResult's diagnostics, and returns the UnparseResult object.
   Implement a Runtime2DataLocation object with all zero fields since we
   can't track the executable's read position anyway.  Implement both
   ParseResult and UnparseResult classes using that Runtime2DataLocation
   object.  Make ParseResult load the output file and return its XML
   data.  Make UnparseResult save and return the unparse output file's
   length as its finalBitPos0b field for roundtrip processing by TDML
   tests.
   
   In daffodil-core's
   runtime2/generators/BinaryNumberParserGenerator.scala, implement a
   BinaryIntegerKnownLengthParserGenerator class with a generateCode
   method that generates the C code needed to initialize, parse, and
   unparse 32-bit integer fields.  Initialize the field to the memory bit
   pattern 0xCDCDCDCD since I'd already had to fix a bug that was leaving
   fields uninitialized; this distinctive bit pattern should make such
   bugs more obvious.  TODO: Make the generated C code as secure as
   possible using Language-Theoretic Security functions if possible.
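   
   As a rough illustration of the kind of C code described above, here
   is a minimal sketch of initializing, parsing, and unparsing one
   32-bit big-endian integer with explicit bounds checks.  It is not
   the code that BinaryIntegerKnownLengthParserGenerator emits; every
   name in it (init_int32, parse_be_int32, unparse_be_int32, the
   buffer/position parameters) is hypothetical and chosen only for
   this example.
   
   ```c
   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>
   
   // Hypothetical fill pattern for fields that have not been parsed yet,
   // mirroring the 0xCDCDCDCD bit pattern mentioned above.
   #define UNINITIALIZED_BITS UINT32_C(0xCDCDCDCD)
   
   // Give a field a distinctive bit pattern so a skipped parse is easy to spot.
   void
   init_int32(int32_t *field)
   {
       assert(field != NULL);
       uint32_t bits = UNINITIALIZED_BITS;
       memcpy(field, &bits, sizeof bits);
   }
   
   // Read 4 bytes at *pos as a big-endian int32.  Returns false and leaves
   // *out and *pos untouched if fewer than 4 bytes remain.
   bool
   parse_be_int32(const uint8_t *buf, size_t len, size_t *pos, int32_t *out)
   {
       assert(buf != NULL && pos != NULL && out != NULL);
       if (*pos > len || len - *pos < 4)
       {
           return false;
       }
       const uint8_t *p = buf + *pos;
       uint32_t bits = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
       memcpy(out, &bits, sizeof bits); // reinterpret bits without a lossy cast
       *pos += 4;
       return true;
   }
   
   // Write value at *pos as 4 big-endian bytes.  Returns false and writes
   // nothing if fewer than 4 bytes of space remain.
   bool
   unparse_be_int32(int32_t value, uint8_t *buf, size_t len, size_t *pos)
   {
       assert(buf != NULL && pos != NULL);
       if (*pos > len || len - *pos < 4)
       {
           return false;
       }
       uint32_t bits;
       memcpy(&bits, &value, sizeof bits);
       uint8_t *p = buf + *pos;
       p[0] = (uint8_t)(bits >> 24);
       p[1] = (uint8_t)(bits >> 16);
       p[2] = (uint8_t)(bits >> 8);
       p[3] = (uint8_t)bits;
       *pos += 4;
       return true;
   }
   ```
   
   Assembling the value byte by byte keeps the sketch independent of
   the host's endianness, and checking the remaining length before
   touching the buffer is one small step toward the "as secure as
   possible" goal in the TODO.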
   
   In daffodil-core's runtime2/generators/ElementParserGenerator.scala,
   implement an ElementParserGenerator class with a generateCode method
   that makes the CodeGeneratorState calls needed for both complex and
   simple elements.
   
   In daffodil-core's runtime2/generators/ParserGenerator.scala,
   implement a ParserGenerator trait with a generateCode method and
   implement a CodeGeneratorState class with many methods to generate and
   accumulate strings of generated C code.  Implement a ComplexCGState
   class to accumulate strings of generated C code for nested elements
   inside complex elements.  TODO: In defineQNameInit, we try to optimize
   away a single extraneous xmlns=ns declaration in a child element when
   its parent has the same xmlns=ns declaration, but our approach doesn't
   handle multiple xmlns=ns declarations and has not been tested on corner
   cases yet.
   
   In daffodil-core's runtime2/TestGeneratedCodeCompiler.scala, write
   methods to test GeneratedCodeCompiler's compile method and
   Runtime2DataProcessor's parse and unparse methods.
   
   In daffodil-core's tdml.xsd, add "daffodil-runtime2" as a new TDML
   implementation enumeration value alongside "daffodil" and "ibm".
   
   In daffodil-propgen's dafext.xsd, add "tdmlImplementation" as a new
   tunable with default value "daffodil".  See also the code in
   daffodil-tdml-lib's TDMLRunner.scala which instantiates three
   different TDMLDFDLProcessorFactory implementations depending on the
   tdmlImplementation tunable's value ("daffodil", "daffodil-runtime2",
   or "ibm").  Also add "runtime" as a new tunable with default value
   "runtime1" and allowed value "runtime2".  TODO: Need to define usage
   for this "runtime" tunable and implement its usage in the rest of
   daffodil.  Does it make sense to use a tunable set to "runtime2"
   instead of "runtime1" when running "daffodil parse ..." or "daffodil
   unparse ..." from the command line?  Dynamically generating, building,
   and running a C-based executable in runtime2 may not speed up these
   commands very much compared to runtime1's Scala code.  Perhaps we
   should add a new "daffodil generate ..."  subcommand which will
   generate C code from a given DFDL schema so you can use that C code to
   build your own application.
   
   In daffodil-runtime1's DFDLParserUnparser.scala, split the original
   DataProcessor trait into a DataProcessorBase trait without the
   WithDiagnostics trait or parse/unparse methods and a DataProcessor
   trait extending DataProcessorBase and adding the WithDiagnostics trait
   along with parse/unparse methods.  The reason is to allow
   Runtime2DataProcessor to extend DataProcessorBase and add its own
   parse/unparse methods with different parameters and return types
   without having to implement WithDiagnostics.  Also add a
   CodeGeneratorState trait with no methods which will be extended by
   runtime2's CodeGeneratorState class in case we need to modularize
   runtime2 for pluggability.  TODO: Should we make runtime2 pluggable?
   
   In daffodil-runtime2's .clang-format, define the C coding style to be
   used when formatting the runtime2 C files.  We are using the Barr
   Group's Embedded C style recommendations:
   
      - braces on their own lines, BSD/Allman style
      - indent 4 spaces (no tab characters)
      - align decl names on first char
      - put function definition names in first column
   
   Note we also run include-what-you-use (iwyu) on the runtime2 C files
   to make sure each file has all the #includes for everything it uses
   while removing any extraneous #includes.
   
   In daffodil-runtime2's .vscode/launch.json and tasks.json, tell Visual
   Studio Code how to compile and debug the runtime2 C files (used only
   to make development/editing of these files easier).
   
   In daffodil-runtime2's common_runtime.{c,h}, implement a walkInfoset
   method to walk a runtime2 infoset while calling VisitEventHandler
   methods, and define runtime2 common types and structs such as
   NamedQName, TypeCode, ElementRuntimeData, InfosetBase, PState, UState,
   and VisitEventHandler.
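   
   To make the walk-plus-handler idea concrete, here is a small
   sketch of a depth-first infoset walk that calls handler functions
   through function pointers.  The types and names below (SketchNode,
   SketchVisitor, walk_node, Tally) are hypothetical simplifications
   invented for this example; they are not the actual definitions in
   common_runtime.{c,h}.
   
   ```c
   #include <assert.h>
   #include <stddef.h>
   #include <stdint.h>
   
   typedef struct SketchNode    SketchNode;
   typedef struct SketchVisitor SketchVisitor;
   
   // Hypothetical infoset node: complex if it has children, otherwise a
   // simple 32-bit integer element.
   struct SketchNode
   {
       const char       *name;
       const SketchNode *children;     // NULL for a simple element
       size_t            num_children; // 0 for a simple element
       int32_t           value;        // used only by simple elements
   };
   
   // Hypothetical handler: one callback per kind of walk event.
   struct SketchVisitor
   {
       void (*visit_start)(SketchVisitor *visitor, const SketchNode *node);
       void (*visit_end)(SketchVisitor *visitor, const SketchNode *node);
       void (*visit_int32)(SketchVisitor *visitor, const SketchNode *node);
       void *state; // handler-specific data
   };
   
   // Walk the tree depth-first, reporting each element to the handler.
   void
   walk_node(SketchVisitor *visitor, const SketchNode *node)
   {
       assert(visitor != NULL && node != NULL);
       if (node->num_children == 0)
       {
           visitor->visit_int32(visitor, node);
           return;
       }
       visitor->visit_start(visitor, node);
       for (size_t i = 0; i < node->num_children; i++)
       {
           walk_node(visitor, &node->children[i]);
       }
       visitor->visit_end(visitor, node);
   }
   
   // Example handler state: counts complex elements and sums integer values.
   typedef struct
   {
       int     starts;
       int     ends;
       int64_t sum;
   } Tally;
   
   void
   tally_start(SketchVisitor *visitor, const SketchNode *node)
   {
       (void)node;
       ((Tally *)visitor->state)->starts++;
   }
   
   void
   tally_end(SketchVisitor *visitor, const SketchNode *node)
   {
       (void)node;
       ((Tally *)visitor->state)->ends++;
   }
   
   void
   tally_int32(SketchVisitor *visitor, const SketchNode *node)
   {
       ((Tally *)visitor->state)->sum += node->value;
   }
   ```
   
   The same walk can drive different handlers (for example one that
   writes XML and one that reads it) simply by swapping the function
   pointers, which is presumably why the walk and the handlers are
   kept separate.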
   
   In daffodil-runtime2's daffodil_argp.{c,h}, implement all the code
   needed to support the runtime2 executable's "daffodil parse" and
   "daffodil unparse" command line interface arguments (following
   daffodil's Scala CLI syntax as closely as possible).
   
   In daffodil-runtime2's daffodil_main.c, implement the runtime2
   executable's main method which doesn't need to know anything about the
   generated C code except how to initialize it by calling a
   rootInfoset() method.  The only C files which need to be generated by
   runtime2 are the two files "generated_code.h" and "generated_code.c".
   
   In daffodil-runtime2's generated_code.{c,h}, please note the files
   checked in here were originally mockups made in Visual Studio Code for
   design and debugging purposes.  At run time, runtime2 generates its
   own pair of generated_code.{c,h} files in a temporary directory, and
   their contents may or may not match these mockup files depending on
   your DFDL schema.  We are checking in these files only to make it
   easier to continue changing and debugging the generated code in
   Visual Studio Code.
   
   In daffodil-runtime2's stack.{c,h}, implement a stack used by
   xml_writer.c to build an XML document while traversing the in-memory
   infoset.  Use statically allocated storage, not heap allocated
   storage, as an exercise in case we might need to use stack.c in
   another part of the runtime2 C code on an embedded device with limited
   memory.  However, the Mini-XML library requires heap allocated storage
   anyway so switch stack.c to heap allocated storage later (TODO) if it
   turns out we use mxml and stack only for TDML tests.
   
   In daffodil-runtime2's xml_reader.{c,h}, implement VisitEventHandler
   methods to walk a runtime2 infoset and use XML data from an input
   file to initialize the in-memory infoset.
   
   In daffodil-runtime2's xml_writer.{c,h}, implement VisitEventHandler
   methods to walk a runtime2 infoset, push nested XML nodes on a stack,
   and write the complete XML data to an output file when the walk is
   complete.  The stack has a statically defined maximum depth of 100
   nested nodes right now which probably will be changed later.
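   
   For readers unfamiliar with the approach, a statically sized stack
   with a fixed maximum depth can be sketched as follows.  This is
   only an illustration of the technique, not the contents of
   stack.{c,h}; the names (FixedStack, stack_init, stack_push,
   stack_pop, STACK_MAX_DEPTH) are hypothetical, and only the depth
   limit of 100 comes from the description above.
   
   ```c
   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   
   enum
   {
       STACK_MAX_DEPTH = 100 // depth limit quoted in the description above
   };
   
   // All storage lives inside the struct, so no heap allocation is needed.
   typedef struct
   {
       void  *items[STACK_MAX_DEPTH];
       size_t count;
   } FixedStack;
   
   void
   stack_init(FixedStack *stack)
   {
       assert(stack != NULL);
       stack->count = 0;
   }
   
   // Returns false instead of overflowing when the stack is full.
   bool
   stack_push(FixedStack *stack, void *item)
   {
       assert(stack != NULL);
       if (stack->count >= STACK_MAX_DEPTH)
       {
           return false;
       }
       stack->items[stack->count++] = item;
       return true;
   }
   
   // Returns false when the stack is empty; otherwise stores the top item.
   bool
   stack_pop(FixedStack *stack, void **item)
   {
       assert(stack != NULL && item != NULL);
       if (stack->count == 0)
       {
           return false;
       }
       *item = stack->items[--stack->count];
       return true;
   }
   ```
   
   Reporting a full stack through the return value lets the caller
   turn overly deep nesting into a diagnostic rather than a buffer
   overrun, at the cost of rejecting documents nested deeper than the
   fixed limit.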
   
   In daffodil-runtime2's TestRuntime2.dfdl.xsd, define an example DFDL
   schema to be used for tests.  Right now the schema has only one
   top-level complex element containing three xs:int simple elements to
   be parsed and unparsed.
   
   In daffodil-runtime2's TestRuntime2.tdml, define a suite of TDML tests
   with both config-runtime1 and config-runtime2 configurations and a
   defaultConfig that selects one of them so you can run the TDML tests
   with either runtime1 or runtime2.  Right now we have only two tests,
   "parse_int32" and "unparse_int32", with corresponding "parse_int32"
   and "unparse_int32" files to be parsed/unparsed.
   
   In daffodil-runtime2's TestRuntime2.scala, define a TDML runner to run
   the runtime2 suite of TDML tests from the "sbt test" command line.
   
   In daffodil-tdml-lib's TDMLRunner.scala, extend the TestCase
   tdmlDFDLProcessorFactory method to allow three different
   TDMLDFDLProcessorFactory implementations to be used depending on the
   corresponding value of the tdmlImplementation tunable ("daffodil",
   "daffodil-runtime2", or "ibm").  Fix a typo in UnparseTestCase's
   roundtrip error message.
   
   In daffodil-tdml-processor's Runtime2TDMLDFDLProcessor.scala,
   implement a TDMLDFDLProcessorFactory class with implementationName
   "daffodil-runtime2" and a getProcessor method which runs
   GeneratedCodeCompiler's compile method and returns a
   Runtime2TDMLDFDLProcessor ready to run the executable.  Implement a
   Runtime2TDMLDFDLProcessor class with parse and unparse methods which
   run Runtime2DataProcessor's parse and unparse methods and return
   Runtime2TDMLParseResult and Runtime2TDMLUnparseResult objects.
   Implement Runtime2TDMLParseResult and Runtime2TDMLUnparseResult as
   wrapper classes around runtime2's ParseResult and UnparseResult
   classes.
   
   In project/Dependencies.scala, add a com.lihaoyi %% os-lib dependency
   to let GeneratedCodeCompiler and Runtime2DataProcessor create the
   files they need to write or read and call the os commands they need to
   compile the C code and run the executable.
   
   In project/Rat.scala, fix a typo.
   
   In project/plugins.sbt, make sbt use com.github.tnakamot % sbt-cc as
   one of its plugins.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
