We have developed two new NiFi processors, DaffodilParse and DaffodilUnparse, which add support for the Daffodil open source project [1] to NiFi. We would welcome any feedback the NiFi development community might have. The code for the processors is available at the following link:
https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/daffodil-nifi/browse

Note that this currently depends on a snapshot of the latest version of Daffodil, so this is likely not the final form, but it is functional and gives a good idea of how we think a Daffodil processor might work.

A little about Daffodil: for approximately the past 5 years, a group of us have been working on the Daffodil project, an open source implementation of the Data Format Description Language (DFDL) [2]. At a very high level, DFDL defines a language that describes a wide variety of data formats [3], including both text and binary, using XML schema and annotations. It also defines how a DFDL implementation can use this description to "parse" data into an XML infoset, and how this infoset can be "unparsed", or serialized, back into the original file format. By using an XML infoset, DFDL provides a simple mechanism that allows one to take advantage of the many XML technologies (e.g. XProc, XPath, XSLT, Schematron) to validate, manipulate, create, and ingest complex data formats.

The Daffodil project is nearing the 2.0 release, which will include support for both parsing and unparsing many complex data formats. With this maturity, we think one potential use case for Daffodil is a NiFi processor that can ingest data and parse it to XML. This XML can then be validated/queried/transformed with the various existing NiFi XML processors (e.g. EvaluateXQuery, SplitXml, ValidateXml, TransformXml) and flow into other processors. A second Daffodil NiFi processor could read the resulting XML and unparse it back to the original file format. The two processors mentioned above do exactly that.

If you would like to try out the processors, the usual 'mvn install' will create a nar file containing the two processors. Both processors require a single parameter: the path to a DFDL schema file (ending in .dfdl.xsd by convention).
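
For anyone unfamiliar with the parse/unparse round trip described above, here is a rough conceptual sketch in Python. It does not use Daffodil, DFDL, or NiFi at all: the format knowledge that a DFDL schema would supply is simply hard-coded, and the element names (file, record, item) and function names are made up for illustration. It only shows the shape of the idea: data goes to an XML infoset, the XML can be handled by ordinary XML tooling, and the infoset is then serialized back to the original bytes.

```python
import xml.etree.ElementTree as ET


def parse_csv(text):
    """Build a small XML tree from comma-separated text.

    This stands in for a DFDL "parse"; the element names are hypothetical.
    """
    root = ET.Element("file")
    for line in text.splitlines():
        record = ET.SubElement(root, "record")
        for field in line.split(","):
            ET.SubElement(record, "item").text = field
    return root


def unparse_csv(root):
    """Serialize the XML tree back to comma-separated text (the "unparse")."""
    rows = []
    for record in root.findall("record"):
        # An empty element comes back with text None, so map it to "".
        rows.append(",".join(item.text or "" for item in record.findall("item")))
    return "\n".join(rows) + "\n"


data = "last,first\nsmith,john\n"

# Parse to an infoset, serialize it as XML text (what would flow between
# processors), then read that XML back and unparse it.
infoset = parse_csv(data)
xml_text = ET.tostring(infoset, encoding="unicode")
round_trip = unparse_csv(ET.fromstring(xml_text))
```

In the real processors the hard-coded logic above is replaced by the DFDL schema passed in as the processor parameter, which is what lets the same two processors handle many different formats.
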
The test directory in the repository contains a DFDL schema describing CSV and a test file. However, the PCAP schema, found at https://github.com/DFDLSchemas/PCAP, is a bit more interesting: it describes multiple layers of the network stack of a packet capture file, showing things like IPv6, IPv4, MAC/IP addresses, ports, protocols, etc. The PCAP DFDL schema is in the src/main/resources/xsd directory, with some example PCAP files in src/tests/resources/tests. These have all been tested to work with NiFi 1.1.1.

Thanks and we look forward to any feedback,
- Steve

[1] https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Daffodil%3A+Open+Source+DFDL
[2] https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
[3] https://github.com/DFDLSchemas
