We have developed two new NiFi processors, called DaffodilParse and
DaffodilUnparse, which add support for the Daffodil open source project
[1] to NiFi. We are interested in any feedback the NiFi development
community might have. The code for the processors is available at the
following link:


https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/daffodil-nifi/browse

Note that the code currently depends on a snapshot of the latest version of
Daffodil, so it is likely not in its final form, but it is functional and
gives a good idea of how we think a Daffodil processor might work.

A little about Daffodil: for approximately the past 5 years, a group of
us have been working on the Daffodil project, an open source
implementation of the Data Format Description Language (DFDL) [2]. At a
very high level, DFDL defines a language that describes a wide variety
of data formats [3], including both text and binary, using XML schema
and annotations. It also defines how a DFDL implementation can use this
description to "parse" data into an XML infoset, and how this infoset
can be "unparsed" or serialized back into the original file format. By
using an XML infoset, DFDL provides a simple mechanism that allows one
to take advantage of the many XML technologies (e.g. XProc, XPath, XSLT,
Schematron) to validate, manipulate, create, and ingest complex data
formats.
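
To illustrate why an XML infoset is convenient, here is a minimal sketch
that runs an XPath query over an infoset using only the standard JDK XML
APIs. The infoset content and its element names (file, record, item) are
made up for illustration; a real infoset's shape depends on the DFDL
schema used, and nothing here calls Daffodil itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class InfosetQuery {
    // Hypothetical infoset; element names are illustrative assumptions,
    // not the output of any particular DFDL schema.
    static final String INFOSET =
        "<file>"
      + "<record><item>alice</item><item>30</item></record>"
      + "<record><item>bob</item><item>25</item></record>"
      + "</file>";

    // Evaluate an XPath expression and return the text of each matching node.
    static List<String> query(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Select the first field of every record.
        System.out.println(query(INFOSET, "/file/record/item[1]"));
    }
}
```

The same query would be awkward to write against the raw text or binary
form of the data, which is the main appeal of going through XML.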

The Daffodil project is nearing the 2.0 release, which will include
support for both parsing and unparsing many complex data formats. With
this maturity, we think one potential use case for Daffodil is a NiFi
processor that can ingest data and parse it to XML. This XML can then be
validated/queried/transformed with the various existing NiFi XML
processors (e.g. EvaluateXQuery, SplitXml, ValidateXml, TransformXml)
and flow into other processors. A second Daffodil NiFi processor could
read the resulting XML and unparse it back to the original file format.
The two processors mentioned above do exactly that.
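
As a rough picture of the parse/unparse round trip, the sketch below
hand-codes both directions for a trivial comma-separated format using
only JDK classes. It is a stand-in, not the Daffodil API: in Daffodil the
format knowledge comes from a DFDL schema rather than from code, and the
class, method, and element names here are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RoundTripSketch {
    // "Parse": turn comma-separated text into a DOM tree standing in for an infoset.
    static Document parse(String csv) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .newDocument();
        Element file = doc.createElement("file");
        doc.appendChild(file);
        for (String line : csv.split("\n")) {
            Element record = doc.createElement("record");
            file.appendChild(record);
            // The -1 limit keeps trailing empty fields.
            for (String field : line.split(",", -1)) {
                Element item = doc.createElement("item");
                item.setTextContent(field);
                record.appendChild(item);
            }
        }
        return doc;
    }

    // "Unparse": walk the tree and write the original text format back out.
    static String unparse(Document doc) {
        StringBuilder out = new StringBuilder();
        NodeList records = doc.getDocumentElement().getElementsByTagName("record");
        for (int r = 0; r < records.getLength(); r++) {
            NodeList items = ((Element) records.item(r)).getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                if (i > 0) out.append(',');
                out.append(items.item(i).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String original = "name,age\nalice,30\n";
        String roundTripped = unparse(parse(original));
        System.out.println(original.equals(roundTripped));
    }
}
```

In a NiFi flow, any XML processors placed between the parse and unparse
steps would operate on the tree in the middle of this round trip.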

If you would like to try out the processors, the usual 'mvn install'
will create a nar file containing the two processors. Both processors
require a single parameter, the path to a DFDL schema file (ending in
.dfdl.xsd by convention). The test directory in the repository contains
a DFDL schema describing CSV and a test file. However, the PCAP schema,
found here

  https://github.com/DFDLSchemas/PCAP

is a bit more interesting, describing multiple layers of the network
stack of a packet capture file, showing things like IPv6, IPv4, MAC/IP
addresses, ports, protocols, etc. The PCAP DFDL schema is in the
src/main/resources/xsd directory, with some example PCAP files in
src/tests/resources/tests. These have all been tested to work with NiFi
1.1.1.

Thanks and we look forward to any feedback,
- Steve


[1]
https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Daffodil%3A+Open+Source+DFDL
[2] https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
[3] https://github.com/DFDLSchemas
