Hi,

This looks like a very interesting proposal. It's a bit worrisome, though, that you have no champion or mentor. Have you been in contact with anyone at the ASF on this?
I see that the existing code appears to have 3 different copyright holders, and all code is derived from the BSD-3-clause license. It appears that all of the initial developers are from a single holder, Tresys. Is there any interest in granting committership to the other contributors?

John

On Mon, Jul 24, 2017 at 11:30 AM Steve Lawrence <stephen.d.lawre...@gmail.com> wrote:

> I'll preface this by saying that I don't have a ton of experience with
> Apache Tika. But based on my understanding, Tika and Daffodil do have
> somewhat similar goals, but reach them in different ways. For example,
> Tika requires that one writes /code/ to perform data extraction, usually
> relying on existing Java libraries to extract the desired metadata. The
> downside to this is that code can be buggy, and libraries might not even
> exist for formats of interest (especially common with legacy and
> military data).
>
> Daffodil, on the other hand, does not require one to write any code.
> Instead, one writes a DFDL Schema (similar to XML Schema, with DFDL
> annotations) that fully describes the data, which Daffodil then uses to
> convert the data to XML/JSON for extraction. So adding support for a new
> format means writing a new schema rather than new code. And less code
> generally means fewer bugs. Also, for secure systems that require
> certification, generally speaking, it is easier to certify a schema as
> compared to code.
>
> We certainly don't believe that Daffodil could replace Tika, but it does
> have the potential to add new functionality to Tika for formats that do
> not have existing libraries. One of our goals is to look into
> integrating Daffodil support into tools like Tika. We'd love to hear
> from Tika devs if this is something they'd be interested in.
>
> I'll also add that whereas Tika tends to focus primarily on metadata,
> DFDL schemas usually describe an entire file format down to the byte, so
> one can extract more than just metadata, including text and binary
> data. Further differentiating, Daffodil has support for serializing data
> (called unparse) from the XML/JSON representation, allowing one to
> transform or filter data as well. We don't believe this feature is all
> that applicable to Tika, but it may be useful to other technologies such as
> filtering or data fuzzing technologies.
>
> - Steve
>
>
> On 07/24/2017 10:59 AM, Mike Drob wrote:
> > What is the relationship between Daffodil and something like Apache Tika's
> > extraction engine?
> >
> > On Mon, Jul 24, 2017 at 9:53 AM, Steve Lawrence <stephen.d.lawre...@gmail.com> wrote:
> >
> >> Dear Apache Incubator Community,
> >>
> >> We would like to start a discussion around a proposal to bring Daffodil
> >> into the Apache Incubator. Daffodil is an implementation of the DFDL
> >> specification used to convert between fixed format data and XML/JSON.
> >>
> >> The draft proposal can be found in the wiki at the following URL:
> >>
> >> https://wiki.apache.org/incubator/DaffodilProposal
> >>
> >> We do not yet have a champion or mentors, but it was recommended that we
> >> create a proposal and send it to this list to potentially find those
> >> that might be interested. The text for the draft proposal is found
> >> below. We look forward to your input.
> >>
> >> Thanks,
> >> -Steve
> >>
> >>
> >> = Daffodil Proposal =
> >>
> >> == Abstract ==
> >>
> >> Daffodil is an implementation of the Data Format Description Language
> >> (DFDL) used to convert between fixed format data and XML/JSON.
> >>
> >> == Proposal ==
> >>
> >> The Data Format Description Language (DFDL) is a specification,
> >> developed by the Open Grid Forum, capable of describing many data
> >> formats, including both textual and binary, scientific and numeric,
> >> legacy and modern, commercial record-oriented, and many industry and
> >> military standards. It defines a language that is a subset of W3C XML
> >> schema to describe the logical format of the data, and annotations
> >> within the schema to describe the physical representation.
> >>
> >> Daffodil is an open source implementation of the DFDL specification that
> >> uses these DFDL schemas to parse fixed format data into an infoset,
> >> which is most commonly represented as either XML or JSON. This allows
> >> the use of well-established XML or JSON technologies and libraries to
> >> consume, inspect, and manipulate fixed format data in existing
> >> solutions. Daffodil is also capable of the reverse by serializing or
> >> "unparsing" an XML or JSON infoset back to the original data format.
> >>
> >> == Background ==
> >>
> >> Many different software solutions need to consume and manage data,
> >> including data directed routing, databases, data analysis, data
> >> cleansing, data visualizing, and more. A key aspect of such solutions is
> >> the need to transform the data into an easily consumable format.
> >> Usually, this means that for each unique data format, one develops a
> >> tool that can read and extract the necessary information, often leading
> >> to ad-hoc and data-format-specific description systems. Such systems are
> >> often proprietary, not well tested, and incompatible, leading to vendor
> >> lock-in, flawed software, and increased training costs. DFDL is a new
> >> standard, with version 1.0 completed in October of 2016, that solves
> >> these problems by defining an open standard to describe many different
> >> data formats and how to parse and unparse between the data and XML/JSON.
> >>
> >> Two closed source implementations of DFDL currently exist. The first was
> >> created by IBM and is now part of their IBM® Integration Bus product.
> >> The second was created by the European Space Agency, called DFDL4S or
> >> "DFDL for Space" targeted at the challenges of their satellite data
> >> processing.
> >>
> >> Around 2005, Pacific Northwest National Lab created Defuddle, built as
> >> an open source implementation and proof of concept of the draft DFDL
> >> specification and a test bed to feed new concepts into specification
> >> development. Primary development of Defuddle was eventually taken over
> >> by the National Center for Supercomputing Applications (NCSA). However,
> >> due to evolution of the DFDL specification and architectural and
> >> performance issues with Defuddle, around 2009, NCSA restarted the
> >> project with the new name of Daffodil, with a goal of implementing the
> >> complete DFDL specification. Daffodil development continued at NCSA
> >> until around 2012, at which point development slowed due to budget
> >> limitations. Shortly thereafter, primary development was picked up by
> >> Tresys Technology where it continues today, with contributions from
> >> other entities such as the Navy Research Lab, the Air Force Research
> >> Lab, MITRE, and Booz Allen Hamilton. In February of 2015, Daffodil
> >> version 1.0.0 was released, including support for the DFDL features
> >> needed to parse many common file formats. Daffodil version 2.0.0 is
> >> expected to be released in August of 2017, which will include unparse
> >> support with one-to-one parsing feature parity.
> >>
> >> Entities including IBM, MITRE, NATO NCI Agency, Northrop-Grumman, Quark
> >> Security, Raytheon, and Tresys Technology have developed DFDL schemas
> >> for many data formats from varying technology domains, including PNG,
> >> GIF, BMP, PCAP, HL7, EDIFACT, NACHA, vCard, iCalendar, and MIL-STD-2045,
> >> many of which are publicly available on the DFDL Schemas GitHub. There
> >> are also a number of military-application data formats, the
> >> specifications of which are not public, which have historically been
> >> very difficult and expensive to process, and for which DFDL schemas have
> >> been created or are actively in development; these include
> >> MIL-STD-6040/USMTF ATO, MIL-STD-6017/VMF, and MIL-STD-6016/NATO STANAG
> >> 5516 (aka "Link16").
> >>
> >> == Rationale ==
> >>
> >> Numerous software solutions exist that consume, inspect, analyze, and
> >> transform data, many of which can be found in the Apache Software
> >> Foundation (ASF). In order for tools like these to consume new types of
> >> data, custom extensions are usually required, often with high
> >> development and testing costs. Daffodil fills a clear gap in many of
> >> these solutions, providing a simple and low-cost way to transform data
> >> to XML or JSON, which many of these tools natively support already. With
> >> the upcoming 2.0.0 release, the Daffodil project will have achieved a
> >> level of functionality in both parse and unparse that, when integrated
> >> into existing solutions, could provide a new method to quickly
> >> enable support for new data formats.
> >>
> >> == Initial Goals ==
> >>
> >> * Relicense the existing code from the University of Illinois/NCSA Open
> >>   Source License to the Apache License version 2.0, working with Apache
> >>   Legal to ensure correctness, and with Daffodil contributors to get
> >>   their permission.
> >> * Move the existing codebase, documentation, bugs, and mailing lists to
> >>   the Apache hosted infrastructure
> >> * Establish a formal release process and schedule, allowing for
> >>   dependable release cycles in a manner consistent with the Apache
> >>   development process.
> >> * Build relationships with ASF projects to add Daffodil support where
> >>   appropriate
> >> * Grow the community to establish a diversity of background and expertise.
> >>
> >> == Current Status ==
> >>
> >> === Meritocracy ===
> >>
> >> All initial committers are familiar with the principles of meritocracy.
> >> The Daffodil project has followed the model of meritocracy in the past,
> >> providing multiple outside entities commit access based on the quality
> >> of their contributions. In order to grow the Daffodil user base and
> >> development community, we are dedicated to continuing to operate
> >> Daffodil as a meritocracy.
> >>
> >> A key ingredient in a meritocracy of developers is open group code
> >> review. The Daffodil project has operated in this mode throughout its
> >> existence and this provides a forum to improve the code, verify code
> >> quality, and educate new developers on the code base.
> >>
> >> === Community ===
> >>
> >> Daffodil has a small community of users and developers. Although primary
> >> Daffodil development is done by Tresys Technology, a handful of other
> >> contributions have come from other entities including the Navy Research
> >> Lab, the Air Force Research Lab, MITRE, and Booz Allen Hamilton. In
> >> addition to developers, multiple users of Daffodil have created DFDL
> >> schemas, including entities such as MITRE, IBM, Raytheon, Quark
> >> Security, and Tresys Technology. The DFDL Schemas GitHub community has
> >> been created as a place for DFDL schemas to be published. The Daffodil
> >> project also makes use of mailing lists, !HipChat, and Confluence
> >> Questions to build a community of users and a system for support.
> >>
> >> === Core Developers ===
> >>
> >> The core developers of Daffodil are employed by Tresys Technology. We
> >> will work to grow the community among a more diverse set of developers
> >> and industries.
> >>
> >> === Alignment ===
> >>
> >> Daffodil was created as an open source project with a philosophy
> >> consistent with The Apache Way. A strong belief in meritocracy,
> >> community involvement in decisions, openness, and ensuring a high level
> >> of quality in code, documentation, and testing are some of our shared
> >> core beliefs.
> >>
> >> Further, as mentioned in the Rationale section, Daffodil fills a gap
> >> that exists in many ASF projects, including !NiFi, Spark, Storm, Hadoop,
> >> Tika, and others. In order for tools like these to consume new types of
> >> data, custom extensions are usually required. Rather than create such
> >> extensions, Daffodil provides an easy and standards-compliant way to
> >> transform data to XML or JSON, which many of these tools already
> >> natively support.
> >>
> >> == Known Risks ==
> >>
> >> === Orphaned Products ===
> >>
> >> The current core developers are the leading contributors in the space of
> >> DFDL and wish to see it flourish. Though there is some risk that the
> >> initial committers all come from the same company, a goal of entering
> >> into incubation is to grow the development community to minimize the
> >> risk of reliance on a single company.
> >>
> >> === Inexperience with Open Source ===
> >>
> >> The Daffodil project began as an open source project and has continued
> >> that model throughout development. This includes public bug tracking,
> >> git revision control, automated builds and tests, and a public wiki for
> >> documentation.
> >>
> >> Additionally, the current core developers and initial committers all
> >> work for a company that relies on, believes in, promotes, and has led or
> >> contributed to many open source software projects, including SELinux
> >> Userspace, OpenSCAP, CLIP, refpolicy, setools, RPM, and others. As such,
> >> there is low risk related to inexperience with open source software and
> >> processes.
> >>
> >> === Homogeneous Developers ===
> >>
> >> The proposed initial committers come from a single entity, though we are
> >> committed to growing the Daffodil development community to include a
> >> broad group of additional committers from a wide array of industries.
> >>
> >> === Reliance on Salaried Developers ===
> >>
> >> The proposed initial committers are paid by their employer to contribute
> >> to the Daffodil project. We expect that Daffodil development will
> >> continue with salaried developers, and are committed to growing the
> >> community to include non-salaried developers as well.
> >>
> >> === Relationship with other Apache Projects ===
> >>
> >> As mentioned in the Alignment section, Daffodil fills a clear gap in
> >> numerous other ASF projects that consume and manage large amounts of data.
> >>
> >> As a specific example, Daffodil developers have created a Daffodil
> >> Apache !NiFi Processor, currently in use in data transfer solutions,
> >> which allows one to ingest non-native data into an Apache !NiFi pipeline
> >> as XML or JSON. This processor was well received by the Apache !NiFi
> >> developers, with positive comments about the concise API and how it
> >> could handle non-native data. Daffodil developers have also successfully
> >> prototyped integration with Apache Spark. We believe Daffodil could
> >> provide a strong benefit to many other ASF projects that handle fixed
> >> format data.
> >> We anticipate working closely with such ASF projects to
> >> include Daffodil where applicable to increase their ability to support
> >> new data formats with minimal effort.
> >>
> >> Daffodil also depends on existing ASF projects, including Apache Commons
> >> and Apache Xerces.
> >>
> >> === An Excessive Fascination with the Apache Brand ===
> >>
> >> Although the Apache brand may certainly help to attract more
> >> contributors, publicity is not the reason for this proposal. We believe
> >> Daffodil could provide a great benefit to the ASF and the numerous data
> >> focused projects that comprise it, as described in the Rationale and
> >> Alignment sections. We hope to build a strong and vibrant community
> >> built around The Apache Way, and not dependent on a single company.
> >>
> >> === Documentation ===
> >>
> >> Daffodil documentation can be found at:
> >>
> >> * https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Daffodil%3A+Open+Source+DFDL
> >>
> >> Information about DFDL can be found at:
> >>
> >> * https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> >> * https://www.ibm.com/support/knowledgecenter/en/SSMKHH_9.0.0/com.ibm.etools.mft.doc/df20060_.htm
> >>
> >> Public examples of DFDL Schemas can be found at:
> >>
> >> * https://github.com/DFDLSchemas
> >>
> >> == Initial Source ==
> >>
> >> The Daffodil git repo goes back to mid-2011 with approximately 20
> >> different contributors and feedback from many users and developers. The
> >> core codebase is written in Scala and includes both a Scala and Java
> >> API, along with Javadocs and Scaladocs for API usage.
> >> The initial code will come from the git repository currently hosted by
> >> NCSA at the University of Illinois:
> >>
> >> https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/daffodil/
> >>
> >> == Source and Intellectual Property Submission ==
> >>
> >> The complete Daffodil code is licensed under the University of
> >> Illinois/NCSA Open Source License. Much of the current codebase has been
> >> developed by Tresys Technology, who is open to relicensing the code to
> >> the Apache License version 2.0 and donating the source to the ASF.
> >> Contacts at NCSA are also open to relicensing their contributions to
> >> Apache v2. We plan to contact the other contributors and ask for
> >> permission to relicense and donate their contributed code. For those
> >> who decline or whom we cannot contact, their code will be removed or
> >> replaced. We will work closely with Apache Legal to ensure all issues
> >> related to relicensing are acceptable.
> >>
> >> == External Dependencies ==
> >>
> >> We believe all current dependencies are compatible with the ASF
> >> guidelines. Our dependency licenses come from the following license
> >> styles: Apache v2, BSD, MIT, and ICU.
> >> The list of current Daffodil dependencies and their licenses is
> >> documented here:
> >>
> >> https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Dependencies+and+Licenses
> >>
> >> == Cryptography ==
> >>
> >> None
> >>
> >> == Required Resources ==
> >>
> >> === Mailing Lists ===
> >>
> >> * comm...@daffodil.incubator.apache.org
> >> * d...@daffodil.incubator.apache.org
> >> * priv...@daffodil.incubator.apache.org
> >> * u...@daffodil.incubator.apache.org
> >>
> >> === Source Control ===
> >>
> >> git://git.apache.org/incubator-daffodil.git
> >>
> >> === Issue Tracking ===
> >>
> >> JIRA Daffodil (DFDL)
> >>
> >> === Initial Committers ===
> >>
> >> * Beth Finnegan <efinnegan at tresys dot com>
> >> * Dave Thompson <dthompson at tresys dot com>
> >> * Josh Adams <jadams at tresys dot com>
> >> * Mike Beckerle <mbeckerle at tresys dot com>
> >> * Steve Lawrence <slawrence at tresys dot com>
> >> * Taylor Wise <twise at tresys dot com>
> >>
> >> === Affiliations ===
> >>
> >> * Beth Finnegan (Tresys Technology)
> >> * Dave Thompson (Tresys Technology)
> >> * Josh Adams (Tresys Technology)
> >> * Mike Beckerle (Tresys Technology)
> >> * Steve Lawrence (Tresys Technology)
> >> * Taylor Wise (Tresys Technology)
> >>
> >> == Sponsors ==
> >>
> >> === Champion ===
> >>
> >> * TBD
> >>
> >> === Nominated Mentors ===
> >>
> >> * TBD
> >>
> >> === Sponsoring Entity ===
> >>
> >> We request the Apache Incubator to sponsor this project.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >> For additional commands, e-mail: general-h...@incubator.apache.org