Understood. Thanks for the interest! - Steve
On 08/02/2017 02:57 PM, Dave Fisher wrote:
> Hi Steve,
>
> It was not so much the lack of committers as it was the current diversity.
> That is not a blocker for entry to Incubation.
>
> I am willing to be one of the Mentors. Once there are at least two more we
> can push forward.
>
> Regards,
> Dave
>
>> On Aug 1, 2017, at 5:09 AM, Steve Lawrence <stephen.d.lawre...@gmail.com>
>> wrote:
>>
>> Discussions have died down, and I think the consensus from the responses
>> is that the issues are 1) the lack of committers and 2) the lack of a
>> champion and mentors. We hope to address #1 and grow the community as
>> part of incubation. Is anyone interested in being a champion or mentor
>> and helping us with #2?
>>
>> Thanks,
>> - Steve
>>
>> On 07/26/2017 04:06 PM, Chris Mattmann wrote:
>>> This sounds like a very interesting project.
>>>
>>> I don’t have the time to mentor at the moment but I will keep a close eye
>>> on it.
>>>
>>> Cheers,
>>> Chris Mattmann
>>>
>>> On 7/25/17, 11:53 AM, "McHenry, Kenton Guadron" <mche...@illinois.edu>
>>> wrote:
>>>
>>> Hi Dave,
>>>
>>> The developers that were at NCSA have moved on to other organizations.
>>> While we still leverage Daffodil and are very much interested in seeing it
>>> move forward, development is currently done by the Tresys team. Agreed on
>>> the synergy with Tika.
>>>
>>> Kenton McHenry, Ph.D.
>>> Principal Research Scientist, Adjunct Assistant Professor of Computer
>>> Science
>>> Deputy Director of the Scientific Software & Applications Division
>>> National Center for Supercomputing Applications, University of Illinois
>>> at Urbana-Champaign
>>>
>>> On Jul 24, 2017, at 1:55 PM, Dave Fisher
>>> <dave2w...@comcast.net> wrote:
>>>
>>> Hi Kenton,
>>>
>>> Is there any reason that you and others from the NCSA are not Initial
>>> Committers? That would make this proposal stronger.
>>>
>>> Regarding Apache Tika - it relies on other projects including Apache POI
>>> and Apache PDFBox. They are pragmatic about what is used. If Daffodil works
>>> to expand then I think that there would be good synergy between the
>>> projects. I know as a POI PMC member that the POI community has
>>> significantly benefited from the Tika community, some of whom are from
>>> MITRE.
>>>
>>> To date Tika has not emphasized structured data, although they do
>>> extract content from Excel and OpenOffice.
>>>
>>> I am intrigued.
>>>
>>> Regards,
>>> Dave
>>>
>>> On Jul 24, 2017, at 10:55 AM, McHenry, Kenton Guadron
>>> <mche...@illinois.edu> wrote:
>>>
>>> Yes, DFDL and its open source implementation Daffodil are more about
>>> file formats and getting access to the entirety of a file's contents in a
>>> consistent way through machine readable specifications. The work has
>>> implications in the area of digital preservation allowing one to preserve
>>> these machine readable specifications rather than all the tools needed to
>>> open/save a file in order to work with it. Imagine someone developing
>>> graphics software to work with 3D models and not having to worry about the
>>> hundreds of formats out there for 3D meshes (whether there are tools for
>>> opening the files and whether they can get access to those tools, whether
>>> the spec is available and worrying about how complex that spec is to
>>> implement, etc.), and simply building their code around the contents (e.g.
>>> vertices, faces, etc.). One could come up with similar scenarios for other
>>> data types (documents, images, videos, audio, depth data, numeric data).
>>> Ideally, tools built to support DFDL could someday support any format for
>>> that type without the developer having to worry about the details of how
>>> that data is represented within a file.
>>>
>>> Kenton McHenry, Ph.D.
>>> Principal Research Scientist, Adjunct Assistant Professor of Computer
>>> Science
>>> Deputy Director of the Scientific Software & Applications Division
>>> National Center for Supercomputing Applications, University of Illinois
>>> at Urbana-Champaign
>>>
>>> On Jul 24, 2017, at 10:30 AM, Steve Lawrence
>>> <stephen.d.lawre...@gmail.com> wrote:
>>>
>>> I'll preface this saying that I don't have a ton of experience with
>>> Apache Tika. But based on my understanding, Tika and Daffodil do have
>>> somewhat similar goals, but reach them in different ways. For example,
>>> Tika requires that one writes /code/ to perform data extraction, usually
>>> relying on existing Java libraries to extract the desired metadata. The
>>> downside to this is that code can be buggy, and libraries might not even
>>> exist for formats of interest (especially common with legacy and
>>> military data).
>>>
>>> Daffodil, on the other hand, does not require one to write any code.
>>> Instead, one writes a DFDL Schema (similar to XML Schema, with DFDL
>>> annotations) that fully describes the data, which Daffodil then uses to
>>> convert the data to XML/JSON for extraction. So adding support for a new
>>> format means writing a new schema rather than new code. And less code
>>> generally means fewer bugs. Also, for secure systems that require
>>> certification, generally speaking, it is easier to certify a schema as
>>> compared to code.
>>>
>>> We certainly don't believe that Daffodil could replace Tika, but it does
>>> have the potential to add new functionality to Tika for formats that do
>>> not have existing libraries. One of our goals is to look into
>>> integrating Daffodil support into tools like Tika. We'd love to hear
>>> from Tika devs if this is something they'd be interested in.
>>>
>>> I'll also add that whereas Tika tends to focus primarily on metadata,
>>> DFDL schemas usually describe an entire file format down to the byte, so
>>> one can extract more than just metadata, including text and binary
>>> data. Further differentiating, Daffodil has support for serializing data
>>> (called unparse) from the XML/JSON representation, allowing one to
>>> transform or filter data as well. We don't believe this feature is all
>>> that applicable to Tika, but it may be useful to other technologies such
>>> as filtering or data fuzzing technologies.
>>>
>>> - Steve
>>>
>>> On 07/24/2017 10:59 AM, Mike Drob wrote:
>>> What is the relationship between Daffodil and something like Apache
>>> Tika's extraction engine?
>>>
>>> On Mon, Jul 24, 2017 at 9:53 AM, Steve Lawrence
>>> <stephen.d.lawre...@gmail.com> wrote:
>>>
>>> Dear Apache Incubator Community,
>>>
>>> We would like to start a discussion around a proposal to bring Daffodil
>>> into the Apache Incubator. Daffodil is an implementation of the DFDL
>>> specification used to convert between fixed format data and XML/JSON.
>>>
>>> The draft proposal can be found in the wiki at the following URL:
>>>
>>> https://wiki.apache.org/incubator/DaffodilProposal
>>>
>>> We do not yet have a champion or mentors, but it was recommended that we
>>> create a proposal and send it to this list to potentially find those
>>> that might be interested. The text for the draft proposal is found
>>> below. We look forward to your input.
>>>
>>> Thanks,
>>> -Steve
>>>
>>> = Daffodil Proposal =
>>>
>>> == Abstract ==
>>>
>>> Daffodil is an implementation of the Data Format Description Language
>>> (DFDL) used to convert between fixed format data and XML/JSON.
>>>
>>> == Proposal ==
>>>
>>> The Data Format Description Language (DFDL) is a specification,
>>> developed by the Open Grid Forum, capable of describing many data
>>> formats, including both textual and binary, scientific and numeric,
>>> legacy and modern, commercial record-oriented, and many industry and
>>> military standards. It defines a language that is a subset of W3C XML
>>> schema to describe the logical format of the data, and annotations
>>> within the schema to describe the physical representation.
>>>
>>> Daffodil is an open source implementation of the DFDL specification that
>>> uses these DFDL schemas to parse fixed format data into an infoset,
>>> which is most commonly represented as either XML or JSON. This allows
>>> the use of well-established XML or JSON technologies and libraries to
>>> consume, inspect, and manipulate fixed format data in existing
>>> solutions. Daffodil is also capable of the reverse by serializing or
>>> "unparsing" an XML or JSON infoset back to the original data format.
>>>
>>> == Background ==
>>>
>>> Many different software solutions need to consume and manage data,
>>> including data directed routing, databases, data analysis, data
>>> cleansing, data visualizing, and more. A key aspect of such solutions is
>>> the need to transform the data into an easily consumable format.
>>> Usually, this means that for each unique data format, one develops a
>>> tool that can read and extract the necessary information, often leading
>>> to ad-hoc and data-format-specific description systems. Such systems are
>>> often proprietary, not well tested, and incompatible, leading to vendor
>>> lock-in, flawed software, and increased training costs. DFDL is a new
>>> standard, with version 1.0 completed in October of 2016, that solves
>>> these problems by defining an open standard to describe many different
>>> data formats and how to parse and unparse between the data and XML/JSON.
>>>
>>> Two closed source implementations of DFDL currently exist. The first was
>>> created by IBM and is now part of their IBM® Integration Bus product.
>>> The second was created by the European Space Agency, called DFDL4S or
>>> "DFDL for Space" targeted at the challenges of their satellite data
>>> processing.
>>>
>>> Around 2005, Pacific Northwest National Lab created Defuddle, built as
>>> an open source implementation and proof of concept of the draft DFDL
>>> specification and a test bed to feed new concepts into specification
>>> development. Primary development of Defuddle was eventually taken over
>>> by the National Center for Supercomputing Applications (NCSA). However,
>>> due to evolution of the DFDL specification and architectural and
>>> performance issues with Defuddle, around 2009, NCSA restarted the
>>> project with the new name of Daffodil, with a goal of implementing the
>>> complete DFDL specification. Daffodil development continued at NCSA
>>> until around 2012, at which point development slowed due to budget
>>> limitations. Shortly thereafter, primary development was picked up by
>>> Tresys Technology where it continues today, with contributions from
>>> other entities such as the Navy Research Lab, the Air Force Research
>>> Lab, MITRE, and Booz Allen Hamilton. In February of 2015, Daffodil
>>> version 1.0.0 was released, including support for the DFDL features
>>> needed to parse many common file formats. Daffodil version 2.0.0 is
>>> expected to be released in August of 2017, which will include unparse
>>> support with one-to-one parsing feature parity.
>>>
>>> Entities including IBM, MITRE, NATO NCI Agency, Northrop-Grumman, Quark
>>> Security, Raytheon, and Tresys Technology have developed DFDL schemas
>>> for many data formats from varying technology domains, including PNG,
>>> GIF, BMP, PCAP, HL7, EDIFACT, NACHA, vCard, iCalendar, and MIL-STD-2045,
>>> many of which are publicly available on the DFDL Schemas github.
>>> There are also a number of military-application data formats, the
>>> specifications of which are not public, which have historically been
>>> very difficult and expensive to process, and for which DFDL schemas have
>>> been created or are actively in development; these include
>>> MIL-STD-6040/USMTF ATO, MIL-STD-6017/VMF, MIL-STD-6016/NATO STANAG 5516
>>> (aka "Link16").
>>>
>>> == Rationale ==
>>>
>>> Numerous software solutions exist that consume, inspect, analyze, and
>>> transform data, many of which can be found in the Apache Software
>>> Foundation (ASF). In order for tools like these to consume new types of
>>> data, custom extensions are usually required, often with high
>>> development and testing costs. Daffodil fills a clear gap in many of
>>> these solutions, providing a simple and low cost way to transform data
>>> to XML or JSON, which many of these tools natively support already. With
>>> the upcoming 2.0.0 release, the Daffodil project will have achieved a
>>> level of functionality in both parse and unparse that, when integrated
>>> into existing solutions, could provide for a new method to quickly
>>> enable support for new data formats.
>>>
>>> == Initial Goals ==
>>>
>>> * Relicense the existing code from the University of Illinois/NCSA Open
>>>   Source License to the Apache License version 2.0, working with Apache
>>>   Legal to ensure correctness, and with Daffodil contributors to get
>>>   their permission.
>>> * Move the existing codebase, documentation, bugs, and mailing lists to
>>>   the Apache hosted infrastructure
>>> * Establish a formal release process and schedule, allowing for
>>>   dependable release cycles in a manner consistent with the Apache
>>>   development process.
>>> * Build relationships with ASF projects to add Daffodil support where
>>>   appropriate
>>> * Grow the community to establish a diversity of background and
>>>   expertise.
>>>
>>> == Current Status ==
>>>
>>> === Meritocracy ===
>>>
>>> All initial committers are familiar with the principles of meritocracy.
>>> The Daffodil project has followed the model of meritocracy in the past,
>>> providing multiple outside entities commit access based on the quality
>>> of their contributions. In order to grow the Daffodil user base and
>>> development community, we are dedicated to continuing to operate
>>> Daffodil as a meritocracy.
>>>
>>> A key ingredient in a meritocracy of developers is open group code
>>> review. The Daffodil project has operated in this mode throughout its
>>> existence and this provides a forum to improve the code, verify code
>>> quality, and educate new developers on the code base.
>>>
>>> === Community ===
>>>
>>> Daffodil has a small community of users and developers. Although primary
>>> Daffodil development is done by Tresys Technology, a handful of other
>>> contributions have come from other entities including the Navy Research
>>> Lab, the Air Force Research Lab, MITRE, and Booz Allen Hamilton. In
>>> addition to developers, multiple users of Daffodil have created DFDL
>>> schemas, including entities such as MITRE, IBM, Raytheon, Quark
>>> Security, and Tresys Technology. The DFDL Schemas github community has
>>> been created as a place for DFDL schemas to be published. The Daffodil
>>> project also makes use of mailing lists, !HipChat, and Confluence
>>> Questions to build a community of users and system for support.
>>>
>>> === Core Developers ===
>>>
>>> The core developers of Daffodil are employed by Tresys Technology. We
>>> will work to grow the community among a more diverse set of developers
>>> and industries.
>>>
>>> === Alignment ===
>>>
>>> Daffodil was created as an open source project with a philosophy
>>> consistent with The Apache Way.
>>> A strong belief in meritocracy, community involvement in decisions,
>>> openness, and ensuring a high level of quality in code, documentation,
>>> and testing are some of our shared core beliefs.
>>>
>>> Further, as mentioned in the Rationale section, Daffodil fills a gap
>>> that exists in many ASF projects, including !NiFi, Spark, Storm, Hadoop,
>>> Tika, and others. In order for tools like these to consume new types of
>>> data, custom extensions are usually required. Rather than create such
>>> extensions, Daffodil provides an easy and standards-compliant way to
>>> transform data to XML or JSON, which many of these tools already
>>> natively support.
>>>
>>> == Known Risks ==
>>>
>>> === Orphaned Products ===
>>>
>>> The current core developers are the leading contributors in the space of
>>> DFDL and wish to see it flourish. Though there is some risk that the
>>> initial committers all come from the same company, a goal of entering
>>> into incubation is to grow the development community to minimize the
>>> risk of reliance on a single company.
>>>
>>> === Inexperience with Open Source ===
>>>
>>> The Daffodil project began as an open source project and has continued
>>> that model throughout development. This includes public bug tracking,
>>> git revision control, automated builds and tests, and a public wiki for
>>> documentation.
>>>
>>> Additionally, the current core developers and initial committers all
>>> work for a company that relies on, believes in, promotes, and has led or
>>> contributed to many open source software projects, including SELinux
>>> Userspace, OpenSCAP, CLIP, refpolicy, setools, RPM, and others. As such,
>>> there is low risk related to inexperience with open source software and
>>> processes.
>>>
>>> === Homogeneous Developers ===
>>>
>>> The proposed initial committers come from a single entity, though we are
>>> committed to growing the Daffodil development community to include a
>>> broad group of additional committers from a wide array of industries.
>>>
>>> === Reliance on Salaried Developers ===
>>>
>>> The proposed initial committers are paid by their employer to contribute
>>> to the Daffodil project. We expect that Daffodil development will
>>> continue with salaried developers, and are committed to growing the
>>> community to include non-salaried developers as well.
>>>
>>> === Relationship with other Apache Projects ===
>>>
>>> As mentioned in the Alignment section, Daffodil fills a clear gap in
>>> numerous other ASF projects that consume and manage large amounts of
>>> data.
>>>
>>> As a specific example, Daffodil developers have created a Daffodil
>>> Apache !NiFi Processor, currently in use in data transfer solutions,
>>> which allows one to ingest non-native data into an Apache !NiFi pipeline
>>> as XML or JSON. This processor was well received by the Apache !NiFi
>>> developers, with positive comments about the concise API and how it
>>> could handle non-native data. Daffodil developers have also successfully
>>> prototyped integration with Apache Spark. We believe Daffodil could
>>> provide a strong benefit to many other ASF projects that handle fixed
>>> format data. We anticipate working closely with such ASF projects to
>>> include Daffodil where applicable to increase their ability to support
>>> new data formats with minimal effort.
>>>
>>> Daffodil also depends on existing ASF projects, including Apache Commons
>>> and Apache Xerces.
>>>
>>> === An Excessive Fascination with the Apache Brand ===
>>>
>>> Although the Apache brand may certainly help to attract more
>>> contributors, publicity is not the reason for this proposal.
>>> We believe Daffodil could provide a great benefit to the ASF and the
>>> numerous data focused projects that comprise it, as described in the
>>> Rationale and Alignment sections. We hope to build a strong and vibrant
>>> community built around The Apache Way, and not dependent on a single
>>> company.
>>>
>>> === Documentation ===
>>>
>>> Daffodil documentation can be found at:
>>>
>>> * https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Daffodil%3A+Open+Source+DFDL
>>>
>>> Information about DFDL can be found at:
>>>
>>> * https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
>>> * https://www.ibm.com/support/knowledgecenter/en/SSMKHH_9.0.0/com.ibm.etools.mft.doc/df20060_.htm
>>>
>>> Public examples of DFDL Schemas can be found at:
>>>
>>> * https://github.com/DFDLSchemas
>>>
>>> == Initial Source ==
>>>
>>> The Daffodil git repo goes back to mid-2011 with approximately 20
>>> different contributors and feedback from many users and developers. The
>>> core codebase is written in Scala and includes both a Scala and Java
>>> API, along with Javadocs and Scaladocs for API usage. The initial code
>>> will come from the git repository currently hosted by NCSA at the
>>> University of Illinois:
>>>
>>> https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/daffodil/
>>>
>>> == Source and Intellectual Property Submission ==
>>>
>>> The complete Daffodil code is licensed under the University of
>>> Illinois/NCSA Open Source License. Much of the current codebase has been
>>> developed by Tresys Technology, who is open to relicensing the code to
>>> the Apache License version 2.0 and donating the source to the ASF.
>>> Contacts at NCSA are also open to relicensing their contributions to
>>> Apache v2. We plan to contact the other contributors and ask for
>>> permission to relicense and donate their contributed code. For those
>>> who decline or whom we cannot contact, their code will be removed or
>>> replaced.
>>> We will work closely with Apache Legal to ensure all issues related to
>>> relicensing are acceptable.
>>>
>>> == External Dependencies ==
>>>
>>> We believe all current dependencies are compatible with the ASF
>>> guidelines. Our dependency licenses come from the following license
>>> styles: Apache v2, BSD, MIT, and ICU. The list of current Daffodil
>>> dependencies and their licenses is documented here:
>>>
>>> https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Dependencies+and+Licenses
>>>
>>> == Cryptography ==
>>>
>>> None
>>>
>>> == Required Resources ==
>>>
>>> === Mailing Lists ===
>>>
>>> * comm...@daffodil.incubator.apache.org
>>> * d...@daffodil.incubator.apache.org
>>> * priv...@daffodil.incubator.apache.org
>>> * u...@daffodil.incubator.apache.org
>>>
>>> === Source Control ===
>>>
>>> git://git.apache.org/incubator-daffodil.git
>>>
>>> === Issue Tracking ===
>>>
>>> JIRA Daffodil (DFDL)
>>>
>>> === Initial Committers ===
>>>
>>> * Beth Finnegan <efinnegan at tresys dot com>
>>> * Dave Thompson <dthompson at tresys dot com>
>>> * Josh Adams <jadams at tresys dot com>
>>> * Mike Beckerle <mbeckerle at tresys dot com>
>>> * Steve Lawrence <slawrence at tresys dot com>
>>> * Taylor Wise <twise at tresys dot com>
>>>
>>> === Affiliations ===
>>>
>>> * Beth Finnegan (Tresys Technology)
>>> * Dave Thompson (Tresys Technology)
>>> * Josh Adams (Tresys Technology)
>>> * Mike Beckerle (Tresys Technology)
>>> * Steve Lawrence (Tresys Technology)
>>> * Taylor Wise (Tresys Technology)
>>>
>>> == Sponsors ==
>>>
>>> === Champion ===
>>>
>>> * TBD
>>>
>>> === Nominated Mentors ===
>>>
>>> * TBD
>>>
>>> === Sponsoring Entity ===
>>>
>>> We request the Apache Incubator to sponsor this project.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>> For additional commands, e-mail: general-h...@incubator.apache.org