This is a follow-up to my initial post.

RAT 0.17 works as follows:

A collection of files is submitted.  This can be a single file, a list
of specific files, a directory, or an archive.
As each file is processed, a DocumentName is created and a Metadata
object is associated with it.

Include/exclude determines if the file should be processed and updates
metadata.
Apache Tika processes the file and extracts the MediaType, document type,
and character set.

For STANDARD files: scan for matching licenses and update the metadata.
For ARCHIVE files: determine the level of action and scan the archive,
adding documents and metadata as appropriate.

Write the file data from the metadata into an XML document.

Process the XML document into output with XSLT.

RAT 1.0.0 should work as follows:

A collection of files is submitted.  This can be a single file, a list
of specific files, a directory, or an archive.
As each file is processed, a DocumentName is created and a Metadata
object is associated with it.

Apache Tika processes the file and extracts the MediaType, document type,
and character set.

A series of plugins is consulted.
Editor plugins, if any, are applied first, in sequential order.

Scanner plugins are processed next.  Each scanner will determine which
files to include or exclude.
The RAT license scanner plugin will do the following; other plugins (Notice
scanner, crypto detection) will operate in a similar fashion:

   - include/exclude determines whether the file should be processed and
   updates the metadata.
   - for STANDARD files: scan for matching licenses and update the metadata.
   - for ARCHIVE files: determine the level of action and scan the archive,
   adding documents and metadata as appropriate.

The metadata for each file will be processed into an XML document.

The XML document will be processed with XSLT to produce output.
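
To make the proposed 1.0.0 ordering concrete, here is a minimal sketch of the
editor-then-scanner loop. Every type and name in it (Doc, EditorPlugin,
ScannerPlugin, the "editor:" and "license:" keys) is hypothetical and not
taken from the RAT code base. The Tika step and the XML/XSLT output are left
out, and the metadata is a plain map.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; none of these types exist in RAT.
class PipelineSketch {

    /** A document plus the metadata that plugins accumulate for it. */
    static final class Doc {
        final String name;
        final Map<String, String> metadata = new LinkedHashMap<>();
        Doc(String name) { this.name = name; }
    }

    /** Editors run first, in the order they were registered. */
    interface EditorPlugin {
        void edit(Doc doc);
    }

    /** Scanners run next; each one decides for itself which files it handles. */
    interface ScannerPlugin {
        boolean includes(Doc doc);
        void scan(Doc doc);
    }

    static void process(List<Doc> docs, List<EditorPlugin> editors,
                        List<ScannerPlugin> scanners) {
        for (Doc doc : docs) {
            for (EditorPlugin editor : editors) {
                editor.edit(doc);
            }
            for (ScannerPlugin scanner : scanners) {
                // Include/exclude is evaluated per scanner, not globally.
                if (scanner.includes(doc)) {
                    scanner.scan(doc);
                }
            }
        }
    }

    /** Runs one editor and one scanner over two made-up documents. */
    static List<Doc> demo() {
        List<Doc> docs = Arrays.asList(new Doc("src/Main.java"), new Doc("logo.png"));
        EditorPlugin stamp = d -> d.metadata.put("editor:seen", "true");
        ScannerPlugin license = new ScannerPlugin() {
            @Override public boolean includes(Doc d) { return d.name.endsWith(".java"); }
            @Override public void scan(Doc d) { d.metadata.put("license:family", "unknown"); }
        };
        process(docs, Collections.singletonList(stamp), Collections.singletonList(license));
        return docs;
    }

    public static void main(String[] args) {
        for (Doc d : demo()) {
            System.out.println(d.name + " -> " + d.metadata);
        }
    }
}
```

The point of the sketch is only the ordering: every editor sees every file,
while each scanner filters for itself.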

The major changes here are that the Metadata will need to have namespaced
properties so that each plugin can produce its own data without worrying
about conflicting names.
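
One possible shape for namespaced properties is sketched below. The class
name, the "namespace:name" key format, and the example namespaces "license"
and "notice" are assumptions made for illustration; the actual Metadata
design may well differ.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; this is not the RAT Metadata class.
class NamespacedMetadata {
    private final Map<String, String> values = new TreeMap<>();

    /** Builds the stored key; the namespace itself must not contain ':'. */
    static String key(String namespace, String name) {
        if (namespace.isEmpty() || namespace.indexOf(':') >= 0) {
            throw new IllegalArgumentException("bad namespace: " + namespace);
        }
        return namespace + ":" + name;
    }

    void put(String namespace, String name, String value) {
        values.put(key(namespace, name), value);
    }

    /** Returns the value, or null if the plugin never set this property. */
    String get(String namespace, String name) {
        return values.get(key(namespace, name));
    }

    /** Returns only the properties that belong to one plugin's namespace. */
    Map<String, String> forNamespace(String namespace) {
        String prefix = key(namespace, "");
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> e : values.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                result.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NamespacedMetadata m = new NamespacedMetadata();
        // Two plugins use the same property name without clashing.
        m.put("license", "status", "approved");
        m.put("notice", "status", "missing");
        System.out.println(m.get("license", "status"));
        System.out.println(m.get("notice", "status"));
    }
}
```
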
The command line options for each plugin will need to be prefixed with an
abbreviated name for the plugin.
Each plugin will produce an Apache Commons CLI Options object to describe
its command line options.
The system will need to determine the plugins that are available so that it
can generate the complete CLI Options object.
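
The prefixing and merging step could look roughly like the sketch below. It
deliberately uses plain strings in place of Commons CLI Options objects so
that it runs with the JDK alone. The Plugin interface, the prefixes "lic" and
"ntc", and the "--prefix-name" format are all assumptions. Plugin discovery
is not shown; java.util.ServiceLoader is one standard mechanism that could
supply the plugin list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; a real version would merge Commons CLI Options objects.
class PrefixedOptions {

    interface Plugin {
        String prefix();             // abbreviated plugin name
        List<String> optionNames();  // option names local to the plugin
    }

    /** Small helper to build a plugin for the example. */
    static Plugin plugin(final String prefix, final String... names) {
        return new Plugin() {
            @Override public String prefix() { return prefix; }
            @Override public List<String> optionNames() { return Arrays.asList(names); }
        };
    }

    /** Combines every plugin's options into one list, prefixing each name. */
    static List<String> allOptions(List<Plugin> plugins) {
        List<String> result = new ArrayList<>();
        for (Plugin p : plugins) {
            for (String name : p.optionNames()) {
                String full = "--" + p.prefix() + "-" + name;
                if (result.contains(full)) {
                    // Two plugins sharing a prefix would defeat the scheme.
                    throw new IllegalStateException("duplicate option " + full);
                }
                result.add(full);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Plugin> plugins = Arrays.asList(
                plugin("lic", "include", "exclude"),
                plugin("ntc", "include"));
        System.out.println(allOptions(plugins));
    }
}
```

Both example plugins define an "include" option, and the prefix keeps the two
apart on the command line.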
