*Current state*
We attempt to provide a default configuration based on ASF
requirements.
We currently categorize documents into one of six categories: Generated,
Unknown, Archive, Notice, Binary, Standard.
- Standard documents get scanned for the presence or absence of license
headers.
- Archive documents may get scanned for the presence or absence of
license headers.
- Notice files are determined by file name [1] and are excluded from
processing.
- Generated files are determined by content scanning for key phrases and
are excluded from processing.
- Unknown files are files that cannot otherwise be categorized.
- Binary files are noted but not processed.
We have filters to remove documents from processing based on file name or
directory, but no good way to access them from the command line.
We have some filters (at least in the Maven plugin) that remove files
used by source code control systems.
We are using Tika to categorize the documents. Tika produces a mime type
for each document.
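As a sketch of the current categorization step, routing a Tika-reported
MIME type (plus the notice file-name check) might look like the following.
This is not RAT's actual code; the function name, the notice-name set, and
the mapping rules are simplified assumptions for illustration (the
GENERATED check, which needs content scanning, is omitted):

```python
def categorize(mime_type: str, file_name: str) -> str:
    """Illustrative sketch: map a Tika MIME type and file name to a category."""
    # Notice detection is currently by file name (see NoteGuesser [1]).
    notice_names = {"NOTICE", "LICENSE", "LICENSE.txt", "AUTHOR", "AUTHOR.TXT"}
    if file_name in notice_names:
        return "NOTICE"
    if mime_type in ("application/zip", "application/x-tar", "application/gzip"):
        return "ARCHIVE"
    if mime_type.startswith("text/"):
        return "STANDARD"
    if mime_type.startswith(("image/", "audio/", "video/")):
        return "BINARY"
    return "UNKNOWN"
```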
*Proposed Changes*
1. Remove the NOTICE category. I think that the Notice concept is
incorrect and should be handled by other means. Some of the notices (e.g.
AUTHOR, AUTHOR.TXT, UPGRADE, UPGRADE.TXT) could simply be excluded by the
file name exclusion process. Other notices (e.g. LICENSE, LICENSE.txt)
should be scanned to determine what license is specified. The change to
scanning LICENSE files would significantly help the Archive processor.
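To illustrate the proposed LICENSE scanning, a minimal sketch could match
the file's text against a table of known license phrases. The pattern
table, the SPDX-style keys, and the function name here are hypothetical:

```python
import re

# Illustrative sample only; a real table would cover many more licenses.
LICENSE_PATTERNS = {
    "Apache-2.0": re.compile(r"Apache License,?\s+Version 2\.0", re.I),
    "MIT": re.compile(r"\bMIT License\b", re.I),
    "GPL-3.0": re.compile(r"GNU GENERAL PUBLIC LICENSE\s+Version 3", re.I),
}

def detect_license(text: str) -> str:
    """Return the first matching license identifier, or UNKNOWN."""
    for spdx_id, pattern in LICENSE_PATTERNS.items():
        if pattern.search(text):
            return spdx_id
    return "UNKNOWN"
```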
2. Deprecate the -e, --exclude, -E, --exclude-file,
--scan-hidden-directories command line arguments in favor of:
- --exclude-file-literal to exclude a literal file name (e.g.
"AUTHOR.TXT")
- --exclude-file-wildcard to exclude files based on file wildcards
(e.g. "AUTHOR.*")
- --exclude-file-regex to exclude files based on regular expressions
(e.g. "AUTHORS?(\.[Tt][Xx][Tt])?")
- --exclude-dir-literal to exclude literal directory names
- --exclude-dir-wildcard to exclude directories based on wildcards
- --exclude-dir-regex to exclude directories based on regular
expressions.
- --exclude-contents-literal to exclude files based on a literal
match to text in the file.
- --exclude-contents-regex to exclude files based on a regular
expression match to text in the file.
- --exclude-source to exclude files and directories based on the
contents of an exclusion file. Multiple file formats could be
accepted, but in general each entry carries a flag for the
file/directory/contents trichotomy and one for the
literal/wildcard/regex trichotomy. For example:
file:literal:AUTHOR.TXT or
<contents><literal>Generated by</literal></contents>
- --no-default-exclude to remove any exclusions that are included by
default.
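The literal/wildcard/regex trichotomy above could be sketched like this
(the option names are still proposals, and the matching semantics and
helper names shown are my assumptions, not an implementation):

```python
import fnmatch
import re

def matches(kind: str, pattern: str, name: str) -> bool:
    """Apply one of the three proposed match kinds to a file or directory name."""
    if kind == "literal":
        return name == pattern
    if kind == "wildcard":
        return fnmatch.fnmatch(name, pattern)
    if kind == "regex":
        return re.fullmatch(pattern, name) is not None
    raise ValueError(f"unknown match kind: {kind}")

def parse_exclude_source_line(line: str):
    """Parse one hypothetical --exclude-source entry, e.g. 'file:literal:AUTHOR.TXT'."""
    target, kind, pattern = line.split(":", 2)
    return target, kind, pattern
```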
3. Add a count of all excluded files to the XML report. This should
include counts broken down by exclude type (file, directory, contents)
and by match type (literal, wildcard, regex).
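The excluded-file counts might take a shape like the following in the XML
report; the element and attribute names here are purely illustrative:

```xml
<!-- Hypothetical shape; names and grouping are illustrative only -->
<excluded total="17">
  <byType file="9" directory="5" contents="3"/>
  <byMatch literal="6" wildcard="8" regex="3"/>
</excluded>
```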
4. Remove the GENERATED category. This is actually an exclusion based
on content and is handled above.
5. Add some processing for BINARY files. These files include image
and audio files that may carry licensing information in their metadata. Add
a "--binary <ProcessingType>" command line argument, similar to "--archive
<ProcessingType>" [2], to describe how binary files are handled, and use
Tika to extract their text and/or metadata so it can be processed the way
ARCHIVE contents are.
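As a sketch of that BINARY step: once Tika has extracted a file's
metadata, the values could be scanned for license-related text. The
function below is a stand-in for illustration only (it assumes the
metadata arrives as a key/value map, and the hint pattern is a made-up
sample, not RAT's matcher):

```python
import re

# Illustrative hint pattern; a real implementation would use RAT's
# license matchers rather than a single regex.
LICENSE_HINT = re.compile(r"(?i)\b(copyright|licen[cs]e|creative commons|apache)\b")

def scan_binary_metadata(metadata: dict) -> bool:
    """Return True if any extracted metadata value looks license-related."""
    return any(LICENSE_HINT.search(str(value)) for value in metadata.values())
```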
6. Add the mime type to the Resource element in the XML output as it can
help in detailed reporting.
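For example, the Resource element could carry the mime type as an
attribute (the attribute name is an assumption, not an existing one):

```xml
<!-- Illustrative only: the mediaType attribute name is an assumption -->
<resource name="src/main/java/Foo.java" mediaType="text/x-java-source"/>
```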
This will reduce the number of document types to four: UNKNOWN, BINARY,
ARCHIVE, STANDARD. It will simplify the processing of documents while
giving users the ability to fine-tune how files are processed.
Thoughts?
Claude
[1]
https://github.com/apache/creadur-rat/blob/master/apache-rat-core/src/main/java/org/apache/rat/document/impl/guesser/NoteGuesser.java
[2] https://github.com/apache/creadur-rat/pull/246
--
LinkedIn: http://www.linkedin.com/in/claudewarren