[
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045219#comment-14045219
]
Matthias Krueger commented on TIKA-1332:
----------------------------------------
It might be good to distinguish between the regression testing aspect of
nightly runs and the "extraction gap discovery" aspect of running Tika against
a large batch of previously untested docs.
For regression testing it would be good to generate stats on a run and compare
them with the last known "good" stats. These stats could include:
* Number/distribution of detected mime types
* Number of exceptions thrown per exception type
* Frequencies of metadata key-value pairs
* Frequencies of different word lengths extracted from content (per file type)
This could be run unsupervised with the delta to the last known "good" run
summarized in a daily report.
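To make the unsupervised comparison a bit more concrete, here is a rough sketch (class and method names are made up for illustration; none of this is existing Tika code) of how a nightly run could bucket counts by mime type, exception type, metadata key and word length, and then diff those counters against the ones from the last known "good" run:
{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

/**
 * Sketch only: collects simple per-run counters (mime types, exception
 * types, metadata keys, word-length buckets per mime type) and diffs
 * them against the counters from the last known "good" run.
 */
public class RunStatsSketch {

    // counter name -> count, e.g. "mime:application/pdf" -> 42
    private final Map<String, Long> counters = new TreeMap<String, Long>();

    private void bump(String key) {
        Long prev = counters.get(key);
        counters.put(key, prev == null ? 1L : prev + 1L);
    }

    public void processFile(Path file) {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata);
            String mime = metadata.get(Metadata.CONTENT_TYPE);
            bump("mime:" + mime);
            // frequency of each metadata key
            for (String name : metadata.names()) {
                bump("meta:" + name);
            }
            // word-length distribution of the extracted text, per file type
            for (String word : handler.toString().split("\\s+")) {
                if (word.length() > 0) {
                    bump("wordlen:" + mime + ":" + Math.min(word.length(), 20));
                }
            }
        } catch (Exception e) {
            bump("exception:" + e.getClass().getSimpleName());
        }
    }

    /** Delta of this run against the last known "good" run's counters. */
    public Map<String, Long> diffAgainst(Map<String, Long> baseline) {
        Map<String, Long> delta = new TreeMap<String, Long>();
        Set<String> keys = new TreeSet<String>(counters.keySet());
        keys.addAll(baseline.keySet());
        for (String key : keys) {
            Long current = counters.get(key);
            Long previous = baseline.get(key);
            long diff = (current == null ? 0L : current)
                      - (previous == null ? 0L : previous);
            if (diff != 0) {
                delta.put(key, diff);
            }
        }
        return delta;
    }
}
{code}
Anything left in the delta map after a run would go into the daily report; an empty (or near-empty) delta means the run matched the baseline.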
Deeper analysis of extracted metadata and content (as in Tim's cases 2 and 3)
sounds more like "gap discovery", which I guess would always need some
supervision.
> Create "eval" code
> ------------------
>
> Key: TIKA-1332
> URL: https://issues.apache.org/jira/browse/TIKA-1332
> Project: Tika
> Issue Type: Sub-task
> Components: cli, general, server
> Reporter: Tim Allison
>
> For this issue, we can start with code to gather statistics on each run (# of
> exceptions per file type, most common exceptions per file type, number of
> metadata items, total text extracted, etc.). We should also be able to
> compare one run against another. Going forward, there's plenty of room to
> improve.
--
This message was sent by Atlassian JIRA
(v6.2#6252)