David Rosenthal
Wed, 14 Jul 2010 10:00:52 -0700
Swithun Crowe wrote:
Hello

SH> Swithun, Interesting effect of using multiple format ID tools in a
SH> bundled arrangement such as FITS. Presumably we have to understand and
SH> rectify the inconsistencies rather than just record them. What degree
SH> of disagreement have you found?

From the files I've been testing FITS with, a lot of the conflicting identifications come from .doc documents. Some are RTF documents in disguise, which can be picked up as RTF or as plain text. Others are picked up as Word documents or as Excel documents. When DROID can't determine the version of a .doc, FITS records this as a partial identification, so DROID's contribution is not used.

This is an example of a problem discussed at length in a January 2009 post to my blog:

http://blog.dshr.org/2009/01/postels-law.html

Format tools like JHOVE and DROID are far from infallible. If they are well maintained, their results should improve over time. That is, running them at time A will generate results F(A), and running them at time B > A will generate results F(B). If S(X) is the percentage of false positives and false negatives in results X, then S(F(B)) <= S(F(A)), because bugs and bad data will have been fixed in the interval from A to B.

This means that running these tools, remembering their results, and using those results at a later time is a very bad idea. If the information is needed at a later time, the tools should be re-run. And the information should be used with the knowledge that some of the results at any given time will be wrong.

The certainty in PREMIS is the problem. Format identifications, as with every type of metadata, will always be uncertain to some extent.

David.
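
The re-run policy argued for above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Identification` record, the function names, and the two stand-in tools with made-up outputs are hypothetical, and none of it reflects the actual interfaces of JHOVE, DROID, or FITS. The idea is only that identifications are computed when they are needed, carry the version of the tool that produced them, and keep disagreements visible instead of collapsing them into one certain answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Identification:
    tool: str           # name of the tool that made the claim
    tool_version: str   # version that was run, so stale claims are visible
    format_name: str    # the format the tool reported


# A tool here is any callable taking a path and returning one Identification.
Tool = Callable[[str], Identification]


def identify_now(path: str, tools: List[Tool]) -> List[Identification]:
    """Run every tool at the moment the answer is needed, not from a cache."""
    return [tool(path) for tool in tools]


def group_by_format(results: List[Identification]) -> Dict[str, List[str]]:
    """Group tool names by reported format; disagreement is kept, not hidden."""
    votes: Dict[str, List[str]] = {}
    for r in results:
        votes.setdefault(r.format_name, []).append(r.tool)
    return votes


def is_contested(results: List[Identification]) -> bool:
    """True when the tools do not all report the same format."""
    return len({r.format_name for r in results}) > 1


# Hypothetical stand-in tools with made-up outputs, for demonstration only.
def fake_tool_a(path: str) -> Identification:
    return Identification("tool-a", "1.0", "RTF")


def fake_tool_b(path: str) -> Identification:
    return Identification("tool-b", "2.3", "Plain text")
```

Calling `identify_now("memo.doc", [fake_tool_a, fake_tool_b])` and passing the result to `is_contested` would flag the file as contested, which is the kind of uncertainty a consumer of the metadata would need to see.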