Swithun Crowe
Wed, 14 Jul 2010 03:19:09 -0700
Hello

SH> Swithun, Interesting effect of using multiple format ID tools in a
SH> bundled arrangement such as FITS. Presumably we have to understand and
SH> rectify the inconsistencies rather than just record them. What degree
SH> of disagreement have you found?

From the files I've been testing FITS with, a lot of the conflicting identifications come from .doc documents. Some are RTF documents in disguise, which can be picked up as RTF or as plain text. Others are picked up as Word documents or as Excel documents. When DROID can't determine the version of a .doc, FITS records this as a partial identification, so DROID's contribution is not used.

Ideally the job of improving identification and using standard identifiers would be pushed back to the tools that FITS uses. This probably happens already to some extent. And FITS is getting better all the time at standardising the output from tools and producing fewer false negatives.

A good identification is when several tools agree and none disagree. The other situations are where only one tool produces an identification, several tools produce conflicting identifications, or no tool produces a full identification. FITS labels these negative results as 'SINGLE_RESULT', 'CONFLICT' and 'PARTIAL'.

I'm looking at the FITS code to see how to reduce the number of false negatives further. But there never will be perfect identification, and this won't bother people. So some inconsistencies and indications of uncertainty will just need to be recorded rather than resolved (or lost).

One reply I had said that the PREMIS standard might evolve in the future to include indications of certainty. At the moment, there is nothing wrong with extending the current PREMIS standard oneself.

Swithun.

-- 
The University of St Andrews is a charity registered in Scotland: SC013532
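
P.S. To make the four situations described above concrete, here is a minimal Python sketch of the decision logic as I have described it. It is not taken from the FITS code. The ToolResult type, the classify function and the 'OK' label for the good case are all hypothetical names chosen for illustration; only 'SINGLE_RESULT', 'CONFLICT' and 'PARTIAL' come from FITS. The sketch assumes that partial identifications are discarded before tools are compared, as happens with DROID's versionless .doc results.

```python
from typing import List, NamedTuple, Optional


class ToolResult(NamedTuple):
    """One tool's output for one file (hypothetical structure)."""
    tool: str
    fmt: Optional[str]      # None when the tool reported no format at all
    partial: bool = False   # True when the tool could not fully identify
                            # the file (e.g. format known, version unknown)


def classify(results: List[ToolResult]) -> str:
    """Return a status label for a set of tool results.

    'OK' is an assumed label for the good case; the other three
    labels are the ones FITS uses for negative results.
    """
    # Partial identifications are not used when comparing tools.
    full = [r for r in results if r.fmt is not None and not r.partial]
    if not full:
        return "PARTIAL"        # no tool produced a full identification
    if len({r.fmt for r in full}) > 1:
        return "CONFLICT"       # tools disagree about the format
    if len(full) == 1:
        return "SINGLE_RESULT"  # only one tool identified the file
    return "OK"                 # several tools agree, none disagree


if __name__ == "__main__":
    doc = [
        ToolResult("droid", "Microsoft Word", partial=True),
        ToolResult("toolA", "Rich Text Format"),
        ToolResult("toolB", "Plain text"),
    ]
    print(classify(doc))
```

In the example at the bottom, the DROID result is dropped as partial and the two remaining tools disagree, which is the "RTF in disguise" case mentioned above.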