dcc-associates  

Re: [dcc-associates] expressing certainty in PREMIS

Swithun Crowe
Wed, 14 Jul 2010 03:19:09 -0700

Hello

SH> Swithun, Interesting effect of using multiple format ID tools in a 
SH> bundled arrangement such as FITS. Presumably we have to understand and 
SH> rectify the inconsistencies rather than just record them. What degree 
SH> of disagreement have you found?

From the files I've been testing FITS with, a lot of the conflicting 
identifications come from .doc documents. Some are RTF documents in 
disguise, which can be picked up as RTF or as plain text. Others are 
picked up as Word documents or as Excel documents. When DROID can't 
determine the version of a .doc, FITS records this as a partial 
identification, so DROID's contribution is not used.

Ideally the job of improving identification and using standard identifiers 
would be pushed back to the tools that FITS uses. This probably happens 
already to some extent. And FITS is getting better all the time at 
standardising the output from tools and producing fewer false negatives.

An identification is good when several tools agree and none disagree. The 
other situations are where only one tool produces an identification, 
several tools produce conflicting identifications, or no tool produces a 
full identification. FITS labels these negative results as 
'SINGLE_RESULT', 'CONFLICT' and 'PARTIAL' respectively.
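
To make those four situations concrete, here is a rough sketch in Python 
of how the labels could be assigned. This is not the FITS code. The 
ToolResult class, the classify function and the tool names other than 
DROID are made up for illustration, and I am assuming that partial 
identifications are dropped before the remaining tools are compared, as 
happens with DROID's versionless .doc results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToolResult:
    """One tool's identification of a file (hypothetical structure)."""
    tool: str
    fmt: str
    partial: bool = False  # True when the tool could not fully identify the file


def classify(results: List[ToolResult]) -> Optional[str]:
    """Return a status label, or None when several tools agree and none disagree."""
    # Assumption: partial identifications do not take part in the comparison.
    full = [r for r in results if not r.partial]
    if not full:
        # No tool produced a full identification (this also covers no results at all).
        return "PARTIAL"
    if len(full) == 1:
        # Only one tool produced a full identification.
        return "SINGLE_RESULT"
    if len({r.fmt for r in full}) > 1:
        # At least two tools named different formats.
        return "CONFLICT"
    # Several tools agree and none disagree: a good identification.
    return None


# An RTF document in disguise as a .doc, seen differently by each tool.
disguised_rtf = [
    ToolResult("DROID", "Microsoft Word Document", partial=True),
    ToolResult("tool-a", "Rich Text Format"),
    ToolResult("tool-b", "Plain text"),
]
print(classify(disguised_rtf))
```

With DROID's partial result set aside, the two remaining tools disagree, 
so this file would be labelled as a conflict.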

I'm looking at the FITS code to see how to reduce the number of false 
negatives further. But there will never be perfect identification, and 
this won't bother people. So some inconsistencies and indications of 
uncertainty will just need to be recorded rather than resolved (or lost).

One reply I had said that the PREMIS standard might evolve in the future 
to include indications of certainty. At the moment, there is nothing wrong 
with extending the current PREMIS standard oneself.

Swithun.

--
The University of St Andrews is a charity registered in Scotland: SC013532