dcc-associates  

Re: [dcc-associates] expressing certainty in PREMIS

Priscilla Caplan
Thu, 15 Jul 2010 07:49:57 -0700

The PREMIS Editorial Committee fairly recently discussed the situation where more than one format legitimately applied to a file. Previously, the Usage instructions said only to record the most specific format. These were revised to say "If a file or bitstream conforms to more than one format of equal specificity, each should be recorded in separate /format/ containers." The full text of the Usage note for formatDesignation is copied below.


Priscilla

Either /formatDesignation/ or at least one instance of /formatRegistry/ is required.

The most specific format (or format profile) should be recorded. A repository (or formats registry) may wish to use multipart format names (e.g., "TIFF_GeoTIFF" or "WAVE_MPEG_BWF") to achieve this specificity.

For any given file or bitstream, the most specific format identified by the repository should be recorded. A restricted or modified version of a format is considered more specific than the format; for example, GeoTIFF is more specific than TIFF; BWF is more specific than WAVE.

If a file or bitstream conforms to more than one format of equal specificity, each should be recorded in separate /format/ containers.
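
As an illustration of the last rule, here is a minimal Python sketch that builds one /format/ container per applicable format. The element names follow the PREMIS semantic units named above; the XML namespace is omitted for brevity, and the two format names are hypothetical placeholders, not registry values.

```python
import xml.etree.ElementTree as ET

def format_container(name, version=None):
    """Build one <format> element holding a single formatDesignation."""
    fmt = ET.Element("format")
    designation = ET.SubElement(fmt, "formatDesignation")
    ET.SubElement(designation, "formatName").text = name
    if version is not None:
        ET.SubElement(designation, "formatVersion").text = version
    return fmt

def object_characteristics(formats):
    """Record each (name, version) pair in its own <format> container."""
    parent = ET.Element("objectCharacteristics")
    for name, version in formats:
        parent.append(format_container(name, version))
    return parent

# Hypothetical names: two profiles of equal specificity for the same file.
oc = object_characteristics([
    ("FormatA_ProfileX", None),
    ("FormatA_ProfileY", "1.0"),
])
print(ET.tostring(oc, encoding="unicode"))
```

Each format gets its own container rather than being merged into a single multipart name, which is reserved for expressing greater specificity.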

On 7/15/2010 9:19 AM, Tim DiLauro wrote:
On Jul 14, 2010, at 12:41 PM, David Rosenthal wrote:

This means that running these tools, remembering their results, and
using those results at a later time is a very bad idea. If the
information is needed at a later time, the tools should be re-run.
And the information should be used with the knowledge that some of the
results at any given time will be wrong.

But running these tools repeatedly over large amounts of data is expensive.
Finding ways to reduce this need would be useful.

One approach to consider would be keeping the *multiple* results of each of the various tools 
rather than (or, perhaps, in addition to) the *unified* result of all of them.  Associated with 
each of these results would be some identification of the source (tools, versions of particular 
formats, etc.).  These data could be evaluated during preservation evaluation processing by looking 
one level deeper when asking the usual question: "Is this format at risk?" Instead of 
stopping there, we could start with the question: "Is this version of the data about this 
format or from this tool invalidated or questionable?"  If so, then it should be marked as 
such and a corrected version generated, if possible.  My language here is rather imprecise, but I 
hope the point is coming through.
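
To make the idea concrete, here is a rough Python sketch, assuming each stored result carries the tool name and version that produced it. The ToolResult record, the tool names, and the invalidated set are all hypothetical; nothing here reflects an existing tool's data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolResult:
    tool: str          # hypothetical identifier of the identification tool
    tool_version: str  # version of the tool that produced the result
    format_name: str   # format the tool reported for the file

def questionable(results, invalidated):
    """Return stored results whose (tool, version) has since been flagged."""
    return [r for r in results if (r.tool, r.tool_version) in invalidated]

def needs_rerun(results, invalidated):
    """Names of tools that must be re-run; all other results are reused."""
    return sorted({r.tool for r in questionable(results, invalidated)})

# Per-tool results kept side by side instead of one unified answer.
stored = [
    ToolResult("tool-a", "1.0", "TIFF"),
    ToolResult("tool-b", "2.3", "TIFF_GeoTIFF"),
]
# Assumed to come from some registry of known-bad tool versions.
invalidated = {("tool-a", "1.0")}

suspect = questionable(stored, invalidated)
print(needs_rerun(stored, invalidated))
```

Only the flagged tool would be re-run on the affected files, which is where the saving over re-running everything would come from.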

Perhaps the http://code.google.com/p/fits/ tool -- which I agree should be 
renamed to avoid confusion with the FITS *format* -- could be modified to perform 
such a re-evaluation/validation and possibly reuse this already captured data.

~Tim