On Fri, Sep 26, 2014 at 10:12:24AM -0400, Peter Dietz wrote:
> Hi Pablo,
> 
> Ideally we'd like to be validating that dc.date.* are all valid ISO8601
> dates, but using non-standard values is quite widespread. dc.date.issued:
> WWI, or Circa 1950, or Sept 2014, or 2014 Summer, or Unknown.

I think it's optimistic to suppose that we could always get an ISO
8601 date here.  Those odd values contain information that a day-date
cannot represent.

Aside from fixing the bug (unparseable dates lead to null values, which
then break code that expects dates), we need to consider how date
metadata are used.  Some fields need to be convertible to some kind of
common representation of a span of time, but others may not.  Those on
which DSpace does not operate, and whose formats are not defined by
external metadata standards, can probably remain free-form.  For the rest...

Given some time, we ought to build something that can understand
reasonable values here.  I see several classes that must be addressed:

o  Imprecise dates:  Sep-2014, 20th Century.  These have calculable
   limits but may need special-purpose parsing.  They represent
   definite intervals and can be matched without much trouble.

o  Fuzzy dates:  c. 1969, before 1492.  We might represent these as
   intervals with a "fuzzy bit" to indicate uncertainty:  they cannot
   really participate in comparison without defining just how much
   fuzz is acceptable for a match.

o  Named events and intervals:  WWI.  There are probably too many
   variations for DSpace to handle these out-of-the-box.  Maybe we can
   provide room for pluggable parsing and comparison modules, to
   recognize these as convertible to our hypothetical common
   representation and to do appropriate matching.

o  Unknown.  We should define one or two external representations of
   this and require conformance to them if you really want to enter an
   explicitly unknown date.  Or perhaps anything not parseable should
   be filed as "unknown", with a warning on the submission form as you
   suggest.
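
To make the classes above concrete, here is a rough sketch of one
recognizer.  It is in Python only for brevity (DSpace is Java), and
everything in it is hypothetical: the function name parse_loose_date,
the NAMED_EVENTS table, the (start, end, fuzzy, unknown) tuple, and the
choice to count the 20th century as 1901-2000 are my assumptions, not
existing DSpace code.

```python
import calendar
import re
from datetime import date

# Hypothetical plug-in table for named events; a site would supply its own.
NAMED_EVENTS = {
    "wwi": (date(1914, 7, 28), date(1918, 11, 11)),
}

# Three-letter month prefixes, hard-coded to stay locale-independent.
MONTHS = {m: i + 1 for i, m in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split())}


def parse_loose_date(text):
    """Return (start, end, fuzzy, unknown) for a free-form date string.

    Anything not recognized is filed as unknown instead of raising.
    """
    s = text.strip().lower()

    # Named events and intervals: looked up, never guessed.
    if s in NAMED_EVENTS:
        start, end = NAMED_EVENTS[s]
        return (start, end, False, False)

    # Fuzzy: "c. 1969", "circa 1950" -> the whole year, marked fuzzy.
    m = re.fullmatch(r"(?:c\.|circa)\s*(\d{4})", s)
    if m:
        y = int(m.group(1))
        return (date(y, 1, 1), date(y, 12, 31), True, False)

    # Fuzzy: "before 1492" -> open at the start, so use the earliest date.
    m = re.fullmatch(r"before\s+(\d{4})", s)
    if m:
        y = int(m.group(1))
        return (date.min, date(y - 1, 12, 31), True, False)

    # Imprecise: "Sep-2014", "Sept 2014" -> first to last day of the month.
    m = re.fullmatch(r"([a-z]{3})[a-z]*\.?[-\s]+(\d{4})", s)
    if m and m.group(1) in MONTHS:
        y, mo = int(m.group(2)), MONTHS[m.group(1)]
        last_day = calendar.monthrange(y, mo)[1]
        return (date(y, mo, 1), date(y, mo, last_day), False, False)

    # Imprecise: "20th Century" -> 1901-01-01 .. 2000-12-31 (assumed convention).
    m = re.fullmatch(r"([1-9]\d?)(?:st|nd|rd|th)\s+century", s)
    if m:
        c = int(m.group(1))
        return (date((c - 1) * 100 + 1, 1, 1), date(c * 100, 12, 31), False, False)

    # Everything else, including "Unknown" and "2014 Summer", is unknown.
    return (None, None, False, True)
```

The named-event lookup is a plain dictionary here; the pluggable
modules I have in mind would replace it with something a site can
extend without touching the core parser.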

It might be best to think of these values as sentences in a small
language as we design recognizers and manipulators for them.

The "common representation" is probably an object containing a pair of
Dates (an interval), the "fuzzy bit", the "unknown bit", perhaps some
information about how it was understood, and methods for manipulating
these complex beasties.  There is already some code (DCDate) to deal
with imprecise dates, but it may not yet be used everywhere we need
it.  OTOH if the language model fits well, the best common
representation might be a parse tree.
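
A minimal sketch of what such an object might look like, again in
Python for brevity rather than Java.  This is not DCDate and not any
existing DSpace class: DateSpan, its fields, and the fuzz_days
parameter are all names I made up for illustration.  The point is only
that matching becomes an interval-overlap test in which the caller
says how much fuzz is acceptable.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class DateSpan:
    """Hypothetical common representation: an interval plus two flags."""
    start: Optional[date] = None
    end: Optional[date] = None
    fuzzy: bool = False      # the "fuzzy bit"
    unknown: bool = False    # the "unknown bit"
    source: str = ""         # the original text, i.e. how it was understood

    def _bounds(self, fuzz_days):
        # Only fuzzy spans are widened; precise ones keep their limits.
        pad = timedelta(days=fuzz_days) if self.fuzzy else timedelta(0)
        return self.start - pad, self.end + pad

    def overlaps(self, other, fuzz_days=0):
        """True if the two spans share at least one day.

        An unknown span never matches anything, including another unknown.
        """
        if self.unknown or other.unknown:
            return False
        a_start, a_end = self._bounds(fuzz_days)
        b_start, b_end = other._bounds(fuzz_days)
        return a_start <= b_end and b_start <= a_end


circa_1969 = DateSpan(date(1969, 1, 1), date(1969, 12, 31), fuzzy=True, source="c. 1969")
march_1970 = DateSpan(date(1970, 3, 1), date(1970, 3, 31), source="Mar-1970")

print(circa_1969.overlaps(march_1970))                 # False
print(circa_1969.overlaps(march_1970, fuzz_days=365))  # True
```

Whether the fuzz belongs in the comparison call, as here, or in the
stored value itself is one of the design questions we would have to
settle.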

Has anyone formally studied the kinds of date information that
repositories deal with?  What they actually *need*, that is, not the
*interesting* things that people come up with when they have no
standard for dealing with novel problems. :-)

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette