On Fri, Sep 26, 2014 at 10:12:24AM -0400, Peter Dietz wrote:
> Hi Pablo,
>
> Ideally we'd like to be validating that dc.date.* are all valid ISO8601
> dates, but using non-standard values is quite widespread. dc.date.issued:
> WWI, or Circa 1950, or Sept 2014, or 2014 Summer, or Unknown.

I think it's optimistic to suppose that we could always get an ISO 8601
date here. Those odd values contain information that a day-date cannot
represent.

Aside from fixing the bug (unparseable dates lead to null values which
then make code which expects dates sick), we need to consider how date
metadata are used. Some need to be convertible to some kind of common
representation of time bindings, but others may not. Those on which
DSpace does not operate, and whose formats are not defined by external
metadata standards, can probably remain free-form. For the rest...

Given some time, we ought to build something that can understand
reasonable values here. I see several classes that must be addressed:

o Imprecise dates: Sep-2014, 20th Century. These have calculable
  limits but may need special-purpose parsing. They represent definite
  intervals and can be matched without much trouble.

o Fuzzy dates: c. 1969, before 1492. We might represent these as
  intervals with a "fuzzy bit" to indicate uncertainty: they cannot
  really participate in comparison without defining just how much fuzz
  is acceptable for a match.

o Named events and intervals: WWI. There are probably too many
  variations for DSpace to handle these out-of-the-box. Maybe we can
  provide room for pluggable parsing and comparison modules, to
  recognize these as convertible to our hypothetical common
  representation and to do appropriate matching.

o Unknown. We should define one or two external representations of
  this and require conformance to them if you really want to enter an
  explicitly unknown date. Or perhaps anything not parseable should be
  filed as "unknown", with a warning on the submission form as you
  suggest.

It might be best to think of these values as sentences in a small
language as we design recognizers and manipulators for them.

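To make the recognizer idea a bit more concrete, here is a rough Java
sketch of how a few of these classes might be mapped onto an interval
with "fuzzy" and "unknown" flags. Everything in it is an assumption for
illustration: DateClaim, its fields, and the handful of patterns are
hypothetical names, not existing DSpace code, and it is not based on
DCDate. Named events such as WWI simply fall through to "unknown" here.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical representation: an inclusive interval plus two flags. */
final class DateClaim {
    final LocalDate start;   // inclusive; null when unbounded or unknown
    final LocalDate end;     // inclusive; null when unbounded or unknown
    final boolean fuzzy;     // true for values like "c. 1969" or "before 1492"
    final boolean unknown;   // true when the text could not be understood
    final String original;   // the text as entered, kept for display

    private DateClaim(LocalDate start, LocalDate end,
                      boolean fuzzy, boolean unknown, String original) {
        this.start = start;
        this.end = end;
        this.fuzzy = fuzzy;
        this.unknown = unknown;
        this.original = original;
    }

    private static final Pattern YEAR = Pattern.compile("(\\d{4})");
    private static final Pattern YEAR_MONTH =
        Pattern.compile("(\\d{4})-(0[1-9]|1[0-2])");
    private static final Pattern CIRCA =
        Pattern.compile("(?:c\\.|circa)\\s*(\\d{4})", Pattern.CASE_INSENSITIVE);
    private static final Pattern BEFORE =
        Pattern.compile("before\\s+(\\d{4})", Pattern.CASE_INSENSITIVE);
    private static final Pattern MON_YEAR =
        Pattern.compile("([A-Za-z]{3})[A-Za-z]*\\.?[- ](\\d{4})");
    private static final String MONTHS = "janfebmaraprmayjunjulaugsepoctnovdec";

    /** Try each recognizer in turn; anything unrecognized becomes "unknown". */
    static DateClaim parse(String text) {
        String s = text == null ? "" : text.trim();
        Matcher m;
        if ((m = YEAR.matcher(s)).matches()) {
            return year(Integer.parseInt(m.group(1)), false, s);
        }
        if ((m = YEAR_MONTH.matcher(s)).matches()) {
            return month(Integer.parseInt(m.group(1)),
                         Integer.parseInt(m.group(2)), s);
        }
        if ((m = CIRCA.matcher(s)).matches()) {
            // Same interval as the bare year, but flagged as uncertain.
            return year(Integer.parseInt(m.group(1)), true, s);
        }
        if ((m = BEFORE.matcher(s)).matches()) {
            // Open-ended on the left; ends the day before the named year.
            LocalDate last = LocalDate.of(Integer.parseInt(m.group(1)), 1, 1)
                                      .minusDays(1);
            return new DateClaim(null, last, true, false, s);
        }
        if ((m = MON_YEAR.matcher(s)).matches()) {
            int idx = MONTHS.indexOf(m.group(1).toLowerCase(Locale.ROOT));
            if (idx >= 0 && idx % 3 == 0) {
                return month(Integer.parseInt(m.group(2)), idx / 3 + 1, s);
            }
        }
        return new DateClaim(null, null, false, true, s);
    }

    private static DateClaim year(int y, boolean fuzzy, String s) {
        return new DateClaim(LocalDate.of(y, 1, 1), LocalDate.of(y, 12, 31),
                             fuzzy, false, s);
    }

    private static DateClaim month(int y, int mo, String s) {
        YearMonth ym = YearMonth.of(y, mo);
        return new DateClaim(ym.atDay(1), ym.atEndOfMonth(), false, false, s);
    }

    /** Plain interval overlap; ignores the fuzzy bit, never matches unknowns. */
    boolean overlaps(DateClaim o) {
        if (unknown || o.unknown) {
            return false;
        }
        boolean startsInTime = start == null || o.end == null
                               || !start.isAfter(o.end);
        boolean endsInTime = end == null || o.start == null
                             || !end.isBefore(o.start);
        return startsInTime && endsInTime;
    }
}
```

With this, DateClaim.parse("Sep-2014") would give the interval 2014-09-01
through 2014-09-30, "c. 1969" would give all of 1969 with the fuzzy flag
set, and "WWI" would be filed as unknown until some pluggable recognizer
claims it. How much fuzz should count as a match is deliberately left
undecided: overlaps() compares the raw intervals only.
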
The "common representation" is probably an object containing a pair of
Dates (an interval), the "fuzzy bit", the "unknown bit", perhaps some
information about how it was understood, and methods for manipulating
these complex beasties. There is already some code (DCDate) to deal
with imprecise dates, but it may not yet be used everywhere we need it.

OTOH if the language model fits well, the best common representation
might be a parse tree.

Has anyone formally studied the kinds of date information that
repositories deal with? What they actually *need*, that is, not the
*interesting* things that people come up with when they have no
standard for dealing with novel problems. :-)

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette

