I have created an initial pile of RDF, mostly.

I am in the process of experimenting with linked data for archives. My goal is 
to use existing (EAD and MARC) metadata to create RDF/XML, and then to expose 
this RDF/XML using linked data principles. Once I get that far I hope to slurp 
up the RDF/XML into a triple store, analyse the data, and learn how the whole 
process could be improved. 
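
When the time comes, the slurping might be as simple as the sketch below. It 
assumes the RDF::Trine module, an in-memory store, and the data directory 
described later in this posting; it is only an illustration, not something I 
have put into production:

  #!/usr/bin/perl

  # a sketch: read a directory of RDF/XML files into a triple store;
  # the data directory, base URI, and file name are all assumptions
  use strict;
  use warnings;
  use RDF::Trine;

  my $base   = 'http://infomotions.com/sandbox/liam/id/';
  my $model  = RDF::Trine::Model->new( RDF::Trine::Store::Memory->new );
  my $parser = RDF::Trine::Parser->new( 'rdfxml' );

  # parse each file and add its statements to the model
  foreach my $file ( glob 'data/*.rdf' ) {
    $parser->parse_file_into_model( $base, $file, $model );
  }

  print 'Number of triples: ', $model->size, "\n";

Swapping the in-memory store for something like RDF::Trine::Store::DBI would 
make the resulting model persistent and queryable over time.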

This is what I have done to date:

  * accumulated sets of EAD files and MARC
    records

  * identified and cached a few XSL stylesheets
    transforming EAD and MARCXML into RDF/XML

  * wrote a couple of Perl scripts that combine
    Bullet #1 and Bullet #2 to create HTML and
    RDF/XML

  * wrote a mod_perl module implementing
    rudimentary content negotiation

  * made the whole thing (scripts, sets of data,
    HTML, RDF/XML, etc.) available on the Web

You can see the fruits of these labors at http://infomotions.com/sandbox/liam/, 
and there you will find a few directories:

  * bin - my Perl scripts live here as well as
    a couple of support files

  * data - full of RDF/XML files -- about 4,000
    of them

  * etc - mostly stylesheets

  * id - a placeholder for the URIs and content
    negotiation

  * lib - where the actual content negotiation
    script lives

  * pages - HTML versions of the original metadata

  * src - a cache for my original metadata

  * tmp - things of brief importance; mostly trash

My Perl scripts read the metadata, create HTML and RDF/XML, and save the result 
in the pages and data directories, respectively. A person can browse these 
directories, but browsing will be difficult because there is nothing there 
except cryptic file names. Selecting any of the files should return valid HTML 
or RDF/XML. 
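
In spirit, each of the scripts boils down to something like the following 
sketch. The stylesheet name (etc/ead2rdf.xsl) and the glob pattern are 
illustrative, and the real scripts also write the HTML, but the XML::LibXSLT 
idiom is the same:

  #!/usr/bin/perl

  # a sketch of the transformation step: read an EAD file, apply an
  # XSL stylesheet, and save the resulting RDF/XML in the data
  # directory; the stylesheet name and directory paths are illustrative
  use strict;
  use warnings;
  use XML::LibXML;
  use XML::LibXSLT;

  my $parser     = XML::LibXML->new;
  my $stylesheet = XML::LibXSLT->new->parse_stylesheet(
    $parser->parse_file( 'etc/ead2rdf.xsl' )
  );

  foreach my $file ( glob 'src/ead/*.xml' ) {

    # derive the "cryptic" leaf name from the file name
    my ( $leaf ) = $file =~ m{([^/]+)\.xml$};

    # transform the EAD into RDF/XML and save it
    my $results = $stylesheet->transform( $parser->parse_file( $file ) );
    open my $out, '>', "data/$leaf.rdf" or die "Can't write $leaf: $!";
    print $out $stylesheet->output_as_bytes( $results );
    close $out;

  }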

Each cryptic name is the leaf of a URI prefixed with 
"http://infomotions.com/sandbox/liam/id/". For example, if the leaf is 
"mshm510", then the combined prefix and leaf form a resolvable URI -- 
http://infomotions.com/sandbox/liam/id/mshm510. When the user-agent says it can 
accept text/html, the HTTP server redirects the user-agent to 
http://infomotions.com/sandbox/liam/pages/mshm510.html. If the user-agent does 
not request a text/html representation, then the RDF/XML version is returned -- 
http://infomotions.com/sandbox/liam/data/mshm510.rdf. This is rudimentary 
content negotiation. Here are a few actionable URIs:

  * http://infomotions.com/sandbox/liam/id/4042gwbo
  * http://infomotions.com/sandbox/liam/id/httphdllocgovlocmusiceadmusmu004002
  * http://infomotions.com/sandbox/liam/id/ma117
  * http://infomotions.com/sandbox/liam/id/mshm509
  * http://infomotions.com/sandbox/liam/id/stcmarcocm11422551
  * http://infomotions.com/sandbox/liam/id/vilmarcvil_155543

For a good time, feed them to the W3C RDF Validator. 
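
For the truly curious, the content negotiation amounts to little more than the 
following sketch of a mod_perl response handler. The package name is made up, 
and the real module in the lib directory differs in its details, but the gist 
(look at the Accept header and redirect accordingly) is the same:

  package LiAM::ContentNegotiation;   # hypothetical package name

  # a sketch of the rudimentary content negotiation: if the user-agent
  # says it accepts text/html, redirect to the HTML representation;
  # otherwise redirect to the RDF/XML
  use strict;
  use warnings;
  use Apache2::RequestRec ();
  use APR::Table ();
  use Apache2::Const -compile => qw( REDIRECT );

  sub handler {

    my $r = shift;

    # extract the leaf from a URI like .../id/mshm510
    my ( $leaf ) = $r->uri =~ m{/id/([^/]+)$};

    # choose a representation based on the Accept header
    my $accept = $r->headers_in->get( 'Accept' ) || '';
    my $base   = 'http://infomotions.com/sandbox/liam';
    my $url    = $accept =~ m{text/html}
               ? "$base/pages/$leaf.html"
               : "$base/data/$leaf.rdf";

    # redirect and done
    $r->headers_out->set( Location => $url );
    return Apache2::Const::REDIRECT;

  }

  1;

The handler is then associated with the id directory through a 
PerlResponseHandler directive in the Apache configuration.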

The next step is to figure out how to handle file not found errors when a URI 
does not exist. Another thing to figure out is how to make potential robots 
aware of the data set. The bigger problem is simply to make the dataset more 
meaningful through the inclusion of more URIs in the RDF/XML as well as the use 
of a more consistent and standardized set of ontologies. 
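
For the first of those items, one possibility is to test for the 
representations on disk before redirecting and to return a 404 otherwise. The 
fragment below would slot into the handler sketched above; the document root 
is only a guess:

  # add NOT_FOUND to the Apache2::Const import list ...
  use Apache2::Const -compile => qw( REDIRECT NOT_FOUND );

  # ... and, inside handler(), decline when no representation of the
  # leaf exists; the document root here is only an assumption
  my $root = '/usr/local/apache/htdocs/sandbox/liam';
  return Apache2::Const::NOT_FOUND
    unless -e "$root/pages/$leaf.html" or -e "$root/data/$leaf.rdf";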

Fun with linked data?

— 
Eric Morgan
