On Mon, 2010-08-09 at 10:37 -0600, Wood, Jamey wrote: 
> Are there any established best practices for converting CSV data into 
> LOD-friendly RDF?  For example, I would like to produce an LOD-friendly RDF 
> version of the "2001 - Present Net Generation by State by Type of Producer by 
> Energy Source" CSV data at:
> 
>   http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html
> 
> I'm attaching a sample of a first stab at this.  Questions I'm running into 
> include the following:
> 
> 
>  1.  Should one try to convert primitive data types (particularly strings) 
> into URI references?  Or just leave them as primitives?  Or perhaps provide 
> both (with separate predicate names)?  For example, the  sample EIA data I 
> reference has two-letter state abbreviations in one column.  Should those be 
> left alone or converted into URIs?

If the code corresponds to a concept that already has a useful URI to
link to, then yes.

In cases where the string is a code but there isn't an existing URI
scheme then one approach is to create a set of SKOS concepts to
represent the codes, recording the original code string using
skos:notation.
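
As a rough sketch of that SKOS approach (not a definitive recipe), the
following Python builds one skos:Concept per state code and keeps the
original string in skos:notation. The base URI, the function names and
the two sample states are assumptions made up for illustration, and the
labels are assumed to contain no characters that need escaping in Turtle.

```python
# Hypothetical base URI for the minted concepts; substitute one you control.
BASE = "http://example.org/def/state/"

PREFIXES = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n\n"


def concept_turtle(code, label):
    """Return a Turtle description of one SKOS concept for a code string."""
    return (
        f"<{BASE}{code}> a skos:Concept ;\n"
        f'    skos:notation "{code}" ;\n'       # the original CSV code, unchanged
        f'    skos:prefLabel "{label}"@en .\n'
    )


def scheme_turtle(codes):
    """Serialise a {code: label} mapping as a small Turtle document."""
    body = "\n".join(concept_turtle(c, l) for c, l in sorted(codes.items()))
    return PREFIXES + body


# Illustrative sample only; a real conversion would cover every code in the column.
states = {"CO": "Colorado", "AK": "Alaska"}
turtle = scheme_turtle(states)
print(turtle)
```

The data rows can then point at these concept URIs, while the two-letter
code remains recoverable from skos:notation.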

> 2.  Should one merge separate columns from the original data in order to 
> align to well-known RDF types?  For example, the sample EIA data has separate 
> "Year" and "Month" columns.  Should those be merged in the RDF version so 
> that an "xs:gYearMonth" type can be used?

Probably. Merging is useful if you are going to query via the merged
form. In a case like year/month there is also an argument for keeping
the separate forms, so that you can query by month independent of
year.
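
A minimal sketch of keeping both forms, assuming the CSV gives year and
month as integers. The ex: property names and the example.org namespace
are hypothetical placeholders, not terms from any published vocabulary.

```python
def year_month(year, month):
    """Return the xsd:gYearMonth lexical form (YYYY-MM) for a year and month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{year:04d}-{month:02d}"


def period_turtle(subject, year, month):
    """Emit the merged period plus the separate year and month values.

    The ex:period, ex:year and ex:month properties are made up for this sketch.
    """
    return (
        f"<{subject}>\n"
        f'    ex:period "{year_month(year, month)}"^^xsd:gYearMonth ;\n'
        f"    ex:year {year} ;\n"      # kept so queries can filter on year alone
        f"    ex:month {month} .\n"    # kept so queries can filter on month alone
    )


example = period_turtle("http://example.org/obs/1", 2001, 1)
print(example)
```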

> 3.  Should one attempt to introduce some sort of hierarchical structure (to 
> make the LOD more "browseable")?  The "skos:related" triples in the attached 
> sample are an initial attempt to do that.  Is this a good idea?  If so, is 
> that a reasonable predicate to use?  If it is a reasonable thing to do, we 
> would presumably craft these triples so that one could navigate through the 
> entire LOD (e.g. "state" -> "state/year" -> "state/year/month" -> 
> "state/year/month/typeOfProducer" -> 
> "state/year/month/typeOfProducer/energySource").

Another approach is to use one of the statistics-in-RDF representations
so that you can slice by the dimensions in the data.
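
To illustrate what slicing by dimension means here, the sketch below
treats each CSV row as one observation keyed by its dimension values and
fixes some of those dimensions to select a slice. It is plain Python, not
an implementation of any of the vocabularies mentioned in this message;
the column names are assumptions and the figures are invented, not real
EIA data.

```python
import csv
import io

# Invented sample rows; column names are assumptions about the cleaned-up CSV.
SAMPLE = """state,year,month,producer,source,mwh
CO,2001,1,Total,Coal,100
CO,2001,2,Total,Coal,110
WY,2001,1,Total,Coal,200
WY,2001,1,Total,Wind,5
"""

DIMENSIONS = ("state", "year", "month", "producer", "source")


def load_observations(text):
    """Turn each CSV row into one flat observation: dimensions plus a measure."""
    observations = []
    for row in csv.DictReader(io.StringIO(text)):
        obs = {dim: row[dim] for dim in DIMENSIONS}
        obs["mwh"] = float(row["mwh"])  # the measured value
        observations.append(obs)
    return observations


def slice_by(observations, **fixed):
    """Return the observations whose dimensions match every fixed value."""
    return [
        o for o in observations
        if all(o[dim] == value for dim, value in fixed.items())
    ]


observations = load_observations(SAMPLE)
coal_jan = slice_by(observations, source="Coal", month="1")
print(sum(o["mwh"] for o in coal_jan))
```

Because every observation carries all of its dimension values, any
combination can be fixed without first building a state/year/month
hierarchy.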

There is the Scovo vocabulary [1]. 

Recently a group of us have been working on an updated vocabulary for
statistics [2] based on the SDMX standard [3]. At a recent Open Data
Foundation workshop [4] we agreed to partition the SDMX-in-RDF work into
a simple "Data Cube" vocabulary [5] and extension vocabularies to
support particular domains such as aggregate statistics (SDMX) and maybe
eventually micro-data (DDI).

The Data Cube vocabulary is very much a work in progress, but I think we
have now closed out all the main open design questions and have a draft
vocabulary. We aim to get the initial documentation to a usable state
over the coming few weeks.

Feel free to ping me offline if you would like to follow up on this.

Dave

[1] http://semanticweb.org/wiki/Scovo
[2] http://code.google.com/p/publishing-statistical-data/
[3] http://sdmx.org/
[4] http://www.odaf.org/blog/?p=39
[5] http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html