On Mon, 2010-08-09 at 10:37 -0600, Wood, Jamey wrote:
> Are there any established best practices for converting CSV data into
> LOD-friendly RDF? For example, I would like to produce an LOD-friendly RDF
> version of the "2001 - Present Net Generation by State by Type of Producer by
> Energy Source" CSV data at:
>
> http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html
>
> I'm attaching a sample of a first stab at this. Questions I'm running into
> include the following:
>
> 1. Should one try to convert primitive data types (particularly strings)
> into URI references? Or just leave them as primitives? Or perhaps provide
> both (with separate predicate names)? For example, the sample EIA data I
> reference has two-letter state abbreviations in one column. Should those be
> left alone or converted into URIs?

If the code corresponds to a concept which has a useful URI to link to,
then "yes".

In cases where the string is a code but there isn't an existing URI
scheme, one approach is to create a set of SKOS concepts to represent
the codes, recording the original code string using skos:notation.

> 2. Should one merge separate columns from the original data in order to
> align to well-known RDF types? For example, the sample EIA data has separate
> "Year" and "Month" columns. Should those be merged in the RDF version so
> that an "xs:gYearMonth" type can be used?

Probably. Merging is useful if you are going to query via the merged
form. In a case like year/month there could be an argument for also
keeping the separate forms, to enable you to query by month independent
of year.

> 3. Should one attempt to introduce some sort of hierarchical structure (to
> make the LOD more "browseable")? The "skos:related" triples in the attached
> sample are an initial attempt to do that. Is this a good idea? If so, is
> that a reasonable predicate to use? If it is a reasonable thing to do, we
> would presumably craft these triples so that one could navigate through the
> entire LOD (e.g. "state" -> "state/year" -> "state/year/month" ->
> "state/year/month/typeOfProducer" ->
> "state/year/month/typeOfProducer/energySource").

Another approach is to use one of the statistics-in-RDF representations
so that you can slice by the dimensions in the data.

There is the Scovo vocabulary [1]. Recently a group of us have been
working on an updated vocabulary for statistics [2] based on the SDMX
standard [3]. At a recent Open Data Foundation workshop [4] we agreed to
partition the SDMX-in-RDF work into a simple "Data Cube" vocabulary [5]
and extension vocabularies to support particular domains such as
aggregate statistics (SDMX) and maybe eventually micro-data (DDI).

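To make the answers to questions 1 and 2 concrete, here is a minimal
Python sketch (standard library only) that emits Turtle for a few CSV
rows. It mints one SKOS concept per state code, records the original
string with skos:notation, and merges year and month into an
xsd:gYearMonth literal while keeping the separate values as well. The
namespace http://example.org/eia/, the ex: property names and the column
names YEAR, MONTH and STATE are assumptions made for illustration; they
are not taken from the EIA file or from any published vocabulary.

```python
import csv
import io

# Hypothetical namespace; replace with a URI space you control.
BASE = "http://example.org/eia/"

PREFIXES = """@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/eia/def#> ."""


def state_concept(code):
    """Turtle for a SKOS concept that keeps the original code string."""
    return (f"<{BASE}state/{code}> a skos:Concept ;\n"
            f'    skos:notation "{code}" .')


def year_month_literal(year, month):
    """Merge separate year and month values into an xsd:gYearMonth literal."""
    return f'"{int(year):04d}-{int(month):02d}"^^xsd:gYearMonth'


def row_to_turtle(row, row_id):
    """Turtle for one CSV row, with merged and separate period values."""
    year, month, state = row["YEAR"], row["MONTH"], row["STATE"]
    return (f"<{BASE}row/{row_id}>\n"
            f"    ex:period {year_month_literal(year, month)} ;\n"
            f'    ex:year "{int(year):04d}"^^xsd:gYear ;\n'
            f"    ex:month {int(month)} ;\n"
            f"    ex:state <{BASE}state/{state}> .")


def convert(csv_text):
    """Convert CSV text (assumed columns YEAR, MONTH, STATE) to Turtle."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    states = sorted({r["STATE"] for r in rows})
    parts = [PREFIXES]
    parts += [state_concept(s) for s in states]
    parts += [row_to_turtle(r, i) for i, r in enumerate(rows, start=1)]
    return "\n\n".join(parts)


if __name__ == "__main__":
    sample = "YEAR,MONTH,STATE\n2001,1,AK\n2001,2,AL\n"
    print(convert(sample))
```

Where a well-known URI already exists for a state, the ex:state object
could point at that URI instead of a locally minted concept.
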
The Data Cube vocabulary is very much a work in progress but I think we
have now closed out all the main open design questions, have a draft
vocab and aim to get the initial documentation to a usable state over
the coming few weeks.

Feel free to ping me off line if you would like to follow up on this.

Dave

[1] http://semanticweb.org/wiki/Scovo
[2] http://code.google.com/p/publishing-statistical-data/
[3] http://sdmx.org/
[4] http://www.odaf.org/blog/?p=39
[5] http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html