Re: Best Practices for Converting CSV into LOD?
Thanks to everyone who responded to my questions (both on this list and privately). One thing I realized is that sending out my example(s) as RDF snippets that lacked dereferenceable URIs probably wasn't a good idea (since one of my core goals is to produce not just good RDF, but good RDF which is LOD-friendly). So I have fleshed out a couple of examples to incorporate some of the suggestions I've received and put them up as live LOD. (They're still very much works in progress, though, so I do expect they'll change or disappear soon.) They're available at:

http://en.openei.org/lod/resource/datasets/43
http://en.openei.org/lod/resource/datasets/43b

I've put these two samples together to try to clarify my third question (about making LOD browseable), which is still the murkiest to me. In the 43 example, the data is crafted to have a hierarchical path through the data (state -> state/year -> state/year/month -> state/year/month/type_of_producer -> state/year/month/type_of_producer/energy_source). In the 43b example, no such attempt is made. Instead, 43b links each leaf data node back to the root of the dataset (/lod/resource/datasets/43b) via a dcterms:isReferencedBy predicate and to a URI for the associated state (e.g. /lod/resource/datasets/43b/AK) via an openei:datasets/43b/terms/state predicate. (This state URI is then linked to DBpedia's state URI via a skos:closeMatch predicate.) Thus, the 43b example would seem to be less amenable to HTML-based browsing. For example, note how these pages end up being overwhelming (and truncated):

http://en.openei.org/lod/resource/datasets/43b
http://en.openei.org/lod/resource/datasets/43b/AK

So what I'm still wondering is whether striving for a non-overwhelming HTML browsing experience for a given set of LOD is a worthwhile goal. And, if so, is the 43 example taking a reasonable path to achieve that goal? Or is there some better way?
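For concreteness, the 43-style hierarchy described above could be generated mechanically from the dimension values of each CSV row. The following is only a sketch: the base URI matches the example dataset, but the use of skos:related for level-to-level links (taken from the earlier attempt in this thread) and all function names are illustrative, not how the OpenEI service actually works.

```python
# Sketch of generating the 43-style hierarchical URI chain from one row's
# dimension values, plus N-Triples linking each level to the next.
# Predicate choice (skos:related) is an assumption from the thread.
BASE = "http://en.openei.org/lod/resource/datasets/43"
SKOS_RELATED = "http://www.w3.org/2004/02/skos/core#related"

def hierarchy_uris(state, year, month, producer, source):
    """Return the chain of URIs from state down to energy source."""
    uris, path = [], BASE
    for segment in (state, year, month, producer, source):
        path = f"{path}/{segment}"
        uris.append(path)
    return uris

def linking_triples(uris):
    """Emit one N-Triples line linking each level to the next one down."""
    return [f"<{a}> <{SKOS_RELATED}> <{b}> ." for a, b in zip(uris, uris[1:])]

uris = hierarchy_uris("AK", "2001", "01", "total_electric_power_industry", "coal")
for t in linking_triples(uris):
    print(t)
```

Each row then contributes four linking triples, one per step of the path, which is what makes level-at-a-time HTML browsing possible in the 43 example.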
Thanks,
Jamey

On 8/9/10 10:37 AM, Jamey Wood jamey.w...@nrel.gov wrote:

Are there any established best practices for converting CSV data into LOD-friendly RDF? For example, I would like to produce an LOD-friendly RDF version of the "2001 - Present Net Generation by State by Type of Producer by Energy Source" CSV data at:

http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html

I'm attaching a sample of a first stab at this. Questions I'm running into include the following:

1. Should one try to convert primitive data types (particularly strings) into URI references? Or just leave them as primitives? Or perhaps provide both (with separate predicate names)? For example, the sample EIA data I reference has two-letter state abbreviations in one column. Should those be left alone or converted into URIs?

2. Should one merge separate columns from the original data in order to align to well-known RDF types? For example, the sample EIA data has separate Year and Month columns. Should those be merged in the RDF version so that an xs:gYearMonth type can be used?

3. Should one attempt to introduce some sort of hierarchical structure (to make the LOD more browseable)? The skos:related triples in the attached sample are an initial attempt to do that. Is this a good idea? If so, is that a reasonable predicate to use? If it is a reasonable thing to do, we would presumably craft these triples so that one could navigate through the entire LOD (e.g. state -> state/year -> state/year/month -> state/year/month/typeOfProducer -> state/year/month/typeOfProducer/energySource).

4. Any other considerations that I'm overlooking?

Thanks,
Jamey
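One possible answer to questions 1 and 2 above is "provide both": keep the raw state code as a literal and also link to a URI for the state, and merge Year + Month into a single xsd:gYearMonth literal. The sketch below shows that shape on a made-up row; the example.org namespace, predicate names, and subject URI are all invented for illustration, not taken from the EIA data or OpenEI.

```python
import csv
import io

# Illustrative CSV-row-to-N-Triples conversion. All URIs here are
# placeholders; only the xsd:gYearMonth lexical form (YYYY-MM) is standard.
EX = "http://example.org/eia/terms/"
XSD = "http://www.w3.org/2001/XMLSchema#"

def row_to_triples(subject, row):
    triples = []
    # Question 1: keep the primitive AND a URI, under distinct predicates.
    triples.append(f'<{subject}> <{EX}stateCode> "{row["State"]}" .')
    triples.append(f'<{subject}> <{EX}state> <http://example.org/eia/state/{row["State"]}> .')
    # Question 2: merge Year and Month into one xsd:gYearMonth-typed literal.
    ym = f'{int(row["Year"]):04d}-{int(row["Month"]):02d}'
    triples.append(f'<{subject}> <{EX}period> "{ym}"^^<{XSD}gYearMonth> .')
    triples.append(f'<{subject}> <{EX}generation> "{row["Megawatthours"]}"^^<{XSD}decimal> .')
    return triples

sample = "State,Year,Month,Megawatthours\nAK,2001,1,123456\n"
row = next(csv.DictReader(io.StringIO(sample)))
for t in row_to_triples("http://example.org/eia/obs/1", row):
    print(t)
```

Keeping both forms costs a few extra triples per row but lets literal-oriented and link-oriented consumers each query the shape they prefer.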
Re: Best Practices for Converting CSV into LOD?
Hello,

On 13/08/2010, at 16:46, Wood, Jamey wrote:

So what I'm still wondering is whether striving for a non-overwhelming HTML browsing experience for a given set of LOD is a worthwhile goal.

My view on this is that these are really two separate issues. One is having a good data structure that can be used for several purposes; another is being able to browse via HTML. By this I take it that you mean navigating the dataset as if it were a hypertext, HTML-style (one link at a time). The database community long ago established the concept of external schemas (views) as the way to allow special-purpose access to a common logical database schema. Browsing should be regarded as one of those special-purpose uses. I think it is not practical to expect that applications can be built by directly browsing the raw RDF structure. Direct browsing of the raw RDF would only be meaningful for some developers who may want to understand and find out what's in there, and even this is debatable...
I would argue that one should have a special-purpose view over the raw RDF data that makes it more amenable to HTML-style (i.e. hypertext) browsing. The RDF structure itself should not be particularly biased towards browsing.

My 2c...

---
Daniel Schwabe
Lab. Tecweb, Dept. de Informatica, PUC-Rio
R. M. de S. Vicente, 225
Rio de Janeiro, RJ 22453-900, Brasil
Tel: +55-21-3527 1500 r. 4356
Fax: +55-21-3527 1530
http://www.inf.puc-rio.br/~dschwabe
Re: Best Practices for Converting CSV into LOD?
On Mon, 2010-08-09 at 10:37 -0600, Wood, Jamey wrote:

Are there any established best practices for converting CSV data into LOD-friendly RDF? For example, I would like to produce an LOD-friendly RDF version of the "2001 - Present Net Generation by State by Type of Producer by Energy Source" CSV data at:

http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html

I'm attaching a sample of a first stab at this. Questions I'm running into include the following:

1. Should one try to convert primitive data types (particularly strings) into URI references? Or just leave them as primitives? Or perhaps provide both (with separate predicate names)? For example, the sample EIA data I reference has two-letter state abbreviations in one column. Should those be left alone or converted into URIs?

If the code corresponds to a concept which has a useful URI to link to, then yes. In cases where the string is a code but there isn't an existing URI scheme, one approach is to create a set of SKOS concepts to represent the codes, recording the original code string using skos:notation.

2. Should one merge separate columns from the original data in order to align to well-known RDF types? For example, the sample EIA data has separate Year and Month columns. Should those be merged in the RDF version so that an xs:gYearMonth type can be used?

Probably. Merging is useful if you are going to query via the merged form. In a case like year/month there could be an argument for keeping the separate forms as well, to enable you to query by month, independent of year.

3. Should one attempt to introduce some sort of hierarchical structure (to make the LOD more browseable)? The skos:related triples in the attached sample are an initial attempt to do that. Is this a good idea? If so, is that a reasonable predicate to use? If it is a reasonable thing to do, we would presumably craft these triples so that one could navigate through the entire LOD (e.g. state -> state/year -> state/year/month -> state/year/month/typeOfProducer -> state/year/month/typeOfProducer/energySource).

Another approach is to use one of the statistics-in-RDF representations so that you can slice by the dimensions in the data. There is the Scovo vocabulary [1]. Recently a group of us have been working on an updated vocabulary for statistics [2] based on the SDMX standard [3]. At a recent Open Data Foundation workshop [4] we agreed to partition the SDMX-in-RDF work into a simple Data Cube vocabulary [5] and extension vocabularies to support particular domains such as aggregate statistics (SDMX) and maybe eventually micro-data (DDI). The Data Cube vocabulary is very much a work in progress, but I think we have now closed out all the main open design questions, have a draft vocabulary, and aim to get the initial documentation to a usable state over the coming few weeks. Feel free to ping me offline if you would like to follow up on this.

Dave

[1] http://semanticweb.org/wiki/Scovo
[2] http://code.google.com/p/publishing-statistical-data/
[3] http://sdmx.org/
[4] http://www.odaf.org/blog/?p=39
[5] http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html
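The SKOS suggestion above (mint concepts for code columns that have no existing URI scheme, recording the raw code with skos:notation) might look roughly like this. The concept-scheme URI and labels are made-up placeholders; only the SKOS and RDF namespace URIs are real.

```python
# Sketch of minting SKOS concepts for a code column (e.g. two-letter state
# abbreviations), keeping the original code string in skos:notation.
# The example.org scheme URI is a placeholder.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
SCHEME = "http://example.org/eia/codes/state"

def code_to_concept(code, label):
    """Return N-Triples lines describing one code as a skos:Concept."""
    uri = f"{SCHEME}/{code}"
    return [
        f'<{uri}> <{RDF_TYPE}> <{SKOS}Concept> .',
        f'<{uri}> <{SKOS}inScheme> <{SCHEME}> .',
        f'<{uri}> <{SKOS}notation> "{code}" .',
        f'<{uri}> <{SKOS}prefLabel> "{label}"@en .',
    ]

for t in code_to_concept("AK", "Alaska"):
    print(t)
```

Data rows can then point at the concept URI, while consumers that need the original two-letter code can still recover it from skos:notation.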
Re: Best Practices for Converting CSV into LOD?
I gave this a shot in a previous version of Hyena. By prepending one or more special rows, one could control how the columns were converted: what predicate to use, and how to convert the content. If a column specification was missing, defaults were used. There were several options: if a cell value was similar to a tag, resources could be auto-created (the cell value became the resource label; existing resources were looked up via their labels). One could also split a cell value prior to processing it (to account for multiple values per column). Creating meaningful URIs for predicates and rows (resources) is especially important, but tricky. Ideally, import would work bi-directionally (and idempotently): changes you make in RDF can be written back to the spreadsheet, and changes in the spreadsheet can be reimported without causing chaos.

Even though my solution worked OK and I do not see how it could be done better, I was not completely happy with it, because writing this kind of CSV/RDF mapping is beyond the capabilities of normal end users. One could automatically create URIs for predicates from column titles, but as for reliable URIs (primary keys), I am at a loss. So it seems like one is stuck with letting an expert write an import specification and hiding it from end users. Then my solution of embedding such a spec in the spreadsheet should be re-thought. And it seems like a simple script might be a better solution than a complex specification language that can handle all the special cases. For example, I hadn't even thought about two cells contributing to the same literal. Maybe a JVM-hosted scripting language (such as Jython) could be used, but even raw Java is not so bad and has the advantage of superior tool support.

This is important stuff, as many people have all kinds of lists in Excel, which would make great LOD data. It also shows that spreadsheets are hard to beat when it comes to getting started quickly: you just enter your data. Should someone come up with a simpler way of translating CSV data, that might translate to general usability improvements for entering LOD data.

On Aug 9, 2010, at 18:37 , Wood, Jamey wrote:

Are there any established best practices for converting CSV data into LOD-friendly RDF? [...]

generation_state_mon.rdf

--
Dr. Axel Rauschmayer
axel.rauschma...@ifi.lmu.de
http://hypergraphs.de/
### Hyena: organize your ideas, free at hypergraphs.de/hyena/
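The spec-row approach described above (an extra header row carrying per-column conversion instructions, with defaults derived from the column title) can be sketched as follows. The syntax of the spec row, the fallback namespace, and the row-URI scheme are all invented here; Hyena's actual format differed.

```python
import csv
import io

# Rough sketch of a spec-row-driven CSV importer: row 1 holds column titles,
# row 2 optionally holds a predicate URI per column, and empty spec cells
# fall back to a default predicate derived from the title. All names and
# URI conventions here are illustrative, not Hyena's real format.
DEFAULT_NS = "http://example.org/terms/"

def import_with_spec(text, subject_base):
    rows = list(csv.reader(io.StringIO(text)))
    titles, spec, data = rows[0], rows[1], rows[2:]
    # Empty spec cell -> mint a predicate URI from the column title.
    preds = [s or DEFAULT_NS + t.replace(" ", "_") for t, s in zip(titles, spec)]
    triples = []
    for i, row in enumerate(data):
        subj = f"{subject_base}/{i}"  # row index as a (weak) primary key
        for pred, value in zip(preds, row):
            if value:
                triples.append(f'<{subj}> <{pred}> "{value}" .')
    return triples

sample = ("State,Net Generation\n"
          "http://example.org/terms/state,\n"  # spec row; 2nd cell empty -> default
          "AK,123456\n")
for t in import_with_spec(sample, "http://example.org/rows"):
    print(t)
```

Note the weakness Axel points out: using the row index as the subject key is exactly the unreliable-primary-key problem, since inserting a row in the spreadsheet shifts every subject URI below it.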
Re: Best Practices for Converting CSV into LOD?
You may want to look at irON [1] and its commON [2] format. The specs provide guidance on our approach to your questions. We use it all the time (as do our clients) and it works great. Fred Giasson also just completed a dataset append Web service that integrates with it for incremental updates.

Thanks, Mike

[1] http://openstructs.org/iron
[2] http://techwiki.openstructs.org/index.php/CommON_Case_Study

On 8/9/2010 2:12 PM, Axel Rauschmayer wrote:

I gave this a shot in a previous version of Hyena. By prepending one or more special rows, one could control how the columns were converted: what predicate to use, how to convert the content. [...]