Re: Best Practices for Converting CSV into LOD?

2010-08-13 Thread Wood, Jamey
Thanks to everyone who responded to my questions (both on this list and 
privately).  One thing I realized is that sending out my example(s) as RDF 
snippets that lacked dereferenceable URIs probably wasn't a good idea (since 
one of my core goals is to produce not just good RDF, but good RDF which is 
LOD-friendly).

So I have fleshed out a couple of examples to incorporate some of the 
suggestions I've received and put them up as live LOD.  (They're still very 
much works in progress, though, so I do expect they'll change or disappear soon.)

They're available at:

  http://en.openei.org/lod/resource/datasets/43
  http://en.openei.org/lod/resource/datasets/43b

I've put these two samples together to try to clarify my third question (about 
making LOD browseable), which is still the murkiest to me.  In the 43 example, 
the data is crafted to have a hierarchical path through the data (state -> 
state/year -> state/year/month -> state/year/month/type_of_producer -> 
state/year/month/type_of_producer/energy_source).  In the 43b example, no 
such attempt is made.  Instead, 43b links each leaf data node back to the 
root of the dataset (/lod/resource/datasets/43b) via a 
dcterms:isReferencedBy predicate and to a URI for the associated state (e.g. 
/lod/resource/datasets/43b/AK) via an openei:datasets/43b/terms/state 
predicate.  (This state URI is then linked to DBpedia's state URI via a 
skos:closeMatch predicate.)
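For concreteness, the two linking styles can be sketched as plain string-tuple "triples" (a minimal sketch, no RDF library; the leaf URI and exact node URIs are hypothetical, only the link structure matters):

```python
# Sketch of the two linking styles: 43 is hierarchical, 43b is flat.
# Triples are plain (subject, predicate, object) string tuples.

BASE = "http://en.openei.org/lod/resource/datasets"

# Style 43: each node links one level down, forming a browseable path.
g43 = {
    (f"{BASE}/43/AK",         "skos:related", f"{BASE}/43/AK/2001"),
    (f"{BASE}/43/AK/2001",    "skos:related", f"{BASE}/43/AK/2001/01"),
    (f"{BASE}/43/AK/2001/01", "skos:related",
     f"{BASE}/43/AK/2001/01/total_electric_power_industry"),
}

# Style 43b: every leaf points back to the dataset root and to its state
# resource; the state resource is aligned with DBpedia.
leaf = f"{BASE}/43b/row_0001"   # hypothetical leaf URI
g43b = {
    (leaf, "dcterms:isReferencedBy", f"{BASE}/43b"),
    (leaf, "openei:datasets/43b/terms/state", f"{BASE}/43b/AK"),
    (f"{BASE}/43b/AK", "skos:closeMatch",
     "http://dbpedia.org/resource/Alaska"),
}
```

In style 43b every leaf adds an inbound link to the root and to the state resource, which is why those two HTML pages accumulate thousands of triples each.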

Thus, the 43b example would seem to be less amenable to HTML-based browsing.  
For example, note how these pages end up being overwhelming (and truncated):

  http://en.openei.org/lod/resource/datasets/43b
  http://en.openei.org/lod/resource/datasets/43b/AK

So what I'm still wondering is whether striving for a non-overwhelming HTML 
browsing experience for a given set of LOD is a worthwhile goal.  And, if so, 
is the 43 example taking a reasonable path to achieve that goal?  Or is there 
some better way?

Thanks,
Jamey

On 8/9/10 10:37 AM, Jamey Wood jamey.w...@nrel.gov wrote:

Are there any established best practices for converting CSV data into 
LOD-friendly RDF?  For example, I would like to produce an LOD-friendly RDF 
version of the "2001 - Present Net Generation by State by Type of Producer by 
Energy Source" CSV data at:

  http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html

I'm attaching a sample of a first stab at this.  Questions I'm running into 
include the following:


 1.  Should one try to convert primitive data types (particularly strings) into 
URI references?  Or just leave them as primitives?  Or perhaps provide both 
(with separate predicate names)?  For example, the sample EIA data I reference 
has two-letter state abbreviations in one column.  Should those be left alone 
or converted into URIs?
 2.  Should one merge separate columns from the original data in order to align 
to well-known RDF types?  For example, the sample EIA data has separate "Year" 
and "Month" columns.  Should those be merged in the RDF version so that an 
xs:gYearMonth type can be used?
 3.  Should one attempt to introduce some sort of hierarchical structure (to 
make the LOD more browseable)?  The skos:related triples in the attached 
sample are an initial attempt to do that.  Is this a good idea?  If so, is that 
a reasonable predicate to use?  If it is a reasonable thing to do, we would 
presumably craft these triples so that one could navigate through the entire 
LOD (e.g. state -> state/year -> state/year/month -> 
state/year/month/typeOfProducer -> 
state/year/month/typeOfProducer/energySource).
 4.  Any other considerations that I'm overlooking?

Thanks,
Jamey




Re: Best Practices for Converting CSV into LOD?

2010-08-13 Thread Daniel Schwabe
Hello,

On 13/08/2010, at 16:46, Wood, Jamey wrote:

 I've put these two samples together to try to clarify my third question 
 (about making LOD browseable), which is still the murkiest to me.  In the 43 
 example, the data is crafted to have a hierarchical path through the data 
 (state -> state/year -> state/year/month -> 
 state/year/month/type_of_producer -> 
 state/year/month/type_of_producer/energy_source).  In the 43b example, no 
 such attempt is made.  Instead, 43b links each leaf data node back to the 
 root of the dataset (/lod/resource/datasets/43b) via a 
 dcterms:isReferencedBy predicate and to a URI for the associated state 
 (e.g. /lod/resource/datasets/43b/AK) via an 
 openei:datasets/43b/terms/state predicate.  (This state URI is then linked 
 to DBpedia's state URI via a skos:closeMatch predicate.)
 
 Thus, the 43b example would seem to be less amenable to HTML-based browsing.  
 For example, note how these pages end up being overwhelming (and truncated):
 
  http://en.openei.org/lod/resource/datasets/43b
  http://en.openei.org/lod/resource/datasets/43b/AK
 
 So what I'm still wondering is whether striving for a non-overwhelming HTML 
 browsing experience for a given set of LOD is a worthwhile goal.  

My view on this is that these are really two separate issues. One is having a 
good data structure that can be used for several purposes; the other is being 
able to browse the data as HTML. By this I take it that you mean navigating the 
dataset as if it were a hypertext, HTML-style (one link at a time).

The database community long ago established the concept of external schemas 
(views) as the way to allow special-purpose access to a common logical 
database schema. Browsing should be regarded as one of those special-purpose 
uses. I think it is not practical to expect that applications can be built by 
directly browsing the raw RDF structure.

Direct browsing of the raw RDF would only be meaningful for some developers 
who may want to understand and find out what's in there, and even this is 
debatable...

I would argue that one should have a special-purpose view over the raw RDF 
data that makes it more amenable to HTML-style (i.e. hypertext) browsing. The 
RDF structure itself should not be particularly biased towards browsing.
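Such a browsing view could be as simple as a projection that keeps only navigation-friendly triples, leaving the bulk statistical data behind (a minimal sketch in plain Python with string-tuple triples; in practice this might be a SPARQL CONSTRUCT query or a rendering layer, and the predicate names here are hypothetical):

```python
# Sketch: a special-purpose "view" over raw triples for HTML browsing.
# Only navigation predicates survive; bulk data triples are filtered out.

def browsing_view(triples):
    """Project out the triples an HTML browser page would render."""
    NAV_PREDICATES = {"skos:related", "rdfs:label"}
    return {t for t in triples if t[1] in NAV_PREDICATES}

raw = {
    ("ds/43/AK", "skos:related", "ds/43/AK/2001"),       # navigation link
    ("ds/43/AK", "rdfs:label",   '"Alaska"'),            # display label
    ("ds/43/AK/2001/01/row", "eia:generation", '"123"'), # bulk data
}
view = browsing_view(raw)   # keeps the two navigation triples only
```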

My 2c...
---

Daniel Schwabe
Lab. Tecweb, Dept. de Informatica, PUC-Rio
R. M. de S. Vicente, 225
Rio de Janeiro, RJ 22453-900, Brasil
Tel: +55-21-3527 1500 r. 4356
Fax: +55-21-3527 1530
http://www.inf.puc-rio.br/~dschwabe




Re: Best Practices for Converting CSV into LOD?

2010-08-10 Thread Dave Reynolds
On Mon, 2010-08-09 at 10:37 -0600, Wood, Jamey wrote: 
 Are there any established best practices for converting CSV data into 
 LOD-friendly RDF?  For example, I would like to produce an LOD-friendly RDF 
 version of the "2001 - Present Net Generation by State by Type of Producer by 
 Energy Source" CSV data at:
 
   http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html
 
 I'm attaching a sample of a first stab at this.  Questions I'm running into 
 include the following:
 
 
  1.  Should one try to convert primitive data types (particularly strings) 
 into URI references?  Or just leave them as primitives?  Or perhaps provide 
 both (with separate predicate names)?  For example, the sample EIA data I 
 reference has two-letter state abbreviations in one column.  Should those be 
 left alone or converted into URIs?

If the code corresponds to a concept which has a useful URI to link to
then yes. 

In cases where the string is a code but there isn't an existing URI
scheme then one approach is to create a set of SKOS concepts to
represent the codes, recording the original code string using
skos:notation.
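
The SKOS-concepts-for-codes approach can be sketched like this (plain Python with string-tuple triples; the base URI and labels are hypothetical):

```python
# Sketch: mint one SKOS concept per state code, keeping the original
# two-letter string recoverable via skos:notation.

BASE = "http://example.org/id/state"   # hypothetical URI scheme

def state_concept(code: str, label: str):
    """Triples (as string tuples) describing one state-code concept."""
    uri = f"{BASE}/{code}"
    return {
        (uri, "rdf:type",       "skos:Concept"),
        (uri, "skos:notation",  f'"{code}"'),      # original CSV value
        (uri, "skos:prefLabel", f'"{label}"@en'),  # human-readable label
    }

ak = state_concept("AK", "Alaska")
```

A consumer can then round-trip back to the CSV value via skos:notation, while linking (e.g. skos:closeMatch to DBpedia) hangs off the concept URI.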

  2.  Should one merge separate columns from the original data in order to 
 align to well-known RDF types?  For example, the sample EIA data has separate 
 "Year" and "Month" columns.  Should those be merged in the RDF version so 
 that an xs:gYearMonth type can be used?

Probably. Merging is useful if you are going to query via the merged
form. In a case like year/month there could be an argument for keeping
the separate forms as well, so you can still query by month
independent of year.
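
The merge itself is just producing the xsd:gYearMonth lexical form from the two columns (a minimal sketch; the sample row values are illustrative):

```python
# Sketch: merge separate Year and Month CSV columns into the
# xsd:gYearMonth lexical form "YYYY-MM", zero-padding the month.

def to_gyearmonth(year: str, month: str) -> str:
    """Combine e.g. ("2001", "3") into "2001-03"."""
    return f"{int(year):04d}-{int(month):02d}"

row = {"Year": "2001", "Month": "3"}          # illustrative CSV row
merged = to_gyearmonth(row["Year"], row["Month"])   # "2001-03"
```

The merged literal would then be typed as xsd:gYearMonth in the output RDF, with the original Year/Month values optionally kept under separate predicates.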

 3.  Should one attempt to introduce some sort of hierarchical structure (to 
 make the LOD more browseable)?  The skos:related triples in the attached 
 sample are an initial attempt to do that.  Is this a good idea?  If so, is 
 that a reasonable predicate to use?  If it is a reasonable thing to do, we 
 would presumably craft these triples so that one could navigate through the 
 entire LOD (e.g. state -> state/year -> state/year/month -> 
 state/year/month/typeOfProducer -> 
 state/year/month/typeOfProducer/energySource).

Another approach is to use one of the statistics-in-RDF representations
so that you can slice by the dimensions in the data.

There is the Scovo vocabulary [1]. 

Recently a group of us have been working on an updated vocabulary for
statistics [2] based on the SDMX standard [3]. At a recent Open Data
Foundation workshop [4] we agreed to partition the SDMX-in-RDF work into
a simple "Data Cube" vocabulary [5] and extension vocabularies to
support particular domains such as aggregate statistics (SDMX) and maybe
eventually micro-data (DDI).

The Data Cube vocabulary is very much a work in progress but I think we
have now closed out all the main open design questions, have a draft
vocab and aim to get the initial documentation to a usable state over
the coming few weeks.
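
To illustrate the cube idea, one EIA row might become a single observation carrying all its dimensions, which can then be sliced along any of them (a hypothetical sketch; the qb:/eia: property names and URIs below are illustrative, not taken from the draft vocabulary):

```python
# Hypothetical sketch of one EIA row as a Data Cube style observation,
# with triples as plain string tuples.

obs = "http://example.org/data/obs/AK-2001-01"   # hypothetical obs URI
observation = {
    (obs, "rdf:type",         "qb:Observation"),
    # Dimensions: each key CSV column becomes a dimension property.
    (obs, "eia:state",        "http://example.org/id/state/AK"),
    (obs, "eia:period",       '"2001-01"^^xsd:gYearMonth'),
    (obs, "eia:producerType", "http://example.org/id/producer/total"),
    (obs, "eia:energySource", "http://example.org/id/source/coal"),
    # Measure: the net-generation value from the row (illustrative).
    (obs, "eia:generation",   '"123456"^^xsd:integer'),
}

def slice_by(triples, predicate, value):
    """All observation URIs whose given dimension has the given value."""
    return {s for (s, p, o) in triples if p == predicate and o == value}
```

Slicing by any dimension (state, period, producer type, energy source) is then uniform, which is the main advantage over a fixed hierarchical path.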

Feel free to ping me off line if you would like to follow up on this.

Dave

[1] http://semanticweb.org/wiki/Scovo
[2] http://code.google.com/p/publishing-statistical-data/
[3] http://sdmx.org/
[4] http://www.odaf.org/blog/?p=39
[5]
http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html







Re: Best Practices for Converting CSV into LOD?

2010-08-09 Thread Axel Rauschmayer
I gave this a shot in a previous version of Hyena. By prepending one or more 
special rows, one could control how the columns were converted: what predicate 
to use, how to convert the content. If a column specification was missing, 
defaults were used. There were several options: If a cell value was similar to 
a tag, resources could be auto-created (the cell value became the resource 
label, existing resources were looked up via their labels). One could also 
split a cell value prior to processing it (to account for multiple values per 
column).

Creating meaningful URIs for predicates and rows (resources) is especially 
important, but tricky. Ideally, import would work bi-directionally (and 
idempotently): Changes you make in RDF can be written back to the spreadsheet, 
changes in the spreadsheet can be reimported without causing chaos.
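
One ingredient of that idempotent round-trip is deriving row URIs deterministically from the natural-key columns, so re-importing the same CSV regenerates identical URIs (a minimal sketch; the base URI and key columns are hypothetical):

```python
# Sketch: deterministic row URIs from natural-key columns, so that
# re-importing the same CSV yields the same URIs every time.
import hashlib

def row_uri(base: str, key_cols: dict) -> str:
    # Sort key names so column order in the file cannot change the URI.
    key = "|".join(f"{k}={key_cols[k]}" for k in sorted(key_cols))
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
    return f"{base}/row/{digest}"

u1 = row_uri("http://example.org/ds/43",
             {"state": "AK", "year": "2001", "month": "1"})
u2 = row_uri("http://example.org/ds/43",
             {"month": "1", "state": "AK", "year": "2001"})
assert u1 == u2   # stable under column reordering
```

This only works when the key columns really are a primary key, which is exactly the part that is hard to automate for arbitrary spreadsheets.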

Even though my solution worked OK and I do not see how it could be done better, 
I was not completely happy with it, because writing this kind of CSV/RDF 
mapping is beyond the capabilities of normal end users. One could automatically 
create URIs for predicates from column titles, but as for reliable URIs 
(primary keys), I am at a loss. So it seems like one is stuck with letting an 
expert write an import specification and hiding it from end users. Then my 
solution of embedding such a spec in the spreadsheet should be re-thought. And 
it seems like a simple script might be a better solution than a complex 
specification language that can handle all the special cases. For example, I 
hadn’t even thought about two cells contributing to the same literal. Maybe a 
JVM-hosted scripting language (such as Jython) could be used, but even raw Java 
is not so bad and has the advantage of superior tool support.

This is important stuff, as many people have all kinds of lists in 
Excel---which would make great LOD data. It also shows that spreadsheets are 
hard to beat when it comes to getting started quickly: you just enter your 
data. Should someone come up with a simpler way of translating CSV data, 
that might translate to general usability improvements for entering LOD data.

On Aug 9, 2010, at 18:37 , Wood, Jamey wrote:

 Are there any established best practices for converting CSV data into 
 LOD-friendly RDF?  For example, I would like to produce an LOD-friendly RDF 
 version of the "2001 - Present Net Generation by State by Type of Producer by 
 Energy Source" CSV data at:
 
  http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html
 
 I'm attaching a sample of a first stab at this.  Questions I'm running into 
 include the following:
 
 
 1.  Should one try to convert primitive data types (particularly strings) 
 into URI references?  Or just leave them as primitives?  Or perhaps provide 
 both (with separate predicate names)?  For example, the sample EIA data I 
 reference has two-letter state abbreviations in one column.  Should those be 
 left alone or converted into URIs?
 2.  Should one merge separate columns from the original data in order to 
 align to well-known RDF types?  For example, the sample EIA data has separate 
 "Year" and "Month" columns.  Should those be merged in the RDF version so 
 that an xs:gYearMonth type can be used?
 3.  Should one attempt to introduce some sort of hierarchical structure (to 
 make the LOD more browseable)?  The skos:related triples in the attached 
 sample are an initial attempt to do that.  Is this a good idea?  If so, is 
 that a reasonable predicate to use?  If it is a reasonable thing to do, we 
 would presumably craft these triples so that one could navigate through the 
 entire LOD (e.g. state -> state/year -> state/year/month -> 
 state/year/month/typeOfProducer -> 
 state/year/month/typeOfProducer/energySource).
 4.  Any other considerations that I'm overlooking?
 
 Thanks,
 Jamey
 generation_state_mon.rdf

-- 
Dr. Axel Rauschmayer
axel.rauschma...@ifi.lmu.de
http://hypergraphs.de/
### Hyena: organize your ideas, free at hypergraphs.de/hyena/






Re: Best Practices for Converting CSV into LOD?

2010-08-09 Thread Mike Bergman
You may want to look at irON [1] and its commON [2] format.  The 
specs provide guidance on our approach to your questions.

We use it all the time (as do our clients) and it works great. 
Fred Giasson also just completed a dataset-append Web service 
that integrates with it for incremental updates.


Thanks, Mike

[1] http://openstructs.org/iron
[2] http://techwiki.openstructs.org/index.php/CommON_Case_Study

On 8/9/2010 2:12 PM, Axel Rauschmayer wrote:

I gave this a shot in a previous version of Hyena. By prepending one or more 
special rows, one could control how the columns were converted: what predicate 
to use, how to convert the content. If a column specification was missing, 
defaults were used. There were several options: If a cell value was similar to 
a tag, resources could be auto-created (the cell value became the resource 
label, existing resources were looked up via their labels). One could also 
split a cell value prior to processing it (to account for multiple values per 
column).

Creating meaningful URIs for predicates and rows (resources) is especially 
important, but tricky. Ideally, import would work bi-directionally (and 
idempotently): Changes you make in RDF can be written back to the spreadsheet, 
changes in the spreadsheet can be reimported without causing chaos.

Even though my solution worked OK and I do not see how it could be done better, I was not 
completely happy with it, because writing this kind of CSV/RDF mapping is beyond the 
capabilities of normal end users. One could automatically create URIs for predicates from 
column titles, but as for reliable URIs (primary keys), I am at a loss. So it 
seems like one is stuck with letting an expert write an import specification and hiding 
it from end users. Then my solution of embedding such a spec in the spreadsheet should be 
re-thought. And it seems like a simple script might be a better solution than a complex 
specification language that can handle all the special cases. For example, I hadn’t even 
thought about two cells contributing to the same literal. Maybe a JVM-hosted scripting 
language (such as Jython) could be used, but even raw Java is not so bad and has the 
advantage of superior tool support.

This is important stuff, as many people have all kinds of lists in 
Excel---which would make great LOD data. It also shows that spreadsheets are 
hard to beat when it comes to getting started quickly: You just enter your 
data. Should someone come up with a simpler way of translating CSV data then 
that might translate to general usability improvements for entering LOD data.

On Aug 9, 2010, at 18:37 , Wood, Jamey wrote:


Are there any established best practices for converting CSV data into LOD-friendly RDF?  
For example, I would like to produce an LOD-friendly RDF version of the "2001 - 
Present Net Generation by State by Type of Producer by Energy Source" CSV data at:

  http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html

I'm attaching a sample of a first stab at this.  Questions I'm running into 
include the following:


1.  Should one try to convert primitive data types (particularly strings) into 
URI references?  Or just leave them as primitives?  Or perhaps provide both 
(with separate predicate names)?  For example, the sample EIA data I reference 
has two-letter state abbreviations in one column.  Should those be left alone 
or converted into URIs?
2.  Should one merge separate columns from the original data in order to align 
to well-known RDF types?  For example, the sample EIA data has separate "Year" 
and "Month" columns.  Should those be merged in the RDF version so that an 
xs:gYearMonth type can be used?
3.  Should one attempt to introduce some sort of hierarchical structure (to 
make the LOD more browseable)?  The skos:related triples in the attached 
sample are an initial attempt to do that.  Is this a good idea?  If so, is 
that a reasonable predicate to use?  If it is a reasonable thing to do, we 
would presumably craft these triples so that one could navigate through the 
entire LOD (e.g. state -> state/year -> state/year/month -> 
state/year/month/typeOfProducer -> 
state/year/month/typeOfProducer/energySource).
4.  Any other considerations that I'm overlooking?

Thanks,
Jamey
generation_state_mon.rdf