[ 
https://issues.apache.org/jira/browse/JENA-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935035#comment-13935035
 ] 

Andy Seaborne commented on JENA-625:
------------------------------------

h1. Data Tables for SPARQL

This project is about getting CSVs into a form that is amenable to SPARQL 
processing, and doing so in a way that is not specific to CSV files.  The 
project includes getting the right architecture in place for regular table 
shaped data.  The core abstraction is the PropertyTable.

A PropertyTable is a collection of data that is sufficiently regular in shape 
that it can be treated as a table.  That means each subject has a value for 
each one of the set of properties.  Irregularity in the form of missing values 
needs to be handled, but not multiple values for the same property.

With special storage, a PropertyTable 

* is more compact and more amenable to custom storage (e.g. a JSON document 
store)
* can have custom indexes on specific columns
* can guarantee access orders 

Providing these features is out of scope of the project but the architecture of 
the work must be mindful of these possibilities.

This project will involve a basic mapping of CSV to RDF using a fixed 
algorithm, including interpreting data as numbers or strings.  The project is 
not attempting a fully configurable, template- or rules-based translation of 
CSV to RDF.  The W3C CSV-WG is working on a standard version of that but will 
not deliver a sufficiently stable spec in the timeframe of the project.

(Background: see R2RML Direct mapping http://www.w3.org/TR/rdb-direct-mapping/)

h2. Example

Suppose we have a CSV file:
{noformat}
Town,Population
Southton,123000
Northville,654000
{noformat}

which has one header row and two data rows.

As RDF, this might be viewed as:

{noformat}
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

[ csv:row 1 ; :Town "Southton"   ; :Population 123000 ] .
[ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
{noformat}

or without the bnode abbreviation:
{noformat}
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

_:b0  csv:row 1 ;
      :Town "Southton" ;
      :Population 123000 .

_:b1  csv:row 2 ;
      :Town "Northville" ;
      :Population 654000 .
{noformat}


Each row models one "entity" (here, a population observation). There is a 
subject (a blank node) and one predicate-value pair for each cell of the row.  
Row numbers are added because row order can be important. 

Now the CSV file is viewed as a graph, so normal, unmodified SPARQL can be 
used.  Multiple CSV files can be loaded as multiple graphs in one dataset to 
give queries across different data sources.

{noformat}
# Towns with over 500,000 people.
SELECT ?townName ?pop {
  GRAPH <http://example/population> {
    ?x :Town ?townName ;
       :Population ?pop .
    FILTER(?pop > 500000)
  }
}
{noformat}

Like database views, this is the abstraction the application sees; it may be 
stored internally in a different way.

(Example, out of scope: if the property table is in an external store, e.g. a 
key-value store, Lucene, or a JSON-ish document store, then this BGP 
processing will be much faster.)

h2. Work Items

Notes:

* not all work items are the same length.
* work items do not have to be done in this order.

h3. Phase 1 : Architecture and System

h4. Parse CSV

Code to take a CSV file and emit tuples of RDF terms.

This needs to be 
* a robust process with good error messages.
* stream based

Jena already has a CSV parser ({{CSVInputIterator}}) specifically for SPARQL 
Results in CSV, but the [Apache Commons CSV|http://commons.apache.org/csv/] 
parser is more flexible.  However, it is a pull-parser.  

[This 
CSVParser|https://github.com/afs/AFS-Dev/blob/master/src/main/java/lib/CSVParser.java]
 is a push-parser, taking a stream destination for output (this would change 
from {{Sink<>}} to {{StreamRDF}}).  It can be incorporated into the project, 
with improvements, if it is the right processing style.

Lots of tests.
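To illustrate the push-parser style (as opposed to Commons CSV's pull style), here is a minimal sketch in plain Java.  The {{CsvPush}} name and the use of {{Consumer}} as a stand-in for a {{Sink<>}}/{{StreamRDF}} destination are assumptions for illustration; a real parser must also handle quoted fields, embedded newlines, and error reporting.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of a push-style CSV parser: rows are pushed to a sink
// callback rather than pulled by the caller.  Quoting rules, error
// messages and charset handling are omitted for brevity.
class CsvPush {
    static void parse(BufferedReader in, Consumer<List<String>> sink) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty())
                continue;
            // Naive field split: does not handle quoted fields containing commas.
            sink.accept(Arrays.asList(line.split(",", -1)));
        }
    }
}
```

The push shape matters because the destination (a {{PropertyTable}} builder, or an RDF stream) can consume rows as they arrive, keeping the whole pipeline streaming.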

h4. Design {{PropertyTable}}

A table of RDF terms (in Jena, RDF term is called a {{Node}}).

* {{PropertyTable}} interface (read-only)
** get by subject
** get by property-value
** based on ...
** (it's a document database!)
* {{PropertyTableBuilder}} interface
* {{PropertyTableImpl}} using ...
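One possible shape for these interfaces, as a sketch only (method names are illustrative, not a proposed final API, and {{String}} stands in for Jena's {{Node}} to keep the example self-contained):

```java
import java.util.*;

// Sketch of a read-only PropertyTable: rows keyed by subject, with a
// fixed set of property columns.  String stands in for Jena's Node.
interface PropertyTable {
    List<String> getColumns();                          // the fixed property set
    Map<String, String> getRow(String subject);         // one whole row, by subject
    Set<String> getMatchingSubjects(String property, String value);  // column lookup
}

// Trivial in-memory implementation, roughly what a PropertyTableBuilder
// might produce.  A real implementation could keep per-column indexes.
class SimplePropertyTable implements PropertyTable {
    private final List<String> columns;
    private final Map<String, Map<String, String>> rows = new LinkedHashMap<>();

    SimplePropertyTable(List<String> columns) { this.columns = columns; }

    void addRow(String subject, Map<String, String> values) { rows.put(subject, values); }

    public List<String> getColumns() { return columns; }
    public Map<String, String> getRow(String subject) { return rows.get(subject); }
    public Set<String> getMatchingSubjects(String property, String value) {
        Set<String> out = new LinkedHashSet<>();
        for (Map.Entry<String, Map<String, String>> e : rows.entrySet())
            if (value.equals(e.getValue().get(property)))
                out.add(e.getKey());
        return out;
    }
}
```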

h4. CSV to PropertyTable.

Using the CSV processor that generates tuples of RDF terms, add code to take a 
tabular CSV file and create a {{PropertyTable}} using {{PropertyTableBuilder}}.

To support testing, there should be an RDF-tuples-to-{{PropertyTable}} path as 
well.
See [RDF Tuple 
I/O|https://svn.apache.org/repos/asf/jena/Experimental/rdfpatch/src/main/java/org/apache/jena/riot/tio/].
  
h4. {{GraphPropertyTable}}

Implement the {{Graph}} interface (read-only) over a {{PropertyTable}}.
This subclasses {{GraphBase}} and implements {{find()}}.
{{find()}} needs to choose the access route based on the find arguments.

This will offer the {{PropertyTable}} interface, or an appropriate subset or 
variation, so that such a graph can be treated in a more table-like fashion.
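The dispatch in {{find()}} might look like the following sketch, with {{String}} standing in for {{Node}}/{{Triple}} and {{null}} playing the role of {{Node.ANY}} (the class and method names are illustrative, not the Jena API):

```java
import java.util.*;

// Sketch of how GraphPropertyTable.find(s, p, o) might choose an access
// path: a concrete subject means a single row lookup; otherwise scan the
// rows (or, with custom indexes, consult a column index).
class FindDispatch {
    final Map<String, Map<String, String>> rows;   // subject -> row (property -> value)
    FindDispatch(Map<String, Map<String, String>> rows) { this.rows = rows; }

    // Returns matches as "s p o" strings for illustration.
    List<String> find(String s, String p, String o) {
        List<String> out = new ArrayList<>();
        if (s != null) {                           // concrete subject: one row lookup
            Map<String, String> row = rows.get(s);
            if (row != null)
                matchRow(s, row, p, o, out);
        } else {                                   // wildcard subject: scan all rows
            for (Map.Entry<String, Map<String, String>> e : rows.entrySet())
                matchRow(e.getKey(), e.getValue(), p, o, out);
        }
        return out;
    }

    private void matchRow(String s, Map<String, String> row, String p, String o, List<String> out) {
        for (Map.Entry<String, String> cell : row.entrySet())
            if ((p == null || p.equals(cell.getKey())) && (o == null || o.equals(cell.getValue())))
                out.add(s + " " + cell.getKey() + " " + cell.getValue());
    }
}
```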

h4. Wire up

At this point, it should all work: just create a {{GraphPropertyTable}} from a 
CSV file and pass it to SPARQL in the normal way.  Better ways to access the 
property table come later.

h4. Documentation

h4. Announce

h3. Phase 2 : Additional Features

h4. RIOT reader for CSV files.

Add {{.csv}} to RIOT so that {{model.read}} will work.  Note that there is an 
impedance mismatch here: for RDF data the interface is "add triple", so the 
CSV reader will need to be aware of whether the destination is a 
{{GraphPropertyTable}} or a general {{Graph}}, in which case RDF triples are 
created for each row and inserted.

h4. CSV->RDF tool.

Using the parser framework developed earlier, convert CSV directly to 
formatted RDF syntax, with no intermediary graph or property table.  Create a 
command line tool that runs this so we have scalable CSV -> RDF for the 
direct-mapping style used.
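A minimal sketch of such a streaming converter, following the direct-mapping shape of the earlier example (one blank node per row, one triple per cell, plus a row number).  The vocabulary and base URIs are the placeholder ones from the example above, and number detection, escaping of literals, and URI-safe encoding of column names are all omitted:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;

// Sketch of a streaming CSV -> N-Triples converter: rows are converted
// and written as they are read, so no graph or property table is built.
class CsvToNt {
    static void convert(BufferedReader in, PrintWriter out) throws IOException {
        String header = in.readLine();
        if (header == null)
            return;
        String[] cols = header.split(",", -1);
        String line;
        int rowNum = 0;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty())
                continue;
            rowNum++;
            String subj = "_:row" + rowNum;
            // Row number triple (typed numeric literals omitted for brevity).
            out.println(subj + " <http://w3c/future-csv-vocab/row> \"" + rowNum + "\" .");
            String[] cells = line.split(",", -1);
            for (int i = 0; i < cols.length && i < cells.length; i++)
                out.println(subj + " <http://example/table#" + cols[i] + "> \"" + cells[i] + "\" .");
        }
        out.flush();
    }
}
```

Because input and output are both streams, this scales to CSV files far larger than memory.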

Schema driven and customizable conversion is out of scope for the project.

h4. Documentation

h4. OpExecutor to work with OpGraph/OpBGP.

While access from SPARQL via {{Graph.find}} will work, it's not ideal. This 
work item involves processing a SPARQL basic graph pattern (see 
{{OpExecutor.execute(OpBGP, ...)}}) when the target of the query is a 
{{GraphPropertyTable}}.  It will get a whole row, or rows, of table data and 
match the pattern against them to produce bindings.  
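The row-at-a-time matching idea can be sketched as follows.  This is an assumption-laden illustration, not the OpExecutor API: {{String}} stands in for {{Node}}, tokens starting with {{?}} mark variables, and the BGP is restricted to patterns sharing one subject variable.

```java
import java.util.*;

// Sketch of matching a same-subject BGP against property-table rows:
// instead of triple-by-triple find(), each whole row is tested against
// all patterns at once, producing at most one binding per row.
class BgpOverRows {
    // A pattern is {property, valueOrVariable}; "?"-prefixed tokens are variables.
    static List<Map<String, String>> match(Map<String, Map<String, String>> rows,
                                           String subjVar, List<String[]> patterns) {
        List<Map<String, String>> results = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            Map<String, String> binding = new LinkedHashMap<>();
            binding.put(subjVar, row.getKey());
            boolean ok = true;
            for (String[] pat : patterns) {
                String value = row.getValue().get(pat[0]);
                if (value == null) { ok = false; break; }          // missing cell: no match
                if (pat[1].startsWith("?"))
                    binding.put(pat[1], value);                    // bind the variable
                else if (!pat[1].equals(value)) { ok = false; break; }  // constant must match
            }
            if (ok)
                results.add(binding);
        }
        return results;
    }
}
```

This is where custom column indexes would pay off: a constant in a pattern could select candidate rows via an index instead of a full scan.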

There are several important cases architecturally.


> Data Tables for SPARQL
> ----------------------
>
>                 Key: JENA-625
>                 URL: https://issues.apache.org/jira/browse/JENA-625
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Andy Seaborne
>              Labels: gsoc, gsoc2014, java, linked_data, mentor, rdf, sparql
>
> Temporary tables are used for keeping intermediate results available for 
> reuse with the same query or update, or use by a subsequent query.
> This project will provide temporary tables for one or both of these use cases:
> # implicit use of temporary tables for precomputed parts of basic graph 
> patterns 
> # explicit use of named temporary tables (e.g. "FROM TABLE ...")
> This project requires problem definition, solution design as well as 
> implementation.  Both use cases require modification of the SPARQL query 
> engine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
