Dear Andy,

I greatly appreciate your detailed explanations. I've studied all the
examples and the links you mentioned. I'll try to summarise here with
further questions below:

1. We have two possible approaches for the project:
"variables-as-columns" and "property tables". I understand both ideas
now, thanks to your explanations. The former has the issues you pointed
out, and the latter seems to make more sense for users. Do you mean we
should discard the former and focus on the latter in this project?

2. We can take some lessons from the SQL-to-RDF work, but CSV (even
regular-shaped CSV) differs from a database in some ways, which requires
digging deeper into the details. Some questions:
2.1 How do we determine the data type of a column? All values in a CSV
are first parsed as strings, line by line. Suppose the parser finds the
numeric string "123000.0": how do we know whether it should become an
integer, a float/double, or just a plain string in RDF? (A rough sketch
of one possible fallback follows after this list.)
2.2 How do we deal with namespaces? RDF requires subjects and predicates
to be URIs, so we need to pass in namespaces (or just a default
namespace) and build URIs by combining them with the values in the CSV.
Things get more complicated if different columns are bound to different
namespaces. (See the second sketch after this list.)
2.3 The HP 2006 report [1] says "Jena supports three kinds of property
tables as well as a triple store". The "town" example you provided
conforms to the "single-valued" property table. Shall we consider the
others (e.g. the "multi-valued" one and the "triple store" one) in this
project? Does the latest Jena release still support these property
tables? If so, where is the related source code?
2.4 There is no "primary key" definition in CSV, and plain RDF is not
OWL. How do we know that a column in a CSV is uniquely defining? CSV
seems to lack any kind of "metadata" about its columns and values. If we
had such metadata, how would we pass in the namespace or an IRI template
like http://data/town/{Town} (related to question 2.2; the second sketch
after this list also touches on this)?
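
To make 2.1 concrete, here is a minimal sketch of the kind of fallback I
have in mind: try integer, then double, otherwise keep a plain string
literal. This is only my guesswork, not existing Jena code; the class and
method names are invented, and I'm assuming the org.apache.jena package
names of recent Jena releases.

-------------
import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

/** Sketch only: guess an RDF term for a CSV cell parsed as a String. */
public class CellTyperSketch {
    public static Node toNode(String cell) {
        try {
            Long.parseLong(cell);          // e.g. "123000"
            return NodeFactory.createLiteral(cell, XSDDatatype.XSDinteger);
        } catch (NumberFormatException e) { /* not an integer */ }
        try {
            Double.parseDouble(cell);      // e.g. "123000.0"
            return NodeFactory.createLiteral(cell, XSDDatatype.XSDdouble);
        } catch (NumberFormatException e) { /* not numeric */ }
        return NodeFactory.createLiteral(cell);   // fall back to a plain string
    }
}
-------------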
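
For 2.2/2.4, a similar sketch of how a column namespace and an IRI
template such as http://data/town/{Town} might be applied to one row.
Again this is only illustrative: the template syntax, the helper names
and the row representation (column name -> cell value) are my own
assumptions, and real code would need proper IRI escaping of the values.

-------------
import java.util.Map;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

/** Sketch only: build predicate/subject URIs from a namespace or template. */
public class UriBuilderSketch {
    /** Predicate URI = namespace + column name, e.g. :Population */
    public static Node predicateFor(String namespace, String columnName) {
        return NodeFactory.createURI(namespace + columnName);
    }

    /** Subject URI from a template such as "http://data/town/{Town}". */
    public static Node subjectFor(String template, Map<String, String> row) {
        String uri = template;
        for (Map.Entry<String, String> cell : row.entrySet()) {
            // Real code would IRI-escape the cell value here.
            uri = uri.replace("{" + cell.getKey() + "}", cell.getValue());
        }
        return NodeFactory.createURI(uri);
    }
}
-------------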

3. For the "property tables" approach, it seems that all we need to do
is resolve the problems in 2. and code "GraphCSV" accordingly. I can
write the GraphCSV class by implementing the Graph interface; in this
way, as far as Jena ARQ is concerned, a CSV table is simply a Graph, no
different from any other kind of Graph (a rough sketch follows below).
It looks like there is no need to introduce TABLE and FROM TABLE clauses
into the SPARQL grammar. We can just use the existing GRAPH, FROM and
FROM NAMED clauses for the CSV "property tables", can't we?
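
To show what I mean by "implementing the Graph interface", here is a very
rough sketch of my own (not existing Jena code): naive comma splitting
with no quoting, plain string literals only (datatypes as in 2.1), a
made-up column namespace, no csv:row numbers, and recent Jena package
names assumed. It extends GraphBase, so ARQ can query it like any other
Graph, and ModelFactory.createModelForGraph(...) would wrap it as a Model
exactly as in your example.

-------------
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.graph.impl.GraphBase;
import org.apache.jena.util.iterator.ExtendedIterator;
import org.apache.jena.util.iterator.WrappedIterator;

/** Sketch only: a regular-shaped CSV file exposed as a read-only Graph. */
public class GraphCSVSketch extends GraphBase {
    private static final String NS = "http://example/table#";  // assumed namespace
    private final List<Triple> triples = new ArrayList<>();

    public GraphCSVSketch(String csvFile) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(csvFile))) {
            String[] header = in.readLine().split(",");   // naive: no quoting
            String line;
            while ((line = in.readLine()) != null) {
                String[] cells = line.split(",");
                Node row = NodeFactory.createBlankNode(); // one subject per data row
                for (int i = 0; i < header.length && i < cells.length; i++) {
                    Node p = NodeFactory.createURI(NS + header[i].trim());
                    Node o = NodeFactory.createLiteral(cells[i].trim());
                    triples.add(Triple.create(row, p, o));
                }
            }
        }
    }

    @Override
    protected ExtendedIterator<Triple> graphBaseFind(Triple pattern) {
        // Linear scan is enough for a sketch; a real GraphCSV could exploit
        // the regular column structure for faster lookups.
        List<Triple> matched = new ArrayList<>();
        for (Triple t : triples)
            if (pattern.matches(t))
                matched.add(t);
        return WrappedIterator.create(matched.iterator());
    }
}
-------------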

Best regards,
Ying Jiang

[1] http://www.hpl.hp.com/techreports/2006/HPL-2006-140.pdf



On Mon, Mar 10, 2014 at 10:50 PM, Andy Seaborne <a...@apache.org> wrote:
> Hi Ying,
>
> Good questions.  I'll try to give a response to the specific points you've
> brought up, but there is also a different approach I want to put forward
> for discussion.
>
> I'll write up a first draft of a project plan, then we can see if the size
> and scope are realistic.
>
> You asked about whether variables are column names.  That is how TARQL and
> SPARQL VALUES work, but I've realised there is a different approach, and
> it's one that will give a better system.  It is to translate the CSV to
> RDF, and this may be materialized or dynamically mapped. If "materialized"
> it's likely to be a lot bigger; as "property tables" or something inspired
> by that idea, it'll be more compact.
>
> Some issues with variables-as-columns include:
>
> 1/ Fixed variable names don't combine with other parts of a query pattern
> very well.
>
> If there is common use of the same name, it's a join - that's what a
> natural join in SQL is.  If there are two tables, then ?a is overloaded.
> If column names are used to derive a variable name, we may not want to
> equate them in the query because column names in different CSV files
> weren't designed with that in mind.
>
> 2/ You can't describe (in RDF) the data very easily - e.g. annotate that a
> column is of years.
>
> 3/  It needs the language to change (i.e. TABLE to access it)
>
> In TARQL, which focuses on a controlled transform from CSV to RDF, it
> works out quite nicely - variables go into the CONSTRUCT template. It
> produces RDF.
>
> Property tables are a style of approach where the CSV data is accessed as
> RDF.
>
> The data table columns become predicate URIs.  The data table itself is an RDF
> graph of regular structure.  It can be accessed with normal (unmodified)
> SPARQL syntax. It would be better if the storage and execution of that part
> of the SPARQL query were adapted to such regular data.  Something for after
> getting an initial cut down.
>
> Suppose we have a CSV file:
> -------------------
> Town,Population
> Southton,123000
> Northville,654000
> -------------------
>
> One header row, two data rows.
>
> Aside: this is regular-shaped CSV (and some CSV files are definitely not
> regular at all!). There is the current editors' working draft from the CSV
> on the Web Working Group (not yet published, likely to change, only part of
> the picture, etc etc)
>
> http://w3c.github.io/csvw/syntax/
>
> which is defining a more regular form of data out of CSV.  This is the
> target for the CSV work: table-shaped CSV, not arbitrary, irregularly
> shaped CSV.
>
> There is no way the working group will have standardised any CSV to RDF
> mapping in the lifetime of the GSoC project but the WG charter says it must
> be covered.  So the mapping below is made up and ahead of where the working
> group is currently but a standardized, "direct mapping" (no metadata, no
> templates) style is going to happen.  The mapping details may change but the
> general approach is clear.
>
> As RDF this might be
>
> -------------
> @prefix : <http://example/table> .
> @prefix csv: <http://w3c/future-csv-vocab/> .
>
> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
> [ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
> -------------
>
> or without the bnode abbreviation:
>
> -------------
> @prefix : <http://example/table> .
> @prefix csv: <http://w3c/future-csv-vocab/> .
>
> _:b0  csv:row 1 ;
>       :Town "Southton" ;
>       :Population 123000 .
>
> _:b1  csv:row 2 ;
>       :Town "Northville" ;
>       :Population 654000 .
> -------------
>
>
> Each row models one "entity" (here, a population observation). There is a
> subject (a blank node) and one predicate-value pair for each cell of the
> row.  Row numbers are added because they can be important.
>
> Background:
>
> A related idea for property tables has come up before
>
>   http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
>
> That paper should only be taken as giving a flavour. The motivation was
> different, more about making RDF look like a regular database, especially
> when the data is regular.  At the workshop last week, I talked to Orri
> Erling (OpenLink/Virtuoso) and apparently, maybe by parallel evolution,
> Virtuoso does something similar.
>
>
> Aside:
> There is a whole design space (outside this project) for translating CSV to
> RDF.
>
> Just if anyone is interested: see the related SQL-to-RDF work:
>
> http://www.w3.org/TR/r2rml/
> http://www.w3.org/TR/rdb-direct-mapping/
>
> If the metadata said that one of the columns was uniquely defining (a
> primary key in SQL terms, or an inverse functional property in OWL terms),
> we wouldn't need blank nodes at all - we could use a URI template. For
> example, if town names were unique (they are not!), an IRI template of
> http://data/town/{Town} would give:
>
> -------------
> @prefix : <http://example/table> .
> @prefix csv: <http://w3c/future-csv-vocab/> .
> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>
> <http://data/town/Southton>
>       csv:row 1 ;
>       rdfs:label "Southton" ;
>       :Population 123000 .
>
> <http://data/town/Northville>
>       csv:row 2 ;
>       rdfs:label "Northville" ;
>       :Population 654000 .
> -------------
>
> Doing this transformation in rules is one route.  JENA-650 connection?
> </aside>
>
> In SPARQL:
>
> Now the CSV file is viewed as a graph - normal, unmodified SPARQL can be
> used.  Multiple CSV files can be multiple graphs in one dataset, giving
> query across different data sources.
>
> # Towns over 500,000 people.
> PREFIX : <http://example/table>
> SELECT ?townName ?pop {
>   GRAPH <http://example/population> {
>     ?x :Town ?townName ;
>        :Population ?pop .
>     FILTER(?pop > 500000)
>   }
> }
>
>
> A few comments inline - the bulk of this message is above.
>
> I hope this makes some sense.  Having spent time with people who really do
> work with CSV files last week around the linked geospatial workshop, the
> user needs and requirements are much clearer.
>
>         Andy
>
> PS I was on a panel that included mentioning the work you did last year.  It
> went well.
>
> On 07/03/14 12:10, Ying Jiang wrote:
> ...
>
>>>> 2. Storage of the table (in-memory is enough, with reading from a file).
>>>>    - Questions:
>>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>>> the table after the query execution, or keep it in-memory for later
>>>> reuse with the same query or update, or use by a subsequent query?
>>>> When will the table be discarded?
>>>
>>>
>>>
>>> That'll need refining, but a way to read and reuse.  There needs to be a
>>> way for the app to pass in tables (a Map<String, ???> and a tool for
>>> reading CSVs to get the ???) because ...
>>
>>
>> When will the tables be passed in? TARQL loads the CSVs when parsing
>> the SPARQL query string. Shall we load the tables and create the Map
>> before querying and cache them for reuse? This could be similar to
>> querying a Dataset, and the simplest way goes something like:
>>
>> DataTableMap<String, DataTable> dtm =
>> DataTableSetFactory.createDataTableMap(); // The keys of dtm are the
>> URIs of the DataTables loaded.
>> dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The
>> table data are loaded when added into the map.
>> dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or
>> the table data are *lazy* loaded during querying later on, i.e. not
>> loaded now.
>> Query query = QueryFactory.create(queryString) ; // New .jj will be
>> created for parsing TABLE and FROM TABLE clauses. However the
>> QueryFactory interface remains the same as before.
>> QueryExecution qExec = QueryExecutionFactory.create(query, model,
>> dtm) ; // New create method for QueryExecutionFactory to accommodate
>> dtm
>> ... //dtm can be reused later on for other QueryExecutions, or be
>> discarded when the app ends.
>>
>> Is the above what you mean? Any comments?
>
>
> Yes, using TABLE.
>
> With property tables it can be done as
>
> // Default graph of the dataset
>
> Model csv1 =
>   ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
> QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;
>
> or for multiple CSV files and/or other RDF data:
>
> Model csv1 =
>   ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
> Model csv2 =
>   ModelFactory.createModelForGraph(new GraphCSV("data2.csv")) ;
>
> Dataset dataset = ... ;
> dataset.addNamedModel("http://example/population", csv1) ;
> dataset.addNamedModel("http://example/table2", csv2) ;
>
> ... normal SPARQL execution ...
>
>
>>>
>>>
>>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>>    - Questions:
>>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>>
>>>
>>>
>>> FROM TABLE would be one way to get tables into the query, as would
>>> passing it in via the query context.
>>>
>>> Queries can't be assumed to
>>>
>>> TABLE in a query is accessing the table, using it to get the
>>>
>>> TARQL, and I've only read the documentation, is a query over a single CSV
>>> file.  This project should be about multiple CSVs and combining with
>>> other
>>> RDF data.
>>>
>>> A quick sketch and the syntax is not checked as sensible:
>>>
>>> SELECT ... {
>>>    # Fixed column names
>>>    TABLE <uri> {
>>>       BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>>       BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>>       FILTER (?v > 57)
>>>    }
>>> }
>>>
>>> More ambitious to have column naming and FILTERs:
>>>
>>> SELECT ...
>>> WHERE {
>>>
>>>     TABLE <uri> { "col1" AS ?myVar1 ,
>>>                   "col10" AS ?V ,
>>>                   "col5" AS ?appName
>>>                   FILTER(?V > 57) }
>>> }
>>>
>>> creates a set of bindings based on access description.
>>>
>>
>> Is the <uri> after TABLE the key of the Map<String, ???>? If so, I now
>> understand the TABLE clauses from the examples. However, I'm still not
>> sure about FROM TABLE. Could you please show me some example query
>> strings containing FROM TABLE clauses?
>
>
> FROM TABLE would set the map entry.  c.f. FROM NAMED
>
> In this case the name of the table (graph) is the location it comes from -
> it's not a general choice of name.  A common issue for FROM NAMED, not
> specific to CSV processing.
>
