Hi Ying,


---- Quote from the project proposal:
Accessing from SPARQL via Graph.find() will work, but it's not ideal.
Some optimizations can be done for processing a SPARQL basic graph
pattern. More explicitly, in the method of OpExecutor.execute(OpBGP,
...), when the target for the query is a GraphPropertyTable, it can
get a whole row, or rows, of the table data and match the pattern with
the bindings.
----
Can you show me a concrete example of how "getting a whole row, or
rows" can improve a query (e.g. one using SUM/COUNT)? Suppose we have
the data:
----
Town,Population
Southton,123000
Northville,654000
----
If we want to query the total population of the (two or all) towns,
how does changing OpExecutor.execute(OpBGP, ...) help? Here's the
query string:
----
SELECT (sum(?pop) AS ?totalPop) {
  GRAPH <file:///c:/town.csv> {
    ?x :Population ?pop .
  }
}
If the above example is not appropriate for this case (2.3), could
you please show me another one? I'd be very grateful if you could help
me with this question before your holiday.


OpExecutor is how SPARQL is evaluated. The query is parsed, turned into the SPARQL algebra (the Op* classes), optimized (yet more Op* classes, and rearranging the ones from the query), then evaluated. There is one execute(Op...) method for each operator in the SPARQL algebra.
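
If it helps to make that concrete, here is a small standalone sketch of the same pipeline driven by hand (package names assume a current org.apache.jena release; older releases use com.hp.hpl.jena):

import org.apache.jena.graph.Graph;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.sparql.algebra.Algebra;
import org.apache.jena.sparql.algebra.Op;
import org.apache.jena.sparql.engine.QueryIterator;
import org.apache.jena.sparql.graph.GraphFactory;

public class AlgebraWalkthrough {
    public static void main(String[] args) {
        Query query = QueryFactory.create(
            "PREFIX : <http://example/> SELECT ?x ?pop { ?x :Population ?pop }");

        Op op = Algebra.compile(query);        // parse tree -> algebra (OpBGP, OpProject, ...)
        System.out.println(op);                // algebra printed in SSE form

        Op optimized = Algebra.optimize(op);   // the transforms that rearrange the Op* nodes
        System.out.println(optimized);

        Graph graph = GraphFactory.createDefaultGraph();        // empty graph, just to run it
        QueryIterator qIter = Algebra.exec(optimized, graph);   // OpExecutor drives this step
        while (qIter.hasNext())
            System.out.println(qIter.next());
        qIter.close();
    }
}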

OpBGP is the operator for basic graph patterns; a basic graph pattern is a sequence of triple patterns. The default execution is to use Graph.find iteratively, substituting the bindings from earlier triple patterns into later ones; in the case of a single triple pattern it is just one call to find().
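
In spirit, that default strategy is a nested substitute-and-find loop. An illustrative sketch (not the real engine code, which works in terms of Binding objects, and this one ignores repeated variables within a single pattern):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.jena.graph.Graph;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.Triple;
import org.apache.jena.util.iterator.ExtendedIterator;

class NaiveBgpMatcher {
    // Evaluate a list of triple patterns by substitute-and-find:
    // for each pattern, for each partial solution so far, plug in the
    // bound variables and call Graph.find() for the remaining ones.
    static List<Map<String, Node>> match(Graph graph, List<Triple> patterns) {
        List<Map<String, Node>> solutions = new ArrayList<>();
        solutions.add(new HashMap<>());              // start from one empty binding
        for (Triple pattern : patterns) {
            List<Map<String, Node>> next = new ArrayList<>();
            for (Map<String, Node> binding : solutions) {
                Node s = substitute(pattern.getSubject(), binding);
                Node p = substitute(pattern.getPredicate(), binding);
                Node o = substitute(pattern.getObject(), binding);
                ExtendedIterator<Triple> it = graph.find(s, p, o);   // one find() per binding
                while (it.hasNext()) {
                    Triple t = it.next();
                    Map<String, Node> extended = new HashMap<>(binding);
                    bind(pattern.getSubject(), t.getSubject(), extended);
                    bind(pattern.getPredicate(), t.getPredicate(), extended);
                    bind(pattern.getObject(), t.getObject(), extended);
                    next.add(extended);
                }
                it.close();
            }
            solutions = next;
        }
        return solutions;
    }

    // A variable becomes its bound value, or Node.ANY if still unbound.
    private static Node substitute(Node n, Map<String, Node> binding) {
        if (!n.isVariable()) return n;
        Node bound = binding.get(n.getName());
        return bound != null ? bound : Node.ANY;
    }

    // Record the value a matching triple gives to a pattern variable.
    private static void bind(Node patternNode, Node value, Map<String, Node> binding) {
        if (patternNode.isVariable()) binding.put(patternNode.getName(), value);
    }
}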

The order of triple patterns matters for performance (you get the same answers regardless of the order of execution, but better orderings of the patterns traverse fewer rejected intermediate solutions).

Consider, in some general data, not CSV related:

?x :predicate ?value .   ## Triple pattern 1
?x :other 123 .          ## Triple pattern 2

and suppose there are 1000 subjects, each with :predicate and :other triples, and suppose only one of them has the predicate-value pair ":other 123".

If you do the execution in the order above, the first triple pattern yields 1000 possibilities. If the iterative OpExecutor strategy is used, that then means a lot of tests of "?x :other 123", one for each possible ?x.

But if the order is reversed, there is one match for "?x :other 123 ." then one access for "?x :predicate ?value ." (?x is now a fixed value).

In the default setup (in-memory, and any storage that does not override this and add something better; see StageGeneratorGeneric), a fixed reordering is done based on executing the most grounded patterns first. It should choose to do ":other 123" first.
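
You can see that fixed reorder in isolation with something roughly like this (an untested sketch using ReorderLib; check the exact class and method names against the codebase):

import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.sparql.core.BasicPattern;
import org.apache.jena.sparql.core.Var;
import org.apache.jena.sparql.engine.optimizer.reorder.ReorderLib;
import org.apache.jena.sparql.engine.optimizer.reorder.ReorderTransformation;

public class ReorderDemo {
    public static void main(String[] args) {
        Var x = Var.alloc("x");
        Var value = Var.alloc("value");
        BasicPattern bgp = new BasicPattern();
        // The same two patterns as above, deliberately in the bad order.
        bgp.add(Triple.create(x, NodeFactory.createURI("http://example/predicate"), value));
        bgp.add(Triple.create(x, NodeFactory.createURI("http://example/other"),
                              NodeFactory.createLiteral("123", XSDDatatype.XSDinteger)));

        ReorderTransformation fixed = ReorderLib.fixed();   // the default heuristic
        BasicPattern reordered = fixed.reorder(bgp);
        System.out.println(reordered);   // the more grounded ":other 123" pattern should now be first
    }
}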

In the CSV case, there may be better things to do that are specific to CSV and the property table storage. This needs investigation and design before implementation; I don't have a definite answer.

It will depend a lot on the data structures for the property tables - this is where all the pieces fit together.

One possibility is avoiding generating triples.  If the pattern is

?x :prop1 ?v .
?x :prop2 ?w .

what's needed is bindings (a row in SPARQL results) for (?x, ?v, ?w). There is no need to create a lot of triples via find().
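
As a very rough sketch, with a made-up, column-oriented table class just to show the shape of it, the whole-row execution for that pattern could be:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.apache.jena.graph.Node;

// Hypothetical storage: one subject node per row, one list of cell
// values per predicate (column). Not the actual GraphPropertyTable design.
class SimplePropertyTable {
    final List<Node> rowSubjects = new ArrayList<>();
    final Map<Node, List<Node>> columns = new LinkedHashMap<>();
}

class WholeRowExecution {
    // For { ?x :prop1 ?v . ?x :prop2 ?w } emit one (?x, ?v, ?w) binding
    // per row, reading the two columns directly - no Triple objects,
    // no find(), no join.
    static List<Map<String, Node>> execute(SimplePropertyTable table, Node prop1, Node prop2) {
        List<Node> col1 = table.columns.get(prop1);
        List<Node> col2 = table.columns.get(prop2);
        List<Map<String, Node>> results = new ArrayList<>();
        for (int row = 0; row < table.rowSubjects.size(); row++) {
            Map<String, Node> binding = new HashMap<>();
            binding.put("x", table.rowSubjects.get(row));
            binding.put("v", col1.get(row));
            binding.put("w", col2.get(row));
            results.add(binding);
        }
        return results;
    }
}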

Other patterns that might have optimizations are where the subject is not the same in each pattern:

?x :prop1 ?y .
?y :prop2 ?w .

Better execution of aggregates like SUM is also possible (in your example, it could be a matter of traversing the "Population" column and adding up the values, avoiding creating any triples), but I'm thinking about basic graph patterns for the moment.
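
For the SUM example, that could be as little as the following (assuming the column holds typed numeric literal nodes; the table class is the same made-up one as above):

import java.util.List;
import org.apache.jena.graph.Node;

class ColumnSum {
    // SUM(?pop) over the "Population" column: one pass over the cells,
    // no Triple objects and no per-row bindings ever created.
    static long sumColumn(List<Node> populationColumn) {
        long total = 0;
        for (Node cell : populationColumn) {
            // getLiteralValue() gives the parsed Java value
            // (Integer/Long/BigInteger for xsd:integer cells).
            total += ((Number) cell.getLiteralValue()).longValue();
        }
        return total;
    }
}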

Real CSV files are often denormalised tables - lots of repetition from row to row. This might be exploitable.

There are some real-world examples documented in:

http://www.w3.org/TR/csvw-ucr/

with links to the real CSV files.

To investigate, it's probably a good idea to understand how OpExecutor works and also to think of some patterns and see what might be done. More complicated data (more columns, more rows) will make the possible optimizations clearer.

Is there a tool in the Jena code base for parsing a String into a Jena
Literal for different data types? For example:
1) "65000" -> "65000"^^<http://www.w3.org/2001/XMLSchema#integer>
2) "65000.123" -> "65000.123"^^<http://www.w3.org/2001/XMLSchema#double>
3) "true" -> "true"^^<http://www.w3.org/2001/XMLSchema#boolean>
There would be more advanced features like parsing date/time to
xsd:date or xsd:dateTime for CSV values.
The NodeFactory.createLiteral() method does not provide these features.

There is the parser for Turtle and related languages (although 65000.123 will be an xsd:decimal). See TokenizerText.

You may be better off with your own code, using Long.decode or Long.parseLong, and Double.parseDouble, because real data is messy and it can be useful to clean up the input first, for example, to remove "," [*]
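
Something along these lines, with your own choice of type precedence and clean-up rules:

import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

class CsvValueParser {
    // Sketch: try integer, then double, then boolean; otherwise fall
    // back to a plain string literal.
    static Node parseCell(String raw) {
        String s = raw.trim().replace(",", "");   // crude clean-up - see [*] below
        try {
            Long.parseLong(s);
            return NodeFactory.createLiteral(s, XSDDatatype.XSDinteger);
        } catch (NumberFormatException e) { /* not an integer */ }
        try {
            Double.parseDouble(s);
            return NodeFactory.createLiteral(s, XSDDatatype.XSDdouble);
        } catch (NumberFormatException e) { /* not a double */ }
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false"))
            return NodeFactory.createLiteral(s.toLowerCase(), XSDDatatype.XSDboolean);
        return NodeFactory.createLiteral(raw);    // plain string as the fallback
    }
}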

        Andy

[*] Caution - locale sensitive!
