Hi Ying,


---- Quote from the project proposal:
Accessing from SPARQL via Graph.find() will work, but it's not ideal.
Some optimizations can be done for processing a SPARQL basic graph
pattern. More explicitly, in the method of OpExecutor.execute(OpBGP,
...), when the target for the query is a GraphPropertyTable, it can
get a whole row, or rows, of the table data and match the pattern with
the bindings.
----
Can you show me a concrete example of how "getting a whole row, or
rows" can improve a query (e.g. one using SUM/COUNT)? Suppose we have
the data:
----
Town,Population
Southton,123000
Northville,654000
----
If we want to query the total population of the (two or all) towns,
how does changing OpExecutor.execute(OpBGP, ...) help? Here's the
query string:
----
SELECT (sum(?pop) AS ?totalPop) {
  GRAPH <file:///c:/town.csv> {
    ?x :Population ?pop .
  }
}
If the above example is not appropriate for this case (2.3), could
you please show me another one? I'd be very grateful if you could help
me with this question before your holiday.


OpExecutor is how SPARQL is evaluated. The query is parsed, turned into the SPARQL algebra (the Op* classes), optimized (yet more Op* classes, and rearranging the ones from the query), then evaluated. There is one execute(Op...) method for each operator in the SPARQL algebra.
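
If it helps to make that concrete, here is a small standalone sketch of the same pipeline driven by hand (package names assume a current org.apache.jena release; older releases use com.hp.hpl.jena):

import org.apache.jena.graph.Graph;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.sparql.algebra.Algebra;
import org.apache.jena.sparql.algebra.Op;
import org.apache.jena.sparql.engine.QueryIterator;
import org.apache.jena.sparql.graph.GraphFactory;

public class AlgebraWalkthrough {
    public static void main(String[] args) {
        Query query = QueryFactory.create(
            "PREFIX : <http://example/> SELECT ?x ?pop { ?x :Population ?pop }");

        Op op = Algebra.compile(query);        // parse tree -> algebra (OpBGP, OpProject, ...)
        System.out.println(op);                // algebra printed in SSE form

        Op optimized = Algebra.optimize(op);   // the transforms that rearrange the Op* nodes
        System.out.println(optimized);

        Graph graph = GraphFactory.createDefaultGraph();        // empty graph, just to run it
        QueryIterator qIter = Algebra.exec(optimized, graph);   // OpExecutor drives this step
        while (qIter.hasNext())
            System.out.println(qIter.next());
        qIter.close();
    }
}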

OpBGP is the operator for basic graph patterns; a basic graph pattern is a sequence of triple patterns. The default execution is to use Graph.find iteratively, substituting the bindings from earlier triple patterns into later ones; in the case of a single triple pattern it is just one call to find().
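
In spirit, that default strategy is a nested substitute-and-find loop. An illustrative sketch (not the real engine code, which works in terms of Binding objects, and this one ignores repeated variables within a single pattern):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.jena.graph.Graph;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.Triple;
import org.apache.jena.util.iterator.ExtendedIterator;

class NaiveBgpMatcher {
    // Evaluate a list of triple patterns by substitute-and-find:
    // for each pattern, for each partial solution so far, plug in the
    // bound variables and call Graph.find() for the remaining ones.
    static List<Map<String, Node>> match(Graph graph, List<Triple> patterns) {
        List<Map<String, Node>> solutions = new ArrayList<>();
        solutions.add(new HashMap<>());              // start from one empty binding
        for (Triple pattern : patterns) {
            List<Map<String, Node>> next = new ArrayList<>();
            for (Map<String, Node> binding : solutions) {
                Node s = substitute(pattern.getSubject(), binding);
                Node p = substitute(pattern.getPredicate(), binding);
                Node o = substitute(pattern.getObject(), binding);
                ExtendedIterator<Triple> it = graph.find(s, p, o);   // one find() per binding
                while (it.hasNext()) {
                    Triple t = it.next();
                    Map<String, Node> extended = new HashMap<>(binding);
                    bind(pattern.getSubject(), t.getSubject(), extended);
                    bind(pattern.getPredicate(), t.getPredicate(), extended);
                    bind(pattern.getObject(), t.getObject(), extended);
                    next.add(extended);
                }
                it.close();
            }
            solutions = next;
        }
        return solutions;
    }

    // A variable becomes its bound value, or Node.ANY if still unbound.
    private static Node substitute(Node n, Map<String, Node> binding) {
        if (!n.isVariable()) return n;
        Node bound = binding.get(n.getName());
        return bound != null ? bound : Node.ANY;
    }

    // Record the value a matching triple gives to a pattern variable.
    private static void bind(Node patternNode, Node value, Map<String, Node> binding) {
        if (patternNode.isVariable()) binding.put(patternNode.getName(), value);
    }
}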

The order of triple patterns matters for performance (you get the same answers regardless of the order of execution, but better orderings of the patterns traverse fewer rejected intermediate solutions).

Consider, in some general data, not CSV related:

?x :predicate ?value .   ## Triple pattern 1
?x :other 123 .          ## Triple pattern 2

and suppose there are 1000 subjects, each with :predicate and :other triples, and suppose only one of them has the predicate-value pair ":other 123".

If you do the execution in the order above, the first triple pattern yields 1000 possibilities. If the iterative OpExecutor strategy is used, that then means a lot of tests of "?x :other 123", one for each possible ?x.

But if the order is reversed, there is one match for "?x :other 123 ." then one access for "?x :predicate ?value ." (?x is now a fixed value).

In the default setup (in-memory, and any storage that does not override this and add something better; see StageGeneratorGeneric), a fixed reordering is done based on executing the most grounded patterns first. It should choose to do ":other 123" first.
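
You can see that fixed reorder in isolation with something roughly like this (an untested sketch using ReorderLib; check the exact class and method names against the codebase):

import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.sparql.core.BasicPattern;
import org.apache.jena.sparql.core.Var;
import org.apache.jena.sparql.engine.optimizer.reorder.ReorderLib;
import org.apache.jena.sparql.engine.optimizer.reorder.ReorderTransformation;

public class ReorderDemo {
    public static void main(String[] args) {
        Var x = Var.alloc("x");
        Var value = Var.alloc("value");
        BasicPattern bgp = new BasicPattern();
        // The same two patterns as above, deliberately in the bad order.
        bgp.add(Triple.create(x, NodeFactory.createURI("http://example/predicate"), value));
        bgp.add(Triple.create(x, NodeFactory.createURI("http://example/other"),
                              NodeFactory.createLiteral("123", XSDDatatype.XSDinteger)));

        ReorderTransformation fixed = ReorderLib.fixed();   // the default heuristic
        BasicPattern reordered = fixed.reorder(bgp);
        System.out.println(reordered);   // the more grounded ":other 123" pattern should now be first
    }
}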

In the CSV case, there may be better things to do that are specific to CSV and the property table storage. This needs investigation and design before implementation; I don't have a definite answer.

It will depend a lot on the data structures for the property tables - this is where all the pieces fit together.

One possibility is avoiding generating triples.  If the pattern is

?x :prop1 ?v .
?x :prop2 ?w .

what's needed is bindings (a row in SPARQL results) for (?x, ?v, ?w). There is no need to create a lot of triples via find().
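
As a very rough sketch, with a made-up, column-oriented table class just to show the shape of it, the whole-row execution for that pattern could be:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.apache.jena.graph.Node;

// Hypothetical storage: one subject node per row, one list of cell
// values per predicate (column). Not the actual GraphPropertyTable design.
class SimplePropertyTable {
    final List<Node> rowSubjects = new ArrayList<>();
    final Map<Node, List<Node>> columns = new LinkedHashMap<>();
}

class WholeRowExecution {
    // For { ?x :prop1 ?v . ?x :prop2 ?w } emit one (?x, ?v, ?w) binding
    // per row, reading the two columns directly - no Triple objects,
    // no find(), no join.
    static List<Map<String, Node>> execute(SimplePropertyTable table, Node prop1, Node prop2) {
        List<Node> col1 = table.columns.get(prop1);
        List<Node> col2 = table.columns.get(prop2);
        List<Map<String, Node>> results = new ArrayList<>();
        for (int row = 0; row < table.rowSubjects.size(); row++) {
            Map<String, Node> binding = new HashMap<>();
            binding.put("x", table.rowSubjects.get(row));
            binding.put("v", col1.get(row));
            binding.put("w", col2.get(row));
            results.add(binding);
        }
        return results;
    }
}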

Other patterns that might have optimizations are where the subject is not the same in each pattern:

?x :prop1 ?y .
?y :prop2 ?w .

Better execution of aggregates like SUM is also possible (in your example, it could be a matter of traversing the "Population" column and adding up the values, avoiding creating any triples), but I'm thinking about basic graph patterns for the moment.
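
For the SUM example, that could be as little as the following (assuming the column holds typed numeric literal nodes; the table class is the same made-up one as above):

import java.util.List;
import org.apache.jena.graph.Node;

class ColumnSum {
    // SUM(?pop) over the "Population" column: one pass over the cells,
    // no Triple objects and no per-row bindings ever created.
    static long sumColumn(List<Node> populationColumn) {
        long total = 0;
        for (Node cell : populationColumn) {
            // getLiteralValue() gives the parsed Java value
            // (Integer/Long/BigInteger for xsd:integer cells).
            total += ((Number) cell.getLiteralValue()).longValue();
        }
        return total;
    }
}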

Real CSV files are often denormalised tables - lots of repetition from row to row. This might be exploitable.

There are some real-world examples documented in:

http://www.w3.org/TR/csvw-ucr/

with links to the real CSV files.

To investigate, it's probably a good idea to understand how OpExecutor works and also to think of some patterns and see what might be done. More complicated data (more columns, more rows) will make the possible optimizations clearer.

Is there a tool in the Jena code base for parsing a String into a Jena
Literal for different data types? For example:
1) "65000" -> "65000"^^<http://www.w3.org/2001/XMLSchema#integer>
2) "65000.123" -> "65000.123"^^<http://www.w3.org/2001/XMLSchema#double>
3) "true" -> "true"^^<http://www.w3.org/2001/XMLSchema#boolean>
There would be more advanced features like parsing date/time to
xsd:date or xsd:dateTime for CSV values.
The NodeFactory.createLiteral() method does not provide these features.

There is the parser for Turtle and related languages (although 65000.123 will be an xsd:decimal). See TokenizerText.

You may be better off with your own code, using Long.decode or Long.parseLong, and Double.parseDouble, because real data is messy and it can be useful to clean up the input first, for example, to remove "," [*]
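
Something along these lines, with your own choice of type precedence and clean-up rules:

import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

class CsvValueParser {
    // Sketch: try integer, then double, then boolean; otherwise fall
    // back to a plain string literal.
    static Node parseCell(String raw) {
        String s = raw.trim().replace(",", "");   // crude clean-up - see [*] below
        try {
            Long.parseLong(s);
            return NodeFactory.createLiteral(s, XSDDatatype.XSDinteger);
        } catch (NumberFormatException e) { /* not an integer */ }
        try {
            Double.parseDouble(s);
            return NodeFactory.createLiteral(s, XSDDatatype.XSDdouble);
        } catch (NumberFormatException e) { /* not a double */ }
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false"))
            return NodeFactory.createLiteral(s.toLowerCase(), XSDDatatype.XSDboolean);
        return NodeFactory.createLiteral(raw);    // plain string as the fallback
    }
}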

        Andy

[*] Caution - locale sensitive!
