Dear Andy,

Thanks a lot! I've been studying the source code of the SPARQL algebra,
following your instructions, over the past few days.
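In outline, the try-out splits a basic graph pattern by subject and dispatches each group either to a whole-row query or to a plain find() lookup. Here is a toy sketch of that split-and-dispatch idea in plain Java (no Jena types; the triple representation and all names below are invented for illustration, not the committed API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BgpSplitSketch {

    // A triple pattern as a plain (subject, predicate, object) array.
    // Group the patterns of a basic graph pattern by their subject,
    // preserving the order in which subjects first appear.
    static Map<String, List<String[]>> groupBySubject(List<String[]> bgp) {
        Map<String, List<String[]>> groups = new LinkedHashMap<>();
        for (String[] pattern : bgp) {
            groups.computeIfAbsent(pattern[0], s -> new ArrayList<>()).add(pattern);
        }
        return groups;
    }

    // Dispatch: a group with 2+ patterns becomes one "row query" over the
    // property table; a single-pattern group falls back to a find() lookup.
    static String dispatch(List<String[]> subBgp) {
        return subBgp.size() > 1 ? "row-query" : "find";
    }

    public static void main(String[] args) {
        List<String[]> bgp = List.of(
                new String[]{"?x", ":prop1", "?v"},
                new String[]{"?x", ":prop2", "?w"},
                new String[]{"?y", ":prop1", "?u"});
        groupBySubject(bgp).forEach((subject, group) ->
                System.out.println(subject + " -> " + dispatch(group)));
        // ?x -> row-query
        // ?y -> find
    }
}
```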
I've made a try-out of the optimization of querying a whole row in the
property table. The following query pattern can be transformed into a row
query, without generating triples:

?x :prop1 ?v .
?x :prop2 ?w .
...

It's implemented using the extension point of StageGenerator, because for
now it's only concerned with BasicPattern. The detailed workflow goes this
way:

1. Split the incoming BasicPattern by subject, i.e. it becomes multiple
sub-BasicPatterns, grouped by common subject (see QueryIterPropertyTable
[1]).
2. For each sub-BasicPattern: if it holds more than one triple pattern
(i.e. at least 2 triples), it's turned into a row query and processed by
QueryIterPropertyTableRow [2]; if it contains only 1 triple, it goes
through the traditional triple querying via graph.graphBaseFind().

The code has been committed. I'll think about other optimization
approaches and test with the real-life large CSV data you provided later
on. Any suggestions are very welcome!

Best regards,
Ying Jiang

[1] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/impl/QueryIterPropertyTable.java
[2] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/impl/QueryIterPropertyTableRow.java

On Mon, Jun 30, 2014 at 9:47 PM, Andy Seaborne <a...@seaborne.org> wrote:
> Hi Ying,
>
>> ---- Quote from the project proposal:
>> Accessing from SPARQL via Graph.find() will work, but it's not ideal.
>> Some optimizations can be done for processing a SPARQL basic graph
>> pattern. More explicitly, in the method of OpExecutor.execute(OpBGP,
>> ...), when the target for the query is a GraphPropertyTable, it can
>> get a whole row, or rows, of the table data and match the pattern with
>> the bindings.
>> ----
>> Can you show me a concrete example of how "getting a whole row, or
>> rows" can improve a query (e.g. SUM/COUNT)?
>> Suppose we have the data:
>> ----
>> Town,Population
>> Southton,123000
>> Northville,654000
>> ----
>> If we want to query the total population of the (two, or all) towns,
>> how does changing OpExecutor.execute(OpBGP, ...) help? Here's the
>> query string:
>> ----
>> SELECT (sum(?pop) AS ?totalPop) {
>>   GRAPH <file:///c:/town.csv> {
>>     ?x :Population ?pop .
>>   }
>> }
>> ----
>> If the above example is not appropriate for this case (2.3), could
>> you please show me another one? I'd be very grateful if you could
>> help me with this question before your holiday.
>
> OpExecutor is the way SPARQL is evaluated. The query is parsed, turned
> into SPARQL algebra (the Op* classes), optimized (yet more Op* classes,
> and rearranging of the ones in the query), then evaluated. There is one
> execute(Op...) for each Op operator in the SPARQL algebra.
>
> OpBGP is the operator for basic graph patterns; a basic graph pattern
> is a sequence of triple patterns. The default is to use Graph.find
> iteratively, one triple pattern at a time.
>
> The order of triple patterns matters for performance (you get the same
> answers regardless of the order of execution, but better orderings of
> the pattern traverse fewer negative/rejected intermediate solutions).
>
> Consider, in some general data, not CSV related:
>
> ?x :predicate ?value .  ## Triple pattern 1
> ?x :other 123 .         ## Triple pattern 2
>
> and suppose there are 1000 subjects, each with :predicate and :other
> triples, and suppose exactly one of them has the predicate-value pair
> ":other 123".
>
> If you execute in the order above, there are 1000 possibilities from
> the first triple pattern. With the iterative OpExecutor strategy, that
> then means a lot of tests of "?x :other 123", one for each possible ?x.
>
> But if the order is reversed, there is one match for "?x :other 123 ."
> and then one access for "?x :predicate ?value ." (?x is now a fixed
> value).
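The cost difference Andy describes can be reproduced with a toy matcher (plain Java, no Jena; the data, the find() signature, and the wildcard convention are all invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class JoinOrderSketch {

    record Triple(int s, String p, int o) {}

    static int findCalls = 0;

    // A toy find(): scan all triples matching (s, p, o), where -1 means
    // "any". Counts how often the store is probed.
    static List<Triple> find(List<Triple> data, int s, String p, int o) {
        findCalls++;
        List<Triple> out = new ArrayList<>();
        for (Triple t : data) {
            if ((s == -1 || t.s() == s) && t.p().equals(p)
                    && (o == -1 || t.o() == o)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 1000 subjects, each with :predicate and :other; only subject 7
        // carries ":other 123".
        List<Triple> data = new ArrayList<>();
        for (int x = 0; x < 1000; x++) {
            data.add(new Triple(x, ":predicate", x * 10));
            data.add(new Triple(x, ":other", x == 7 ? 123 : -999));
        }

        // Order 1: "?x :predicate ?value" first -> 1000 candidates, then
        // one probe per candidate to test "?x :other 123".
        findCalls = 0;
        for (Triple t : find(data, -1, ":predicate", -1)) {
            find(data, t.s(), ":other", 123);
        }
        System.out.println("predicate-first probes: " + findCalls); // 1001

        // Order 2: "?x :other 123" first -> 1 match, then one probe with
        // ?x already fixed.
        findCalls = 0;
        for (Triple t : find(data, -1, ":other", 123)) {
            find(data, t.s(), ":predicate", -1);
        }
        System.out.println("other-first probes: " + findCalls); // 2
    }
}
```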
> In the default setup (in-memory, and any storage that does not
> override it and add something better; see StageGeneratorGeneric), a
> fixed reordering is done based on finding the most grounded patterns.
> It should choose to do ":other 123" first.
>
> In the CSV case, there may be better things to do that are specific to
> CSV and the property table storage. This needs investigation and design
> before implementation; I don't have a definite answer.
>
> It will depend a lot on the data structures for the property tables -
> this is where all the pieces fit together.
>
> One possibility is avoiding generating triples. If the pattern is
>
> ?x :prop1 ?v .
> ?x :prop2 ?w .
>
> what's needed is bindings (a row in SPARQL results) for (?x, ?v, ?w).
> There is no need to create a lot of triples via find().
>
> Other patterns that might have optimizations are those where the
> subject is not the same in each pattern:
>
> ?x :prop1 ?y .
> ?y :prop2 ?w .
>
> Better execution of aggregates like SUM is also possible (in your
> example, it could be a matter of traversing the "Population" column
> adding up the values, avoiding creating any triples), but I'm thinking
> of basic graph patterns for the moment.
>
> Real CSV files are often denormalised tables - lots of repetition from
> row to row. This might be exploitable.
>
> There are some real-world examples documented in:
>
> http://www.w3.org/TR/csvw-ucr/
>
> with links to the real CSV files.
>
> To investigate, it's probably a good idea to understand how OpExecutor
> works, and also to think of some patterns and see what might be done
> with more complicated data (more columns, more rows).
>
>> Is there a tool in the Jena code base for parsing a String to a Jena
>> Literal for different data types?
>> For example:
>> 1) "65000" -> "65000"^^<http://www.w3.org/2001/XMLSchema#integer>
>> 2) "65000.123" -> "65000.123"^^<http://www.w3.org/2001/XMLSchema#double>
>> 3) "true" -> "true"^^<http://www.w3.org/2001/XMLSchema#boolean>
>> There could be more advanced features, like parsing dates/times in
>> CSV values to xsd:date or xsd:dateTime.
>> The NodeFactory.createLiteral() method does not provide these features.
>
> There is the parser for Turtle and related languages (although
> 65000.123 will be an xsd:decimal). See TokenizerText.
>
> You may be better off with your own code, using Long.decode or
> Long.parseLong, Double.decode or Double.parseDouble, because real data
> is messy and it could be useful to clean up the input first, for
> example, to remove "," [*].
>
>     Andy
>
> [*] Caution - locale sensitive!
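Hand-rolled datatype sniffing along the lines Andy suggests might look like this sketch (the method name, return shape, and fallback order are assumptions, not Jena API; the "," cleanup is deliberately naive and, as Andy's footnote warns, locale sensitive):

```java
public class CsvLiteralSketch {

    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Map a raw CSV cell to a (lexical form, datatype URI) pair, trying
    // boolean, then integer, then double, and falling back to a plain string.
    static String[] sniff(String raw) {
        String s = raw.trim();
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false")) {
            return new String[]{s.toLowerCase(), XSD + "boolean"};
        }
        String cleaned = s.replace(",", "");   // naive cleanup; locale sensitive!
        try {
            return new String[]{Long.toString(Long.parseLong(cleaned)), XSD + "integer"};
        } catch (NumberFormatException ignored) { }
        try {
            return new String[]{Double.toString(Double.parseDouble(cleaned)), XSD + "double"};
        } catch (NumberFormatException ignored) { }
        return new String[]{raw, XSD + "string"};
    }

    public static void main(String[] args) {
        for (String cell : new String[]{"65000", "65,000", "65000.123", "true", "Southton"}) {
            String[] lit = sniff(cell);
            System.out.println(cell + " -> \"" + lit[0] + "\"^^<" + lit[1] + ">");
        }
    }
}
```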