Dear Andy,

Thanks a lot! I've been studying the source code of the SPARQL algebra,
following your instructions, over the past few days.
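In outline, the try-out splits a basic graph pattern by subject and dispatches each group either to a whole-row query or to a plain find() lookup. Here is a toy sketch of that split-and-dispatch idea in plain Java (no Jena types; the triple representation and all names below are invented for illustration, not the committed API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BgpSplitSketch {

    // A triple pattern as a plain (subject, predicate, object) array.
    // Group the patterns of a basic graph pattern by their subject,
    // preserving the order in which subjects first appear.
    static Map<String, List<String[]>> groupBySubject(List<String[]> bgp) {
        Map<String, List<String[]>> groups = new LinkedHashMap<>();
        for (String[] pattern : bgp) {
            groups.computeIfAbsent(pattern[0], s -> new ArrayList<>()).add(pattern);
        }
        return groups;
    }

    // Dispatch: a group with 2+ patterns becomes one "row query" over the
    // property table; a single-pattern group falls back to a find() lookup.
    static String dispatch(List<String[]> subBgp) {
        return subBgp.size() > 1 ? "row-query" : "find";
    }

    public static void main(String[] args) {
        List<String[]> bgp = List.of(
                new String[]{"?x", ":prop1", "?v"},
                new String[]{"?x", ":prop2", "?w"},
                new String[]{"?y", ":prop1", "?u"});
        groupBySubject(bgp).forEach((subject, group) ->
                System.out.println(subject + " -> " + dispatch(group)));
        // ?x -> row-query
        // ?y -> find
    }
}
```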
I've made a try-out of the optimization of querying a whole row in the
property table. The following query pattern can be transformed into a row
query, without generating triples:

?x :prop1 ?v .
?x :prop2 ?w .
...

It's implemented using the extension point of StageGenerator, because for
now it's only concerned with BasicPattern. The detailed workflow goes this
way:

1. Split the incoming BasicPattern by subject, i.e. it becomes multiple
sub-BasicPatterns, grouped by common subject (see QueryIterPropertyTable
[1]).
2. For each sub-BasicPattern: if it holds more than one triple pattern
(i.e. at least 2 triples), it's turned into a row query and processed by
QueryIterPropertyTableRow [2]; if it contains only 1 triple, it goes
through the traditional triple querying via graph.graphBaseFind().

The code has been committed. I'll think about other optimization
approaches and test with the real-life large CSV data you provided later
on. Any suggestions are very welcome!

Best regards,
Ying Jiang

[1] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/impl/QueryIterPropertyTable.java
[2] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/impl/QueryIterPropertyTableRow.java

On Mon, Jun 30, 2014 at 9:47 PM, Andy Seaborne <a...@seaborne.org> wrote:
> Hi Ying,
>
>> ---- Quote from the project proposal:
>> Accessing from SPARQL via Graph.find() will work, but it's not ideal.
>> Some optimizations can be done for processing a SPARQL basic graph
>> pattern. More explicitly, in the method of OpExecutor.execute(OpBGP,
>> ...), when the target for the query is a GraphPropertyTable, it can
>> get a whole row, or rows, of the table data and match the pattern with
>> the bindings.
>> ----
>> Can you show me a concrete example of how "getting a whole row, or
>> rows" can improve a query (e.g. SUM/COUNT)?
>> Suppose we have the data:
>> ----
>> Town,Population
>> Southton,123000
>> Northville,654000
>> ----
>> If we want to query the total population of the (two, or all) towns,
>> how does changing OpExecutor.execute(OpBGP, ...) help? Here's the
>> query string:
>> ----
>> SELECT (sum(?pop) AS ?totalPop) {
>>   GRAPH <file:///c:/town.csv> {
>>     ?x :Population ?pop .
>>   }
>> }
>> ----
>> If the above example is not appropriate for this case (2.3), could
>> you please show me another one? I'd be very grateful if you could
>> help me with this question before your holiday.
>
> OpExecutor is the way SPARQL is evaluated. The query is parsed, turned
> into SPARQL algebra (the Op* classes), optimized (yet more Op* classes,
> and rearranging of the ones in the query), then evaluated. There is one
> execute(Op...) for each Op operator in the SPARQL algebra.
>
> OpBGP is the operator for basic graph patterns; a basic graph pattern
> is a sequence of triple patterns. The default is to use Graph.find
> iteratively, one triple pattern at a time.
>
> The order of triple patterns matters for performance (you get the same
> answers regardless of the order of execution, but better orderings of
> the pattern traverse fewer negative/rejected intermediate solutions).
>
> Consider, in some general data, not CSV related:
>
> ?x :predicate ?value .  ## Triple pattern 1
> ?x :other 123 .         ## Triple pattern 2
>
> and suppose there are 1000 subjects, each with :predicate and :other
> triples, and suppose exactly one of them has the predicate-value pair
> ":other 123".
>
> If you execute in the order above, there are 1000 possibilities from
> the first triple pattern. With the iterative OpExecutor strategy, that
> then means a lot of tests of "?x :other 123", one for each possible ?x.
>
> But if the order is reversed, there is one match for "?x :other 123 ."
> and then one access for "?x :predicate ?value ." (?x is now a fixed
> value).
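The cost difference Andy describes can be reproduced with a toy matcher (plain Java, no Jena; the data, the find() signature, and the wildcard convention are all invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class JoinOrderSketch {

    record Triple(int s, String p, int o) {}

    static int findCalls = 0;

    // A toy find(): scan all triples matching (s, p, o), where -1 means
    // "any". Counts how often the store is probed.
    static List<Triple> find(List<Triple> data, int s, String p, int o) {
        findCalls++;
        List<Triple> out = new ArrayList<>();
        for (Triple t : data) {
            if ((s == -1 || t.s() == s) && t.p().equals(p)
                    && (o == -1 || t.o() == o)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 1000 subjects, each with :predicate and :other; only subject 7
        // carries ":other 123".
        List<Triple> data = new ArrayList<>();
        for (int x = 0; x < 1000; x++) {
            data.add(new Triple(x, ":predicate", x * 10));
            data.add(new Triple(x, ":other", x == 7 ? 123 : -999));
        }

        // Order 1: "?x :predicate ?value" first -> 1000 candidates, then
        // one probe per candidate to test "?x :other 123".
        findCalls = 0;
        for (Triple t : find(data, -1, ":predicate", -1)) {
            find(data, t.s(), ":other", 123);
        }
        System.out.println("predicate-first probes: " + findCalls); // 1001

        // Order 2: "?x :other 123" first -> 1 match, then one probe with
        // ?x already fixed.
        findCalls = 0;
        for (Triple t : find(data, -1, ":other", 123)) {
            find(data, t.s(), ":predicate", -1);
        }
        System.out.println("other-first probes: " + findCalls); // 2
    }
}
```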
> In the default setup (in-memory, and any storage that does not
> override it and add something better; see StageGeneratorGeneric), a
> fixed reordering is done based on finding the most grounded patterns.
> It should choose to do ":other 123" first.
>
> In the CSV case, there may be better things to do that are specific to
> CSV and the property table storage. This needs investigation and design
> before implementation; I don't have a definite answer.
>
> It will depend a lot on the data structures for the property tables -
> this is where all the pieces fit together.
>
> One possibility is avoiding generating triples. If the pattern is
>
> ?x :prop1 ?v .
> ?x :prop2 ?w .
>
> what's needed is bindings (a row in SPARQL results) for (?x, ?v, ?w).
> There is no need to create a lot of triples via find().
>
> Other patterns that might have optimizations are those where the
> subject is not the same in each pattern:
>
> ?x :prop1 ?y .
> ?y :prop2 ?w .
>
> Better execution of aggregates like SUM is also possible (in your
> example, it could be a matter of traversing the "Population" column
> adding up the values, avoiding creating any triples), but I'm thinking
> of basic graph patterns for the moment.
>
> Real CSV files are often denormalised tables - lots of repetition from
> row to row. This might be exploitable.
>
> There are some real-world examples documented in:
>
> http://www.w3.org/TR/csvw-ucr/
>
> with links to the real CSV files.
>
> To investigate, it's probably a good idea to understand how OpExecutor
> works, and also to think of some patterns and see what might be done
> with more complicated data (more columns, more rows).
>
>> Is there a tool in the Jena code base for parsing a String to a Jena
>> Literal for different data types?
>> For example:
>> 1) "65000" -> "65000"^^<http://www.w3.org/2001/XMLSchema#integer>
>> 2) "65000.123" -> "65000.123"^^<http://www.w3.org/2001/XMLSchema#double>
>> 3) "true" -> "true"^^<http://www.w3.org/2001/XMLSchema#boolean>
>> There could be more advanced features, like parsing dates/times in
>> CSV values to xsd:date or xsd:dateTime.
>> The NodeFactory.createLiteral() method does not provide these features.
>
> There is the parser for Turtle and related languages (although
> 65000.123 will be an xsd:decimal). See TokenizerText.
>
> You may be better off with your own code, using Long.decode or
> Long.parseLong, Double.decode or Double.parseDouble, because real data
> is messy and it could be useful to clean up the input first, for
> example, to remove "," [*].
>
>     Andy
>
> [*] Caution - locale sensitive!
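Hand-rolled datatype sniffing along the lines Andy suggests might look like this sketch (the method name, return shape, and fallback order are assumptions, not Jena API; the "," cleanup is deliberately naive and, as Andy's footnote warns, locale sensitive):

```java
public class CsvLiteralSketch {

    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Map a raw CSV cell to a (lexical form, datatype URI) pair, trying
    // boolean, then integer, then double, and falling back to a plain string.
    static String[] sniff(String raw) {
        String s = raw.trim();
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false")) {
            return new String[]{s.toLowerCase(), XSD + "boolean"};
        }
        String cleaned = s.replace(",", "");   // naive cleanup; locale sensitive!
        try {
            return new String[]{Long.toString(Long.parseLong(cleaned)), XSD + "integer"};
        } catch (NumberFormatException ignored) { }
        try {
            return new String[]{Double.toString(Double.parseDouble(cleaned)), XSD + "double"};
        } catch (NumberFormatException ignored) { }
        return new String[]{raw, XSD + "string"};
    }

    public static void main(String[] args) {
        for (String cell : new String[]{"65000", "65,000", "65000.123", "true", "Southton"}) {
            String[] lit = sniff(cell);
            System.out.println(cell + " -> \"" + lit[0] + "\"^^<" + lit[1] + ">");
        }
    }
}
```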