Dear Andy,

I have made an implementation of PropertyTable using a two-dimensional array [1]. The old HashMap implementation is not deleted; I'll use it for the performance tests. I hope the array implementation will do well, with lower memory consumption and query response time, especially for the query-a-whole-row case.
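To make the trade-off concrete, here is a minimal, hypothetical sketch of the two-dimensional-array idea in plain Java. The class and method names are illustrative only, not the actual PropertyTableArrayImpl API: the point is that a dense row array makes whole-row access a single index lookup, instead of one HashMap probe per property.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch (not the jena-csv API): cell values live in a
 * dense row-major structure, so fetching a whole row is one array
 * access rather than a map lookup per property.
 */
class ArrayPropertyTableSketch {
    private final String[] columns;              // property (column) names, fixed at load time
    private final List<String[]> rows = new ArrayList<>();

    ArrayPropertyTableSketch(String... columns) {
        this.columns = columns;
    }

    void addRow(String... values) {
        if (values.length != columns.length)
            throw new IllegalArgumentException("row width mismatch");
        rows.add(values);
    }

    /** Whole-row access: one index, no per-property hashing. */
    String[] getRow(int rowIndex) {
        return rows.get(rowIndex);
    }

    /** Single-cell access by column name (linear scan over the header). */
    String getValue(int rowIndex, String column) {
        return rows.get(rowIndex)[Arrays.asList(columns).indexOf(column)];
    }
}
```

A memory comparison against the HashMap version would, of course, be part of the performance tests mentioned above.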
Best regards,
Ying Jiang

[1] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/impl/PropertyTableArrayImpl.java

On Wed, Jul 9, 2014 at 10:05 AM, Ying Jiang <jpz6311...@gmail.com> wrote:
> Dear Andy,
>
> Thanks a lot! I've been studying the source code of the SPARQL algebra
> following your instructions these days.
>
> I've made a try-out of the optimization of querying a whole row in the
> property table. The following query pattern can be transformed into a
> row query, without generating triples:
> ?x :prop1 ?v .
> ?x :prop2 ?w .
> ...
>
> It's done using the extension point of StageGenerator, because that is
> only concerned with BasicPattern. The detailed workflow goes this way:
> 1. Split the incoming BasicPattern by subject, i.e. it becomes multiple
> sub-BasicPatterns grouped by the same subjects (see
> QueryIterPropertyTable [1]).
> 2. For each sub-BasicPattern:
> if the triple size within is greater than 1 (i.e. at least 2 triples),
> it's turned into a row query, processed by QueryIterPropertyTableRow [2];
> else, if it contains only 1 triple, it goes through the traditional
> triple querying via graph.graphBaseFind().
>
> The code has been committed. I'll think about other optimization ways
> and test with the real-life large CSV data you provided later on. Any
> suggestions are very welcome!
>
> Best regards,
> Ying Jiang
>
> [1] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/impl/QueryIterPropertyTable.java
> [2] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/impl/QueryIterPropertyTableRow.java
>
> On Mon, Jun 30, 2014 at 9:47 PM, Andy Seaborne <a...@seaborne.org> wrote:
>> Hi Ying,
>>
>>> ---- Quote from the project proposal:
>>> Accessing from SPARQL via Graph.find() will work, but it's not ideal.
>>> Some optimizations can be done for processing a SPARQL basic graph
>>> pattern. More explicitly, in the method OpExecutor.execute(OpBGP, ...),
>>> when the target for the query is a GraphPropertyTable, it can get a
>>> whole row, or rows, of the table data and match the pattern against
>>> the bindings.
>>> ----
>>> Can you show me a concrete example of how "getting a whole row, or
>>> rows" can improve a query (e.g. SUM/COUNT)? Suppose we have the data:
>>> ----
>>> Town,Population
>>> Southton,123000
>>> Northville,654000
>>> ----
>>> If we want to query the total population of the (two or all) towns,
>>> how does changing OpExecutor.execute(OpBGP, ...) help? Here's the
>>> query string:
>>> ----
>>> SELECT (sum(?pop) AS ?totalPop) {
>>>   GRAPH <file:///c:/town.csv> {
>>>     ?x :Population ?pop .
>>>   }
>>> }
>>> ----
>>> If the above example is not appropriate for this case (2.3), could
>>> you please show me another one? I'm very grateful if you can help me
>>> with this question before your holiday.
>>
>> OpExecutor is the way SPARQL is evaluated. The query is parsed, turned
>> into SPARQL algebra (the Op* classes), optimized (yet more Op* classes,
>> and rearranging of the ones in the query), then evaluated. There is one
>> execute(Op...) for each Op operator in the SPARQL algebra.
>>
>> OpBGP is the operator for basic graph patterns; a basic graph pattern
>> is a sequence of triple patterns. The default is to iteratively use
>> Graph.find, one triple pattern at a time.
>>
>> The order of triple patterns matters for performance (you get the same
>> answers regardless of the order of execution; better orderings of the
>> pattern traverse fewer negative/rejected intermediate results).
>>
>> Consider, in some general data, not CSV related:
>>
>> ?x :predicate ?value . ## Triple pattern 1
>> ?x :other 123 .        ## Triple pattern 2
>>
>> and suppose there are 1000 subjects, each with :predicate and :other
>> triples.
>> And suppose there is exactly one subject with the predicate-value pair
>> ":other 123".
>>
>> If you do the execution in the order above, there are 1000 possibilities
>> from the first triple pattern. If the iterative OpExecutor strategy is
>> used, it's then a lot of tests of "?x :other 123" for each possible ?x.
>>
>> But if the order is reversed, there is one match for "?x :other 123 ."
>> and then one access for "?x :predicate ?value ." (?x is now a fixed
>> value).
>>
>> In the default setup (in-memory, and any storage that does not override
>> and add something better; see StageGeneratorGeneric), a fixed reordering
>> is done based on finding the most grounded patterns. It should choose to
>> do ":other 123" first.
>>
>> In the CSV case, there may be better things to do that are specific to
>> CSV and the property table storage. This needs investigation and design
>> before implementation; I don't have a definite answer.
>>
>> It will depend a lot on the data structures for the property tables -
>> this is where all the pieces fit together.
>>
>> One possibility is avoiding generating triples. If the pattern is
>>
>> ?x :prop1 ?v .
>> ?x :prop2 ?w .
>>
>> what's needed is bindings (a row in SPARQL results) for (?x, ?v, ?w).
>> There is no need to create a lot of triples via find().
>>
>> Other patterns that might have optimizations are ones where the subject
>> is not the same in each pattern:
>>
>> ?x :prop1 ?y .
>> ?y :prop2 ?w .
>>
>> Better execution of aggregates like SUM is also possible (in your
>> example, it could be a matter of traversing the "Population" column,
>> adding up the values, avoiding creating any triples) but I'm thinking
>> of basic graph patterns for the moment.
>>
>> Real CSV files are often denormalised tables - lots of repetition from
>> row to row. This might be exploitable.
>>
>> There are some real-world examples documented in:
>>
>> http://www.w3.org/TR/csvw-ucr/
>>
>> with links to the real CSV files.
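The "avoid generating triples" idea for a same-subject pattern can be sketched in plain Java. Everything here is illustrative (Jena's Binding and QueryIterator machinery is simplified to maps and lists): one pass over the table rows yields one result binding (?x, ?v, ?w) per row, with no intermediate Triple objects and no join.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch (not the jena-csv API) of evaluating a same-subject
 * pattern like { ?x :prop1 ?v . ?x :prop2 ?w } directly against a property
 * table: each table row already holds all property values for one subject.
 */
class RowBindingSketch {

    /** rows: subject -> (property -> value); propertyToVar: e.g. :prop1 -> "v". */
    static List<Map<String, String>> bindWholeRows(
            Map<String, Map<String, String>> rows,
            Map<String, String> propertyToVar,
            String subjectVar) {
        List<Map<String, String>> results = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            Map<String, String> binding = new LinkedHashMap<>();
            binding.put(subjectVar, row.getKey());   // bind ?x to the row's subject
            boolean matches = true;
            for (Map.Entry<String, String> pv : propertyToVar.entrySet()) {
                String value = row.getValue().get(pv.getKey());
                if (value == null) { matches = false; break; }  // row lacks this property
                binding.put(pv.getValue(), value);   // bind ?v, ?w, ... from the row
            }
            if (matches) results.add(binding);       // one result row per table row
        }
        return results;
    }
}
```

Compared with the find()-based default, the work per table row is constant in the number of patterns, and no triples are materialised only to be joined away again.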
>>
>> To investigate, it's probably a good idea to understand how OpExecutor
>> works, and also to think of some patterns and see what might be done.
>> Try more complicated data (more columns, more rows).
>>
>>> Is there a tool in the Jena code base for parsing a String to a Jena
>>> Literal for different data types? For example:
>>> 1) "65000" -> "65000"^^<http://www.w3.org/2001/XMLSchema#integer>
>>> 2) "65000.123" -> "65000.123"^^<http://www.w3.org/2001/XMLSchema#double>
>>> 3) "true" -> "true"^^<http://www.w3.org/2001/XMLSchema#boolean>
>>> There would be more advanced features like parsing date/time to
>>> xsd:date or xsd:dateTime for CSV values.
>>> The NodeFactory.createLiteral() method does not provide these features.
>>
>> There is the parser for Turtle and related languages (although
>> 65000.123 will be an xsd:decimal). See TokenizerText.
>>
>> You may be better off with your own code, using Long.decode or
>> Long.parseLong, Double.decode or Double.parseDouble, because real data
>> is messy and it could be useful to clean up the input first, for
>> example, to remove "," [*].
>>
>> Andy
>>
>> [*] Caution - locale sensitive!
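The datatype-sniffing approach Andy suggests might look like the sketch below. The `^^xsd:...` suffixes are just illustrative return labels; real code would build typed literals (e.g. via NodeFactory with an XSD datatype) instead of strings, and the `","` cleanup is locale-sensitive, as the footnote warns.

```java
/**
 * Illustrative sketch of guessing an XSD datatype for a raw CSV cell:
 * try integer, then double, then boolean, else fall back to a plain
 * string. Return values are display labels, not real Jena literals.
 */
class LiteralSniffSketch {
    static String sniff(String raw) {
        String s = raw.trim().replace(",", "");  // caution: "," cleanup is locale-sensitive
        try {
            long v = Long.parseLong(s);
            return "\"" + v + "\"^^xsd:integer";
        } catch (NumberFormatException ignored) { }
        try {
            Double.parseDouble(s);               // only checking parseability here
            return "\"" + s + "\"^^xsd:double";
        } catch (NumberFormatException ignored) { }
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false"))
            return "\"" + s.toLowerCase() + "\"^^xsd:boolean";
        return "\"" + raw + "\"";                // plain string literal
    }
}
```

The try-in-order design means ambiguous inputs resolve to the narrowest type that parses, which matches the examples in the question; date/time sniffing would slot in as further branches before the string fallback.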