Dear Andy,

I have made an implementation of PropertyTable using a two-dimensional array [1]. The old HashMap implementation is not deleted; I'll use it for the performance tests. I hope the array implementation will do well, with lower memory consumption and query response time, especially for the query-a-whole-row case.
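To make the trade-off concrete, here is a minimal, hypothetical sketch of the two-dimensional-array idea in plain Java. The class and method names are illustrative only, not the actual PropertyTableArrayImpl API: the point is that a dense row array makes whole-row access a single index lookup, instead of one HashMap probe per property.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch (not the jena-csv API): cell values live in a
 * dense row-major structure, so fetching a whole row is one array
 * access rather than a map lookup per property.
 */
class ArrayPropertyTableSketch {
    private final String[] columns;              // property (column) names, fixed at load time
    private final List<String[]> rows = new ArrayList<>();

    ArrayPropertyTableSketch(String... columns) {
        this.columns = columns;
    }

    void addRow(String... values) {
        if (values.length != columns.length)
            throw new IllegalArgumentException("row width mismatch");
        rows.add(values);
    }

    /** Whole-row access: one index, no per-property hashing. */
    String[] getRow(int rowIndex) {
        return rows.get(rowIndex);
    }

    /** Single-cell access by column name (linear scan over the header). */
    String getValue(int rowIndex, String column) {
        return rows.get(rowIndex)[Arrays.asList(columns).indexOf(column)];
    }
}
```

A memory comparison against the HashMap version would, of course, be part of the performance tests mentioned above.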
Best regards,
Ying Jiang

[1] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/impl/PropertyTableArrayImpl.java

On Wed, Jul 9, 2014 at 10:05 AM, Ying Jiang <jpz6311...@gmail.com> wrote:
> Dear Andy,
>
> Thanks a lot! I've been studying the source code of the SPARQL algebra
> following your instructions these days.
>
> I've made a try-out of the optimization of querying a whole row in the
> property table. The following query pattern can be transformed into a
> row query, without generating triples:
> ?x :prop1 ?v .
> ?x :prop2 ?w .
> ...
>
> It's done using the extension point of StageGenerator, because that is
> only concerned with BasicPattern. The detailed workflow goes this way:
> 1. Split the incoming BasicPattern by subject, i.e. it becomes multiple
> sub-BasicPatterns grouped by the same subjects (see
> QueryIterPropertyTable [1]).
> 2. For each sub-BasicPattern:
> if the triple size within is greater than 1 (i.e. at least 2 triples),
> it's turned into a row query, processed by QueryIterPropertyTableRow [2];
> else, if it contains only 1 triple, it goes through the traditional
> triple querying via graph.graphBaseFind().
>
> The code has been committed. I'll think about other optimization ways
> and test with the real-life large CSV data you provided later on. Any
> suggestions are very welcome!
>
> Best regards,
> Ying Jiang
>
> [1] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/impl/QueryIterPropertyTable.java
> [2] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/impl/QueryIterPropertyTableRow.java
>
> On Mon, Jun 30, 2014 at 9:47 PM, Andy Seaborne <a...@seaborne.org> wrote:
>> Hi Ying,
>>
>>> ---- Quote from the project proposal:
>>> Accessing from SPARQL via Graph.find() will work, but it's not ideal.
>>> Some optimizations can be done for processing a SPARQL basic graph
>>> pattern. More explicitly, in the method OpExecutor.execute(OpBGP, ...),
>>> when the target for the query is a GraphPropertyTable, it can get a
>>> whole row, or rows, of the table data and match the pattern against
>>> the bindings.
>>> ----
>>> Can you show me a concrete example of how "getting a whole row, or
>>> rows" can improve a query (e.g. SUM/COUNT)? Suppose we have the data:
>>> ----
>>> Town,Population
>>> Southton,123000
>>> Northville,654000
>>> ----
>>> If we want to query the total population of the (two or all) towns,
>>> how does changing OpExecutor.execute(OpBGP, ...) help? Here's the
>>> query string:
>>> ----
>>> SELECT (sum(?pop) AS ?totalPop) {
>>>   GRAPH <file:///c:/town.csv> {
>>>     ?x :Population ?pop .
>>>   }
>>> }
>>> ----
>>> If the above example is not appropriate for this case (2.3), could
>>> you please show me another one? I'm very grateful if you can help me
>>> with this question before your holiday.
>>
>> OpExecutor is the way SPARQL is evaluated. The query is parsed, turned
>> into SPARQL algebra (the Op* classes), optimized (yet more Op* classes,
>> and rearranging of the ones in the query), then evaluated. There is one
>> execute(Op...) for each Op operator in the SPARQL algebra.
>>
>> OpBGP is the operator for basic graph patterns; a basic graph pattern
>> is a sequence of triple patterns. The default is to iteratively use
>> Graph.find, one triple pattern at a time.
>>
>> The order of triple patterns matters for performance (you get the same
>> answers regardless of the order of execution; better orderings of the
>> pattern traverse fewer negative/rejected intermediate results).
>>
>> Consider, in some general data, not CSV related:
>>
>> ?x :predicate ?value . ## Triple pattern 1
>> ?x :other 123 .        ## Triple pattern 2
>>
>> and suppose there are 1000 subjects, each with :predicate and :other
>> triples.
>> And suppose there is exactly one subject with the predicate-value pair
>> ":other 123".
>>
>> If you do the execution in the order above, there are 1000 possibilities
>> from the first triple pattern. If the iterative OpExecutor strategy is
>> used, it's then a lot of tests of "?x :other 123" for each possible ?x.
>>
>> But if the order is reversed, there is one match for "?x :other 123 ."
>> and then one access for "?x :predicate ?value ." (?x is now a fixed
>> value).
>>
>> In the default setup (in-memory, and any storage that does not override
>> and add something better; see StageGeneratorGeneric), a fixed reordering
>> is done based on finding the most grounded patterns. It should choose to
>> do ":other 123" first.
>>
>> In the CSV case, there may be better things to do that are specific to
>> CSV and the property table storage. This needs investigation and design
>> before implementation; I don't have a definite answer.
>>
>> It will depend a lot on the data structures for the property tables -
>> this is where all the pieces fit together.
>>
>> One possibility is avoiding generating triples. If the pattern is
>>
>> ?x :prop1 ?v .
>> ?x :prop2 ?w .
>>
>> what's needed is bindings (a row in SPARQL results) for (?x, ?v, ?w).
>> There is no need to create a lot of triples via find().
>>
>> Other patterns that might have optimizations are ones where the subject
>> is not the same in each pattern:
>>
>> ?x :prop1 ?y .
>> ?y :prop2 ?w .
>>
>> Better execution of aggregates like SUM is also possible (in your
>> example, it could be a matter of traversing the "Population" column,
>> adding up the values, avoiding creating any triples) but I'm thinking
>> of basic graph patterns for the moment.
>>
>> Real CSV files are often denormalised tables - lots of repetition from
>> row to row. This might be exploitable.
>>
>> There are some real-world examples documented in:
>>
>> http://www.w3.org/TR/csvw-ucr/
>>
>> with links to the real CSV files.
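The "avoid generating triples" idea for a same-subject pattern can be sketched in plain Java. Everything here is illustrative (Jena's Binding and QueryIterator machinery is simplified to maps and lists): one pass over the table rows yields one result binding (?x, ?v, ?w) per row, with no intermediate Triple objects and no join.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch (not the jena-csv API) of evaluating a same-subject
 * pattern like { ?x :prop1 ?v . ?x :prop2 ?w } directly against a property
 * table: each table row already holds all property values for one subject.
 */
class RowBindingSketch {

    /** rows: subject -> (property -> value); propertyToVar: e.g. :prop1 -> "v". */
    static List<Map<String, String>> bindWholeRows(
            Map<String, Map<String, String>> rows,
            Map<String, String> propertyToVar,
            String subjectVar) {
        List<Map<String, String>> results = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            Map<String, String> binding = new LinkedHashMap<>();
            binding.put(subjectVar, row.getKey());   // bind ?x to the row's subject
            boolean matches = true;
            for (Map.Entry<String, String> pv : propertyToVar.entrySet()) {
                String value = row.getValue().get(pv.getKey());
                if (value == null) { matches = false; break; }  // row lacks this property
                binding.put(pv.getValue(), value);   // bind ?v, ?w, ... from the row
            }
            if (matches) results.add(binding);       // one result row per table row
        }
        return results;
    }
}
```

Compared with the find()-based default, the work per table row is constant in the number of patterns, and no triples are materialised only to be joined away again.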
>>
>> To investigate, it's probably a good idea to understand how OpExecutor
>> works, and also to think of some patterns and see what might be done.
>> Try more complicated data (more columns, more rows).
>>
>>> Is there a tool in the Jena code base for parsing a String to a Jena
>>> Literal for different data types? For example:
>>> 1) "65000" -> "65000"^^<http://www.w3.org/2001/XMLSchema#integer>
>>> 2) "65000.123" -> "65000.123"^^<http://www.w3.org/2001/XMLSchema#double>
>>> 3) "true" -> "true"^^<http://www.w3.org/2001/XMLSchema#boolean>
>>> There would be more advanced features like parsing date/time to
>>> xsd:date or xsd:dateTime for CSV values.
>>> The NodeFactory.createLiteral() method does not provide these features.
>>
>> There is the parser for Turtle and related languages (although
>> 65000.123 will be an xsd:decimal). See TokenizerText.
>>
>> You may be better off with your own code, using Long.decode or
>> Long.parseLong, Double.decode or Double.parseDouble, because real data
>> is messy and it could be useful to clean up the input first, for
>> example, to remove "," [*].
>>
>> Andy
>>
>> [*] Caution - locale sensitive!
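The datatype-sniffing approach Andy suggests might look like the sketch below. The `^^xsd:...` suffixes are just illustrative return labels; real code would build typed literals (e.g. via NodeFactory with an XSD datatype) instead of strings, and the `","` cleanup is locale-sensitive, as the footnote warns.

```java
/**
 * Illustrative sketch of guessing an XSD datatype for a raw CSV cell:
 * try integer, then double, then boolean, else fall back to a plain
 * string. Return values are display labels, not real Jena literals.
 */
class LiteralSniffSketch {
    static String sniff(String raw) {
        String s = raw.trim().replace(",", "");  // caution: "," cleanup is locale-sensitive
        try {
            long v = Long.parseLong(s);
            return "\"" + v + "\"^^xsd:integer";
        } catch (NumberFormatException ignored) { }
        try {
            Double.parseDouble(s);               // only checking parseability here
            return "\"" + s + "\"^^xsd:double";
        } catch (NumberFormatException ignored) { }
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false"))
            return "\"" + s.toLowerCase() + "\"^^xsd:boolean";
        return "\"" + raw + "\"";                // plain string literal
    }
}
```

The try-in-order design means ambiguous inputs resolve to the narrowest type that parses, which matches the examples in the question; date/time sniffing would slot in as further branches before the string fallback.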