[jira] [Commented] (JENA-1894) Insert-order preserving dataset

Claus Stadler (Jira) Wed, 07 Oct 2020 09:36:25 -0700


    [ 
https://issues.apache.org/jira/browse/JENA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209663#comment-17209663
 ]


Claus Stadler commented on JENA-1894:
-------------------------------------

Hi Adam,

Thank you for your clarifications - good to have the confirmation that I am not 
reinventing the wheel here :)

As for the tentris / tensor approach: the math looks horribly complicated, but 
the idea is quite simple. The main difference is that the tensor approach is 
variable-centric rather than triple pattern-centric. Using the hypertrie index, 
one can quickly find the set of values a variable can take in a specific triple 
pattern where some components are constrained to constants. If a variable 
occurs in one or more triple patterns, one can form the conjunction (i.e. the 
intersection) of those sets of values. Each value in this conjunction is a 
candidate contribution to the subset of the final set of bindings that has the 
variable bound to that value.

The process recursively picks a variable, creates the conjunction of values 
based on the hypertrie index, and for each value this is recursively repeated 
until all variables are bound. So it's like a cartesian product over the 
variables' values, guided by the index. 
This also means that triple pattern order does not matter, but the order in 
which variables are picked does.
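
To make the recursion concrete, here is a minimal sketch in plain Java. It is 
not the tentris implementation: a linear scan over a list of triples stands in 
for the hypertrie index, terms are plain strings, and all names 
(VariableCentricJoin, candidates, evaluate, the sample data) are made up for 
illustration. It assumes every triple pattern contains at least one variable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of variable-centric BGP evaluation; a list scan replaces the hypertrie. */
final class VariableCentricJoin {

    /** Values that var can take in pattern, given the variables bound so far. */
    static Set<String> candidates(List<String[]> triples, String[] pattern,
                                  String var, Map<String, String> binding) {
        Set<String> result = new LinkedHashSet<>();
        for (String[] t : triples) {
            String value = null;
            boolean ok = true;
            for (int i = 0; i < 3 && ok; i++) {
                String term = pattern[i];
                if (term.equals(var)) {
                    // the variable being resolved: all of its positions must agree
                    if (value == null) value = t[i]; else ok = value.equals(t[i]);
                } else if (term.startsWith("?")) {
                    // another variable: must match if already bound, wildcard otherwise
                    String bound = binding.get(term);
                    ok = bound == null || bound.equals(t[i]);
                } else {
                    ok = term.equals(t[i]); // constant
                }
            }
            if (ok && value != null) result.add(value);
        }
        return result;
    }

    /** Binds vars[depth], then recurses until every variable is bound. */
    static void solve(List<String[]> triples, List<String[]> patterns, List<String> vars,
                      int depth, Map<String, String> binding, List<Map<String, String>> out) {
        if (depth == vars.size()) {
            out.add(new LinkedHashMap<>(binding));
            return;
        }
        String var = vars.get(depth);
        Set<String> values = null;
        for (String[] p : patterns) {
            if (!Arrays.asList(p).contains(var)) continue;
            Set<String> c = candidates(triples, p, var, binding);
            // conjunction = intersection of the per-pattern value sets
            if (values == null) values = c; else values.retainAll(c);
        }
        if (values == null) return; // variable occurs in no pattern
        for (String v : values) {
            binding.put(var, v);
            solve(triples, patterns, vars, depth + 1, binding, out);
            binding.remove(var);
        }
    }

    static List<Map<String, String>> evaluate(List<String[]> triples, List<String[]> patterns,
                                              List<String> variableOrder) {
        List<Map<String, String>> out = new ArrayList<>();
        solve(triples, patterns, variableOrder, 0, new LinkedHashMap<>(), out);
        return out;
    }

    static List<String[]> sampleTriples() {
        return Arrays.asList(
            new String[] {"alice", "knows", "bob"},
            new String[] {"bob", "knows", "carol"},
            new String[] {"alice", "age", "30"},
            new String[] {"bob", "age", "25"},
            new String[] {"carol", "age", "41"});
    }

    /** { ?x knows ?y . ?y age ?a } */
    static List<String[]> samplePatterns() {
        return Arrays.asList(
            new String[] {"?x", "knows", "?y"},
            new String[] {"?y", "age", "?a"});
    }

    public static void main(String[] args) {
        for (Map<String, String> b : evaluate(sampleTriples(), samplePatterns(),
                                              Arrays.asList("?x", "?y", "?a"))) {
            System.out.println(b);
        }
    }
}
```

Passing the variables in a different order, e.g. ?a, ?y, ?x, yields the same 
set of bindings but explores different intermediate candidates, which is why 
the variable order matters for performance.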

If a query uses DISTINCT, then query evaluation can take some short-cuts. For 
example, once the recursion has found a candidate binding for all distinguished 
variables, it can yield the current binding and continue with the next candidate 
binding of the distinguished variables, provided the remaining 
non-distinguished variables are unconstrained (neither joining nor filtered). 
So there may be no need to iterate over all the values of the 
non-distinguished variables.
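
A rough sketch of this short-cut in plain Java follows. Again a list scan 
stands in for the hypertrie index, and every name here (DistinctShortcut, 
selectDistinct, bindingsTried, the sample data) is hypothetical. Two things 
are my own assumptions rather than part of the description above: a variable 
counts as "unconstrained" if it occurs exactly once across all patterns (and 
every pattern mentions a distinguished variable), and when that does not hold 
the sketch falls back to an existence check that stops at the first completion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the DISTINCT short-cut; filters are not modelled. */
final class DistinctShortcut {
    private final List<String[]> triples;
    private final List<String[]> patterns;
    int bindingsTried = 0; // number of variable assignments explored, for illustration

    DistinctShortcut(List<String[]> triples, List<String[]> patterns) {
        this.triples = triples;
        this.patterns = patterns;
    }

    /** Intersection of the values var can take in every pattern that mentions it. */
    private Set<String> values(String var, Map<String, String> binding) {
        Set<String> result = null;
        for (String[] p : patterns) {
            if (!Arrays.asList(p).contains(var)) continue;
            Set<String> c = new LinkedHashSet<>();
            for (String[] t : triples) {
                String value = null;
                boolean ok = true;
                for (int i = 0; i < 3 && ok; i++) {
                    if (p[i].equals(var)) {
                        if (value == null) value = t[i]; else ok = value.equals(t[i]);
                    } else if (p[i].startsWith("?")) {
                        String bound = binding.get(p[i]);
                        ok = bound == null || bound.equals(t[i]);
                    } else {
                        ok = p[i].equals(t[i]);
                    }
                }
                if (ok) c.add(value);
            }
            if (result == null) result = c; else result.retainAll(c);
        }
        return result == null ? new LinkedHashSet<>() : result;
    }

    /** Assumption: a variable that occurs exactly once overall joins nothing. */
    private boolean unconstrained(String var) {
        int n = 0;
        for (String[] p : patterns) for (String term : p) if (term.equals(var)) n++;
        return n == 1;
    }

    /** Fallback: is there at least one completion for vars[depth..]? Stops at the first hit. */
    private boolean exists(List<String> vars, int depth, Map<String, String> binding) {
        if (depth == vars.size()) return true;
        String var = vars.get(depth);
        for (String v : values(var, binding)) {
            bindingsTried++;
            binding.put(var, v);
            boolean found = exists(vars, depth + 1, binding);
            binding.remove(var);
            if (found) return true;
        }
        return false;
    }

    List<Map<String, String>> selectDistinct(List<String> distinguished, List<String> others) {
        // The short-cut applies if no remaining variable joins anything and every
        // pattern was already checked while binding a distinguished variable.
        boolean skippable = others.stream().allMatch(this::unconstrained)
            && patterns.stream().allMatch(p -> Arrays.stream(p).anyMatch(distinguished::contains));
        List<Map<String, String>> out = new ArrayList<>();
        enumerate(distinguished, others, skippable, 0, new LinkedHashMap<>(), out);
        return out;
    }

    private void enumerate(List<String> distinguished, List<String> others, boolean skippable,
                           int depth, Map<String, String> binding, List<Map<String, String>> out) {
        if (depth == distinguished.size()) {
            // yield without iterating the non-distinguished variables when possible
            if (skippable || exists(others, 0, binding)) out.add(new LinkedHashMap<>(binding));
            return;
        }
        String var = distinguished.get(depth);
        for (String v : values(var, binding)) {
            bindingsTried++;
            binding.put(var, v);
            enumerate(distinguished, others, skippable, depth + 1, binding, out);
            binding.remove(var);
        }
    }

    static List<String[]> sampleTriples() {
        return Arrays.asList(
            new String[] {"alice", "knows", "bob"},
            new String[] {"alice", "knows", "carol"},
            new String[] {"alice", "knows", "dave"},
            new String[] {"bob", "knows", "carol"});
    }
}
```

For SELECT DISTINCT ?x WHERE { ?x knows ?y } over the sample data, this only 
has to try the candidate values of ?x and never iterates the values of ?y.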

I hope my descriptions help in understanding the concept :)

I will draft a document describing the concept and the current state of my 
implementation by next week, including the strengths and weaknesses that I see, 
as a basis for further discussion. Right now I am setting up some correctness 
tests using ResultSetCompare to ensure that my benchmark results are not bogus :)




> Insert-order preserving dataset
> -------------------------------
>
>                 Key: JENA-1894
>                 URL: https://issues.apache.org/jira/browse/JENA-1894
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.14.0
>            Reporter: Claus Stadler
>            Priority: Major
>
> To the best of my knowledge, there is no backend for datasets that retains 
> insert order.
>  This feature is particularly useful when changing RDF files in a git 
> repository, as it makes for nice commits. An insert-order preserving 
> Triple/QuadTable implementation enables:
>  * Writing (subject-grouped) RDF files or events from an RDF stream out in 
> nearly the same way they were read in - this makes it easier to compare 
> outputs of data transformations
>  * Combining ORDER BY with CONSTRUCT queries:
> {code:java}
> Dataset ds = DatasetFactory.createOrderPreservingDataset();
> QueryExecutionFactory.create("CONSTRUCT WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o", ds);
> RDFDataMgr.write(System.out, ds, RDFFormat.TURTLE_BLOCKS);
> {code}
> I created an implementation for this some time ago; the main classes of the 
> machinery are:
>  * 
> [QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L26]
>  * In addition, I created a lazy (but adequate?) wrapper for re-using a quad 
> table as a triple table:
>  
> [TripleTableFromQuadTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/TripleTableFromQuadTable.java#L30]
>  * The DatasetGraph wrapper:
>  
> [DatasetGraphQuadsImpl.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/DatasetGraphQuadsImpl.java#L32]
> The actual factory code then uses:
> {code:java}
>     public static DatasetGraph createOrderPreservingDatasetGraph() {
>         QuadTable quadTable = new QuadTableFromNestedMaps();
>         TripleTable tripleTable = new TripleTableFromQuadTable(quadTable);
>         DatasetGraph result = new DatasetGraphInMemory(quadTable, tripleTable);
>         return result;
>     }
> {code}
> Note that DatasetGraphQuadsImpl at present falsely claims to be 
> transaction-aware, because otherwise any SPARQL insert caused an exception 
> (I have not tried with the latest fixes for 3.15.0-SNAPSHOT yet). In any 
> case, for the use case of writing out RDF, transactions may not even be 
> necessary, but if there is an easy way to add them, then it should be done.
> An example of the above code in action is here: [Git Diff based on ordered 
> turtle-blocks output 
> |https://github.com/SmartDataAnalytics/lodservatory/commit/ec50cd33230a771c557c1ed2751799401ea3fd89]
> The downside of using this kind of order-preserving dataset is that 
> essentially it only features a gspo index. Hence, the performance 
> characteristics of this kind of order-preserving dataset - which is intended 
> mostly for serialization or presentation - vary greatly from those of the 
> query-optimized implementations.
> In any case, order preserving datasets are a highly useful feature for Jena 
> and I'd gladly contribute a PR for that. My main questions are:
>  * What should the factory methods in DatasetFactory, DatasetGraphFactory etc 
> be called - createOrderPreservingDataset?
>  * Is the approach using QuadTableFromNestedMaps needed - or can a different 
> implementation of QuadTable be repurposed?
>  * It seems that the abstract class DatasetGraphQuads does not have any 
> implementation at least in ARQ and the jena modules I use (according to 
> eclipse) - so my custom implementation of DatasetGraphQuadsImpl seems to be 
> needed, or is there a similar class lying around in another jena package?
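
As a small illustration of the nested-map idea from the issue description 
quoted above, the following plain-Java sketch keeps quads in nested 
LinkedHashMaps keyed graph, subject, predicate, object. It is not the linked 
QuadTableFromNestedMaps code and does not use Jena types; the class and method 
names are hypothetical and terms are plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a single gspo index that iterates in first-insertion order. */
final class InsertOrderQuadIndex {
    // g -> s -> p -> {o}; LinkedHashMap / LinkedHashSet iterate in insertion order
    private final Map<String, Map<String, Map<String, Set<String>>>> gspo = new LinkedHashMap<>();

    /** Adds a quad; returns false if it was already present. */
    boolean add(String g, String s, String p, String o) {
        return gspo.computeIfAbsent(g, k -> new LinkedHashMap<>())
                   .computeIfAbsent(s, k -> new LinkedHashMap<>())
                   .computeIfAbsent(p, k -> new LinkedHashSet<>())
                   .add(o);
    }

    /** Lists all quads as "g s p o" strings, grouped by graph, subject and predicate. */
    List<String> list() {
        List<String> out = new ArrayList<>();
        gspo.forEach((g, bySubject) ->
            bySubject.forEach((s, byPredicate) ->
                byPredicate.forEach((p, objects) ->
                    objects.forEach(o -> out.add(g + " " + s + " " + p + " " + o)))));
        return out;
    }
}
```

Because quads are grouped under the first occurrence of their graph and 
subject, the output order is only nearly the insert order: a later quad for an 
earlier subject is listed together with that subject. Lookups that do not 
start with the graph and subject have to scan the nested maps, which is the 
single-index downside mentioned in the description.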



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
