[jira] [Commented] (JENA-1894) Insert-order preserving dataset

Claus Stadler (Jira) Thu, 17 Sep 2020 10:19:13 -0700


    [ 
https://issues.apache.org/jira/browse/JENA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197849#comment-17197849
 ]


Claus Stadler commented on JENA-1894:
-------------------------------------

I have now finished the generalized storage system, which allows one to assemble 
the way quads and triples are stored from standard Java collections - most 
prominently LinkedHashMaps and LinkedHashSets, but also TreeSets (in case 
predicates should be internally stored ordered by name).

This way one can customize to which extent the storage should be insert-order 
aware/preserving. Note that using nested LinkedHashMaps does not preserve the 
exact insert order - however, the data will be stored in the order in which 
certain components - such as graphs or subjects - were first encountered.
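
The difference between exact insert order and first-encounter order can be 
illustrated with plain Java collections. The following is a minimal sketch, 
not code from the linked branch: the class name FirstEncounterOrderDemo and the 
modelling of triples as String arrays are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class; triples are modelled as plain String arrays {s, p, o}.
class FirstEncounterOrderDemo {

    /** Stores triples in nested LinkedHashMaps (s -> p -> set of o) and reads them back. */
    static List<String> viaNestedMaps(List<String[]> input) {
        Map<String, Map<String, Set<String>>> index = new LinkedHashMap<>();
        for (String[] t : input) {
            index.computeIfAbsent(t[0], s -> new LinkedHashMap<>())
                 .computeIfAbsent(t[1], p -> new LinkedHashSet<>())
                 .add(t[2]);
        }
        List<String> out = new ArrayList<>();
        index.forEach((s, byPredicate) ->
            byPredicate.forEach((p, objects) ->
                objects.forEach(o -> out.add(s + " " + p + " " + o))));
        return out;
    }

    /** Stores triples in a flat LinkedHashSet, which keeps the exact insert order. */
    static List<String> viaLinkedHashSet(List<String[]> input) {
        Set<String> flat = new LinkedHashSet<>();
        for (String[] t : input) {
            flat.add(t[0] + " " + t[1] + " " + t[2]);
        }
        return new ArrayList<>(flat);
    }

    /** Three triples where subject s1 occurs first, then s2, then s1 again. */
    static List<String[]> sampleInput() {
        return List.of(
            new String[] {"s1", "p1", "a"},
            new String[] {"s2", "p1", "b"},
            new String[] {"s1", "p2", "c"});
    }

    public static void main(String[] args) {
        // Grouped by first-encountered subject: [s1 p1 a, s1 p2 c, s2 p1 b]
        System.out.println(viaNestedMaps(sampleInput()));
        // Exact insert order: [s1 p1 a, s2 p1 b, s1 p2 c]
        System.out.println(viaLinkedHashSet(sampleInput()));
    }
}
```

The nested maps move the second s1 triple next to the first one, while the 
flat set returns the triples exactly as they were added.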

I have adapted the test [suite runner to my 
code|https://github.com/Aklakan/jena/blob/84d78cbf1fe19ef45a7e015443384d1e49125de4/jena-db/jena-dboe-storage/src/test/java/org/apache/jena/dboe/storage/advanced/storage/TS_TupleAbstraction.java#L14]
 and it succeeds on all 45 tests. I will do additional tests for the order 
awareness. Because of the way the machinery works, triples and quads are 
treated separately - but this should not be too big an issue.

It is now possible to customize the storage layout pretty freely, as done in 
[DatasetGraphFactoryOrdered|https://github.com/Aklakan/jena/blob/84d78cbf1fe19ef45a7e015443384d1e49125de4/jena-db/jena-dboe-storage/src/main/java/org/apache/jena/dboe/storage/advanced/core/DatasetGraphFactoryOrdered.java#L130]:

The storage layout for insert order awareness usually looks like this:
{code:java}
        storageLayout =
        alt2( // <- a node with alternative storage paths
            innerMap(0, LinkedHashMap::new,
                    innerMap(1, LinkedHashMap::new,
                        leafMap(2, tupleAccessors, LinkedHashMap::new))),
            leafSet(tupleAccessors, LinkedHashSet::new))
{code}

This structure represents a nested index where the root forks into nested 
maps or a linked hash set. At present the assumption is that leaf nodes contain 
the domain tuples (i.e. triples or quads).
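
For readers unfamiliar with the builder functions, the shape of such a layout 
can be approximated by hand with plain collections. The sketch below is an 
assumption-laden illustration, not the implementation behind alt2, innerMap, 
leafMap or leafSet: the class Alt2LayoutSketch, its methods and the modelling 
of triples as List<String> are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical hand-written counterpart of the layout above, for triples only.
class Alt2LayoutSketch {
    // Alternative 1: component 0 -> component 1 -> component 2 -> tuple (the leaf map)
    final Map<String, Map<String, Map<String, List<String>>>> nested = new LinkedHashMap<>();
    // Alternative 2: flat set holding every tuple in insert order (the leaf set)
    final Set<List<String>> leafSet = new LinkedHashSet<>();

    /** Adds a triple to both alternatives; returns false if it was already present. */
    boolean add(String s, String p, String o) {
        List<String> tuple = List.of(s, p, o);
        if (!leafSet.add(tuple)) {
            return false;
        }
        nested.computeIfAbsent(s, k -> new LinkedHashMap<>())
              .computeIfAbsent(p, k -> new LinkedHashMap<>())
              .put(o, tuple);
        return true;
    }

    /** Point lookup through the nested maps; returns null if the triple is absent. */
    List<String> get(String s, String p, String o) {
        Map<String, Map<String, List<String>>> byPredicate = nested.get(s);
        if (byPredicate == null) {
            return null;
        }
        Map<String, List<String>> byObject = byPredicate.get(p);
        return byObject == null ? null : byObject.get(o);
    }

    /** All triples in insert order, served from the flat alternative. */
    List<List<String>> all() {
        return new ArrayList<>(leafSet);
    }

    public static void main(String[] args) {
        Alt2LayoutSketch store = new Alt2LayoutSketch();
        store.add("s1", "p1", "a");
        store.add("s2", "p1", "b");
        store.add("s1", "p2", "c");
        System.out.println(store.all());
        System.out.println(store.nested.keySet());
    }
}
```

Both alternatives hold the same tuples; they only differ in which access 
patterns they serve cheaply.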

The machinery which I added prefers answering patterns (e.g. find(Node.ANY, 
RDF.type, foo, bar)) using index nodes.
In the case of an unconstrained pattern such as ?g ?s ?p ?o the machinery 
prefers to serve the data from the leaf node with the least depth - hence it 
will end up at the leafSet in the example above.
Likewise, if there is an index on G which forks into S->P->O and a leaf set, 
then requesting data for a specific G will prefer the leaf set rather than 
taking the extra effort of iterating through the nested maps.
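
The preference described above can be sketched as a two-step ranking: first by 
how many index levels the pattern constrains, then by depth. This is an 
illustration under stated assumptions and not the logic of the actual 
comparator in the branch: the class CandidateRankingSketch, the Candidate type 
and the simplification that every constrained index level counts equally 
(regardless of its position) are all hypothetical.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical model of candidate storage paths; not the classes from the linked branch.
class CandidateRankingSketch {

    /** A storage path, described by the tuple components its map levels index. */
    static final class Candidate {
        final String name;
        final int[] indexedComponents;

        Candidate(String name, int... indexedComponents) {
            this.name = name;
            this.indexedComponents = indexedComponents;
        }

        /** Number of map levels that must be traversed to reach the tuples. */
        int depth() {
            return indexedComponents.length;
        }

        /** Number of index levels whose component is bound in the pattern (null = wildcard). */
        int matched(String[] pattern) {
            int n = 0;
            for (int c : indexedComponents) {
                if (pattern[c] != null) {
                    n++;
                }
            }
            return n;
        }
    }

    /** Picks the candidate with the most constrained index levels; ties go to the shallower one. */
    static Candidate choose(List<Candidate> candidates, String[] pattern) {
        return candidates.stream()
            .min(Comparator
                .comparingInt((Candidate c) -> -c.matched(pattern))
                .thenComparingInt(Candidate::depth))
            .orElseThrow();
    }

    /** Triple layout from the example: nested S->P->O maps next to a flat leaf set. */
    static List<Candidate> tripleLayout() {
        return List.of(new Candidate("S->P->O", 0, 1, 2), new Candidate("leafSet"));
    }

    /** Quad layout with an index on G that forks into S->P->O and a leaf set. */
    static List<Candidate> quadLayout() {
        return List.of(new Candidate("G->S->P->O", 0, 1, 2, 3), new Candidate("G->leafSet", 0));
    }

    public static void main(String[] args) {
        // Unconstrained pattern: the shallow leaf set wins.
        System.out.println(choose(tripleLayout(), new String[] {null, null, null}).name);
        // Bound subject: the nested index wins.
        System.out.println(choose(tripleLayout(), new String[] {"s1", null, null}).name);
        // Bound graph only: both paths use the G index, the leaf set is shallower.
        System.out.println(choose(quadLayout(), new String[] {"g1", null, null, null}).name);
    }
}
```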

 The candidate ranking is [quite 
simple|https://github.com/Aklakan/jena/blob/JENA-1894/jena-db/jena-dboe-storage/src/main/java/org/apache/jena/dboe/storage/advanced/tuple/analysis/NodeStatsComparator.java#L43].
 So it contains the essentials of a query engine on the tuple level, but it's 
not very sophisticated - e.g. so far it has no special handling of the (?s 
rdf:type Foo) pattern, which is not selective on O. After all, the issue is also 
about insert order awareness / preservation.

The machinery also supports projecting on the tuple components and does 
optimization to serve them from - what Andy called - "the index surface". The 
design allows serving components from the index nodes pretty much directly - 
without e.g. going through intermediate 1-tuples.
Under the hood there is a Cartesian product (constrained to a pattern - cf. 
find()) in conjunction with a reducer function which assembles the result only 
from the projected components.

This allows e.g. for serving the graph names 
[efficiently in a generic way 
|https://github.com/Aklakan/jena/blob/88a1e40c7ba2e8842f8d34750bc64f4b24a95535/jena-db/jena-dboe-storage/src/main/java/org/apache/jena/dboe/storage/advanced/quad/QuadTableCoreFromStorageNode.java#L40].
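
The idea of serving projections from the index keys can be sketched with plain 
collections as follows. This is a simplified illustration and not the generic 
machinery from the branch: the class IndexSurfaceSketch, the fixed 
graph -> subject nesting and the encoding of the remaining components as 
"p o" strings are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

// Hypothetical sketch; quads are indexed as graph -> subject -> set of "p o" strings.
class IndexSurfaceSketch {
    final Map<String, Map<String, Set<String>>> index = new LinkedHashMap<>();

    void add(String g, String s, String p, String o) {
        index.computeIfAbsent(g, k -> new LinkedHashMap<>())
             .computeIfAbsent(s, k -> new LinkedHashSet<>())
             .add(p + " " + o);
    }

    /** Graph names come straight from the keys of the outermost map. */
    List<String> graphNames() {
        return new ArrayList<>(index.keySet());
    }

    /**
     * Projects (g, s): walks the product of graph keys and subject keys,
     * restricted to graph g when it is non-null, and lets the reducer build
     * each result. The leaf sets are never iterated.
     */
    <R> List<R> projectGraphSubject(String g, BiFunction<String, String, R> reducer) {
        Map<String, Map<String, Set<String>>> graphs;
        if (g == null) {
            graphs = index;                      // wildcard: all graphs
        } else if (index.containsKey(g)) {
            graphs = Map.of(g, index.get(g));    // bound: direct lookup
        } else {
            graphs = Map.of();                   // bound but unknown graph
        }
        List<R> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Set<String>>> e : graphs.entrySet()) {
            for (String s : e.getValue().keySet()) {
                out.add(reducer.apply(e.getKey(), s));
            }
        }
        return out;
    }

    static IndexSurfaceSketch sample() {
        IndexSurfaceSketch store = new IndexSurfaceSketch();
        store.add("g1", "s1", "p", "a");
        store.add("g2", "s2", "p", "b");
        store.add("g1", "s3", "p", "c");
        store.add("g1", "s1", "q", "d");
        return store;
    }

    public static void main(String[] args) {
        IndexSurfaceSketch store = sample();
        System.out.println(store.graphNames());
        System.out.println(store.projectGraphSubject(null, (g, s) -> g + "/" + s));
    }
}
```

Each (g, s) pair is produced once, no matter how many quads share it, because 
only map keys are visited.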

The question now is how to bring this issue to a close.
From my perspective the goal is establishing the insert order awareness. For 
this I am writing additional tests and I will fix the machinery within this 
scope - however, making the implementation of the machinery ready as a general 
purpose query engine would certainly be a separate issue.




> Insert-order preserving dataset
> -------------------------------
>
>                 Key: JENA-1894
>                 URL: https://issues.apache.org/jira/browse/JENA-1894
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.14.0
>            Reporter: Claus Stadler
>            Priority: Major
>
> To the best of my knowledge, there is no backend for datasets that retains 
> insert order.
>  This feature is particularly useful when changing RDF files in a git 
> repository, as it makes for nice commits. An insert-order preserving 
> Triple/QuadTable implementation enables:
>  * Writing (subject-grouped) RDF files or events from an RDF stream out in 
> nearly the same way they were read in - this makes it easier to compare 
> outputs of data transformations
>  * Combining ORDER BY with CONSTRUCT queries:
> {code:java}
> Dataset ds = DatasetFactory.createOrderPreservingDataset();
> QueryExecutionFactory.create("CONSTRUCT WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o", ds);
> RDFDataMgr.write(System.out, ds, RDFFormat.TURTLE_BLOCKS);
> {code}
> I have created an implementation for this some time ago with the main classes 
> of the machinery being:
>  * 
> [QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L26]
>  * In addition, I created a lazy (but adequate?) wrapper for re-using a quad 
> table as a triple table:
>  
> [TripleTableFromQuadTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/TripleTableFromQuadTable.java#L30]
>  * The DatasetGraph wrapper:
>  
> [DatasetGraphQuadsImpl.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/DatasetGraphQuadsImpl.java#L32]
> The actual factory code then uses:
> {code:java}
>     public static DatasetGraph createOrderPreservingDatasetGraph() {
>         QuadTable quadTable = new QuadTableFromNestedMaps();
>         TripleTable tripleTable = new TripleTableFromQuadTable(quadTable);
>         DatasetGraph result = new DatasetGraphInMemory(quadTable, tripleTable);
>         return result;
>     }
> {code}
> Note that DatasetGraphQuadsImpl at present falsely claims that it is 
> transaction aware - because otherwise any SPARQL insert caused an exception 
> (I have not tried with the latest fixes for 3.15.0-SNAPSHOT yet). In any 
> case, for the use case of writing out RDF, transactions may not even be 
> necessary, but if there is an easy way to add them, then it should be done.
> An example of the above code in action is here: [Git Diff based on ordered 
> turtle-blocks output 
> |https://github.com/SmartDataAnalytics/lodservatory/commit/ec50cd33230a771c557c1ed2751799401ea3fd89]
> The downside of using this kind of order preserving dataset is that 
> essentially it only features a gspo index. Hence, the performance 
> characteristics of this kind of order preserving dataset - which is intended 
> mostly for serialization or presentation - vary greatly from the 
> query-optimized implementations.
> In any case, order preserving datasets are a highly useful feature for Jena 
> and I'd gladly contribute a PR for that. My main questions are:
>  * What to name the factory methods in DatasetFactory, DatasetGraphFactory 
> etc - createOrderPreservingDataset?
>  * Is the approach using QuadTableFromNestedMaps needed - or can a different 
> implementation of QuadTable be repurposed?
>  * It seems that the abstract class DatasetGraphQuads does not have any 
> implementation, at least in ARQ and the Jena modules I use (according to 
> Eclipse) - so my custom implementation of DatasetGraphQuadsImpl seems to be 
> needed, or is there a similar class lying around in another Jena package?



