[jira] [Commented] (JENA-1894) Insert-order preserving dataset

Claus Stadler (Jira) Sun, 06 Sep 2020 07:50:09 -0700


    [ 
https://issues.apache.org/jira/browse/JENA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191290#comment-17191290
 ]


Claus Stadler commented on JENA-1894:
-------------------------------------

After experimenting with different designs, the following is the one that I 
think has good modularity and even allows for strict order preservation. I 
still need to write test cases (checking that the dataset implementation 
works correctly and, in addition, that the insert order matches the expected 
one), but this is the design I have come up with, which you may want to 
comment on:

I have created an implementation of StorageRDF called 
[StorageRDFBasic|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/6b0a318d92c35151729f12b048ed108d95847a5e/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/dboe/StorageRDFBasic.java#L36]
 which - analogous to DatasetGraphInMemory - holds a 
[TripleTableCore|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/develop/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/dboe/TripleTableCore.java]
 and a 
[QuadTableCore|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/develop/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/dboe/QuadTableCore.java#L17].
 (The naming is preliminary.)
 These classes are essentially just in-memory collections of Triples and Quads 
with a find method.

The find method at present returns a Stream, which is nice for read 
operations. The advantage of an Iterator, of course, would be that, with 
(quite some) additional work, removals could be implemented without the need 
to copy data.
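
To illustrate the trade-off, here is a minimal, self-contained sketch. It does 
not use the actual TripleTableCore API: triples are modelled as plain 
[s, p, o] string lists, and RemovalSketch and both method names are 
hypothetical.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper: triples are [s, p, o] string lists, not Jena Triples. */
class RemovalSketch {

    /** Stream-based removal: matches must be copied before the set is modified. */
    static int removeBySubjectViaStream(Set<List<String>> triples, String subject) {
        List<List<String>> matches = triples.stream()
                .filter(t -> t.get(0).equals(subject))
                .collect(Collectors.toList());
        // Only now is it safe to modify the underlying set.
        matches.forEach(triples::remove);
        return matches.size();
    }

    /** Iterator-based removal: Iterator.remove() deletes in place, without a copy. */
    static int removeBySubjectViaIterator(Set<List<String>> triples, String subject) {
        int removed = 0;
        for (Iterator<List<String>> it = triples.iterator(); it.hasNext();) {
            if (it.next().get(0).equals(subject)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

Both variants leave the set in the same state. The difference is only the 
intermediate list of matches that the Stream variant has to build.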

An implementation of QuadTableCore that is backed by a Map<Node, 
TripleTableCore> is 
[QuadTableCoreFromMapOfTripleTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/develop/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/dboe/QuadTableCoreFromMapOfTripleTable.java#L11]
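
The idea of a map-backed quad table can be sketched roughly as follows. This 
is a simplified illustration under stated assumptions, not the linked class: 
graph names and terms are plain strings instead of Nodes, each per-graph 
"triple table" is just a Set of [s, p, o] lists, null acts as a wildcard, and 
the name GraphKeyedQuadTable is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;
import java.util.stream.Stream;

/** Hypothetical stand-in for a quad table backed by a map of triple tables. */
class GraphKeyedQuadTable {
    private final Map<String, Set<List<String>>> graphs = new LinkedHashMap<>();
    private final Supplier<Set<List<String>>> tripleTableSupplier;

    GraphKeyedQuadTable(Supplier<Set<List<String>>> tripleTableSupplier) {
        this.tripleTableSupplier = tripleTableSupplier;
    }

    void add(String g, String s, String p, String o) {
        // A new triple table is created lazily for each graph name.
        graphs.computeIfAbsent(g, k -> tripleTableSupplier.get()).add(Arrays.asList(s, p, o));
    }

    void delete(String g, String s, String p, String o) {
        Set<List<String>> triples = graphs.get(g);
        if (triples != null && triples.remove(Arrays.asList(s, p, o)) && triples.isEmpty()) {
            graphs.remove(g); // drop graphs that became empty
        }
    }

    /** Returns matching quads as [g, s, p, o] lists; null arguments match anything. */
    Stream<List<String>> find(String g, String s, String p, String o) {
        Stream<String> graphNames;
        if (g == null) {
            graphNames = graphs.keySet().stream();
        } else if (graphs.containsKey(g)) {
            graphNames = Stream.of(g);
        } else {
            graphNames = Stream.empty();
        }
        return graphNames.flatMap(name -> graphs.get(name).stream()
                .filter(t -> (s == null || s.equals(t.get(0)))
                        && (p == null || p.equals(t.get(1)))
                        && (o == null || o.equals(t.get(2))))
                .map(t -> Arrays.asList(name, t.get(0), t.get(1), t.get(2))));
    }
}
```

In this sketch an unconstrained find() walks the map graph by graph, so quads 
that were inserted interleaved across graphs come back grouped by graph rather 
than in strict insert order, even when the map and the per-graph sets are 
themselves insertion-ordered.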

To completely preserve insert order a QuadTableCore can be wrapped with 
[QuadTableWithInsertOrderPreservation.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/e4ae89ba6f1b36bc17109fec2816254e87ff1395/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/dboe/QuadTableWithInsertOrderPreservation.java#L17].
 This manages a separate set (typically LinkedHashSet) whose items are returned 
when invoking find() without constraints.
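
A minimal sketch of such a wrapper, assuming a simplified table interface 
rather than the real QuadTableCore: quads are [g, s, p, o] string lists, null 
is a wildcard, and SimpleQuadTable, HashQuadTable and InsertOrderQuadTable are 
hypothetical names.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

/** Hypothetical, simplified stand-in for a quad table interface. */
interface SimpleQuadTable {
    void add(List<String> quad);
    void delete(List<String> quad);
    Stream<List<String>> find(String g, String s, String p, String o);
}

/** Base table with unspecified iteration order. */
class HashQuadTable implements SimpleQuadTable {
    private final Set<List<String>> quads = new HashSet<>();

    public void add(List<String> quad) { quads.add(quad); }

    public void delete(List<String> quad) { quads.remove(quad); }

    public Stream<List<String>> find(String g, String s, String p, String o) {
        String[] pattern = {g, s, p, o};
        return quads.stream().filter(q -> {
            for (int i = 0; i < 4; i++) {
                if (pattern[i] != null && !pattern[i].equals(q.get(i))) {
                    return false;
                }
            }
            return true;
        });
    }
}

/** Wrapper that records insert order and answers unconstrained find() from it. */
class InsertOrderQuadTable implements SimpleQuadTable {
    private final SimpleQuadTable delegate;
    private final Set<List<String>> insertOrder = new LinkedHashSet<>();

    InsertOrderQuadTable(SimpleQuadTable delegate) { this.delegate = delegate; }

    public void add(List<String> quad) {
        if (insertOrder.add(quad)) { // duplicates keep their original position
            delegate.add(quad);
        }
    }

    public void delete(List<String> quad) {
        if (insertOrder.remove(quad)) {
            delegate.delete(quad);
        }
    }

    public Stream<List<String>> find(String g, String s, String p, String o) {
        boolean unconstrained = g == null && s == null && p == null && o == null;
        // Constrained lookups still go to the delegate and its indexes.
        return unconstrained ? insertOrder.stream() : delegate.find(g, s, p, o);
    }
}
```

The cost of this approach is that every quad is referenced twice, once in the 
delegate and once in the ordered set. In exchange, the delegate can use 
whatever index layout suits constrained lookups.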

Exactly the same implementations exist for TripleTableCore and TripleTable.

I assemble the final DatasetGraph in the following way. Note that this allows 
preserving insert order separately on the quad level and on each graph's 
triple level:
{code:java}
    public static DatasetGraph createInsertOrderPreservingDataset(
            boolean strictOrderOnQuads, boolean strictOrderOnTriples) {
        Supplier<TripleTableCore> tripleTableSupplier = strictOrderOnTriples
                ? () -> new TripleTableWithInsertOrderPreservation(
                        new TripleTableCoreFromNestedMapsImpl())
                : () -> new TripleTableCoreFromNestedMapsImpl();

        QuadTableCore quadTable = new QuadTableCoreFromMapOfTripleTable(tripleTableSupplier);

        if (strictOrderOnQuads) {
            quadTable = new QuadTableWithInsertOrderPreservation(quadTable);
        }

        StorageRDF storage = StorageRDFBasic.createWithQuadsOnly(quadTable);
        DatasetGraph result = new DatasetGraphStorage(
                storage, new StoragePrefixesMem(), TransactionalLock.createMRSW());
        return result;
    }
{code}
Sidenote about tuples: While using tuple objects to generalize over 
Triples/Quads seems intriguing, for these implementations they did not make 
things easier. Also, QuadTableCore / TripleTableCore can be seen as 
front-facing interfaces: further implementations could still delegate from 
these domain objects to some tuple-driven representation.

> Insert-order preserving dataset
> -------------------------------
>
>                 Key: JENA-1894
>                 URL: https://issues.apache.org/jira/browse/JENA-1894
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.14.0
>            Reporter: Claus Stadler
>            Priority: Major
>
> To the best of my knowledge, there is no backend for datasets that retains 
> insert order.
>  This feature is particularly useful when changing RDF files in a git 
> repository, as it makes for nice commits. An insert-order preserving 
> Triple/QuadTable implementation enables:
>  * Writing (subject-grouped) RDF files or events from an RDF stream out in 
> nearly the same way they were read in - this makes it easier to compare 
> outputs of data transformations
>  * Combining ORDER BY with CONSTRUCT queries:
> {code:java}
> Dataset ds = DatasetFactory.createOrderPreservingDataset();
> QueryExecutionFactory.create("CONSTRUCT WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o", ds);
> RDFDataMgr.write(System.out, ds, RDFFormat.TURTLE_BLOCKS);
> {code}
> I have created an implementation for this some time ago with the main classes 
> of the machinery being:
>  * 
> [QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L26]
>  * In addition, I created a lazy (but adequate?) wrapper for re-using a quad 
> table as a triple table:
>  
> [TripleTableFromQuadTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/TripleTableFromQuadTable.java#L30]
>  * The DatasetGraph wrapper:
>  
> [DatasetGraphQuadsImpl.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/DatasetGraphQuadsImpl.java#L32]
> The actual factory code then uses:
> {code:java}
>     public static DatasetGraph createOrderPreservingDatasetGraph() {
>         QuadTable quadTable = new QuadTableFromNestedMaps();
>         TripleTable tripleTable = new TripleTableFromQuadTable(quadTable);
>         DatasetGraph result = new DatasetGraphInMemory(quadTable, tripleTable);
>         return result;
>     }
> {code}
> Note that DatasetGraphQuadsImpl at present falsely claims to be 
> transaction-aware, because otherwise any SPARQL insert caused an exception 
> (I have not tried with the latest fixes for 3.15.0-SNAPSHOT yet). In any 
> case, for the use case of writing out RDF, transactions may not even be 
> necessary, but if there is an easy way to add them, then it should be done.
> An example of the above code in action is here: [Git Diff based on ordered 
> turtle-blocks output 
> |https://github.com/SmartDataAnalytics/lodservatory/commit/ec50cd33230a771c557c1ed2751799401ea3fd89]
> The downside of using this kind of order-preserving dataset is that it 
> essentially features only a gspo index. Hence, the performance 
> characteristics of this kind of order-preserving dataset - which is intended 
> mostly for serialization or presentation - vary greatly from those of the 
> query-optimized implementations.
> In any case, order preserving datasets are a highly useful feature for Jena 
> and I'd gladly contribute a PR for that. My main questions are:
>  * How to call the factory methods in DatasetFactory, DatasetGraphFactory etc 
> - createOrderPreservingDataset?
>  * Is the approach using QuadTableFromNestedMaps needed - or can a different 
> implementation of QuadTable be repurposed?
>  * It seems that the abstract class DatasetGraphQuads does not have any 
> implementation at least in ARQ and the jena modules I use (according to 
> eclipse) - so my custom implementation of DatasetGraphQuadsImpl seems to be 
> needed, or is there a similar class lying around in another jena package?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
