[jira] [Commented] (JENA-1894) Insert-order preserving dataset

Claus Stadler (Jira) Sat, 12 Sep 2020 09:08:07 -0700


    [ 
https://issues.apache.org/jira/browse/JENA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194766#comment-17194766
 ]


Claus Stadler commented on JENA-1894:
-------------------------------------

Hi Andy,

I realized that if I want to project the graph nodes from a set of domain 
objects, such as quads, then it does indeed make sense to have tuple and 
component views on the original collection of domain objects. Furthermore, in 
the case of my nested LinkedHashMaps, a model that captures more advanced 
'tuple queries' than just find would make it possible to exploit the storage 
more efficiently, for example by serving the graph nodes directly from a Map's 
keySet.

For this use case of projecting distinct graphs I needed the following 
operations in addition to the filtering by the find method:
* projection
* distinct

And I bundled this up into a new API:
{code}
    default Stream<Node> listGraphNodes() {
        return newFinder().projectOnly(3).distinct().stream();
    }
{code}
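
To illustrate why the storage could answer such a query directly, here is a minimal sketch using only the JDK. The class name NestedMapGraphNodes, the String-based nodes and the g -> s -> p -> o nesting are assumptions made for illustration; this is not the code from the linked branch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical insert-order preserving quad storage: graph -> subject -> predicate -> objects.
class NestedMapGraphNodes {
    private final Map<String, Map<String, Map<String, Set<String>>>> store = new LinkedHashMap<>();

    void add(String g, String s, String p, String o) {
        store.computeIfAbsent(g, k -> new LinkedHashMap<>())
             .computeIfAbsent(s, k -> new LinkedHashMap<>())
             .computeIfAbsent(p, k -> new LinkedHashSet<>())
             .add(o);
    }

    // Generic route: enumerate every quad, project the graph component, then de-duplicate.
    List<String> graphNodesByScan() {
        Set<String> seen = new LinkedHashSet<>();
        for (Map.Entry<String, Map<String, Map<String, Set<String>>>> g : store.entrySet()) {
            for (Map<String, Set<String>> byPredicate : g.getValue().values()) {
                for (Set<String> objects : byPredicate.values()) {
                    for (String ignored : objects) {
                        seen.add(g.getKey()); // one projection per stored quad
                    }
                }
            }
        }
        return new ArrayList<>(seen);
    }

    // Storage-aware route: the outermost keySet already is the distinct projection.
    List<String> graphNodesFromKeySet() {
        return new ArrayList<>(store.keySet());
    }

    public static void main(String[] args) {
        NestedMapGraphNodes t = new NestedMapGraphNodes();
        t.add("g2", "s1", "p", "o1");
        t.add("g1", "s1", "p", "o1");
        t.add("g2", "s2", "p", "o2");
        System.out.println(t.graphNodesByScan());
        System.out.println(t.graphNodesFromKeySet());
    }
}
```

Both methods return the graph names in insertion order, but the second one never touches the individual quads.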

The most basic interface is the TupleTableCore, which provides the traditional 
find() method, the newFinder() fluent API and, most prominently, the 
findTuples() method:
```
interface TupleTableCore<DomainType, ComponentType> {
    // 'Classic' find method for tuple-like domain objects - for
    // TripleTableCore and QuadTableCore this delegates to the
    // find(s, p, o) and find(g, s, p, o) methods.
    // Sidenote: ComponentType[] or ComponentType... (varargs) is so painful
    // to use because of generic arrays, so I used List<>
    Stream<DomainType> find(List<ComponentType> pattern);

    // Provide a Tuple view over the domain objects based on the given
    // tuple query
    Stream<Tuple<ComponentType>> findTuples(TupleQuery tupleQuery);
}
```

The TupleFinder itself is just a fluent API wrapper around the configuration of 
a single TupleQuery object.
The TupleQuery holds the information about the projection, the constraints and 
whether to apply distinct.
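
As a rough illustration of that idea, the sketch below folds the query state and the fluent setters into one JDK-only class. Everything here is hypothetical: the name TupleQuerySketch, the eq() and evaluate() methods, the use of String components, and the g, s, p, o index order (graph at position 0) are assumptions and may differ from the actual TupleQuery and TupleFinder in the linked branch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for a tuple query: constraints, projection and a distinct flag.
class TupleQuerySketch {
    private final List<String> pattern; // one slot per component; null means "any"
    private int[] projection;           // component indexes to keep; null keeps all
    private boolean distinct;

    TupleQuerySketch(int dimension) {
        pattern = new ArrayList<>(Collections.<String>nCopies(dimension, null));
    }

    // Fluent configuration, in the spirit of a finder wrapping a single query object.
    TupleQuerySketch eq(int idx, String value) { pattern.set(idx, value); return this; }
    TupleQuerySketch projectOnly(int... idxs) { projection = idxs.clone(); return this; }
    TupleQuerySketch distinct() { distinct = true; return this; }

    // Naive evaluation: filter by the constraints, then project, then de-duplicate.
    Stream<List<String>> evaluate(Stream<List<String>> tuples) {
        Stream<List<String>> result = tuples.filter(this::matches);
        if (projection != null) {
            result = result.map(this::project);
        }
        if (distinct) {
            result = result.distinct();
        }
        return result;
    }

    private boolean matches(List<String> tuple) {
        for (int i = 0; i < pattern.size(); i++) {
            String constraint = pattern.get(i);
            if (constraint != null && !constraint.equals(tuple.get(i))) {
                return false;
            }
        }
        return true;
    }

    private List<String> project(List<String> tuple) {
        List<String> out = new ArrayList<>(projection.length);
        for (int idx : projection) {
            out.add(tuple.get(idx));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> quads = List.of(
            List.of("g1", "s1", "p", "o1"),
            List.of("g1", "s2", "p", "o2"),
            List.of("g2", "s1", "p", "o1"));
        System.out.println(new TupleQuerySketch(4).projectOnly(0).distinct()
            .evaluate(quads.stream()).collect(Collectors.toList()));
    }
}
```

Because the query is plain data, a storage backend is free to inspect it and pick a cheaper plan than this generic filter, project, distinct pipeline.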

The test cases linked below show this API in use on a QuadTableCore.
The implementation does not yet exploit the nested in-memory maps, but making 
my storage of nested maps run a TupleQuery in an optimal way would be the next 
thing I'd look into. I am wondering whether facilities similar to what I just 
described already exist somewhere else in the depths of Jena.

In any case, here are simple but working test cases of what this API design 
delivers:

https://github.com/Aklakan/jena/blob/bb022b8c9869abb4728f2f356776c5f4a420f300/jena-db/jena-dboe-storage/src/test/java/org/apache/jena/dboe/storage/storage/TestTupleTableCore.java#L39



> Insert-order preserving dataset
> -------------------------------
>
>                 Key: JENA-1894
>                 URL: https://issues.apache.org/jira/browse/JENA-1894
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.14.0
>            Reporter: Claus Stadler
>            Priority: Major
>
> To the best of my knowledge, there is no backend for datasets that retains 
> insert order.
>  This feature is particularly useful when changing RDF files in a git 
> repository, as it makes for nice commits. An insert-order preserving 
> Triple/QuadTable implementation enables:
>  * Writing (subject-grouped) RDF files or events from an RDF stream out in 
> nearly the same way they were read in - this makes it easier to compare 
> outputs of data transformations
>  * Combining ORDER BY with CONSTRUCT queries:
> {code:java}
> Dataset ds = DatasetFactory.createOrderPreservingDataset();
> QueryExecutionFactory.create("CONSTRUCT WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o", ds);
> RDFDataMgr.write(System.out, ds, RDFFormat.TURTLE_BLOCKS);
> {code}
> I have created an implementation for this some time ago with the main classes 
> of the machinery being:
>  * 
> [QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L26]
>  * In addition, I created a lazy (but adequate?) wrapper for re-using a quad 
> table as a triple table:
>  
> [TripleTableFromQuadTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/TripleTableFromQuadTable.java#L30]
>  * The DatasetGraph wrapper:
>  
> [DatasetGraphQuadsImpl.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/DatasetGraphQuadsImpl.java#L32]
> The actual factory code then uses:
> {code:java}
>     public static DatasetGraph createOrderPreservingDatasetGraph() {
>         QuadTable quadTable = new QuadTableFromNestedMaps();
>         TripleTable tripleTable = new TripleTableFromQuadTable(quadTable);
>         DatasetGraph result = new DatasetGraphInMemory(quadTable, tripleTable);
>         return result;
>     }
> {code}
> Note that DatasetGraphQuadsImpl at present falsely claims to be 
> transaction aware, because otherwise any SPARQL insert caused an exception 
> (I have not tried with the latest fixes for 3.15.0-SNAPSHOT yet). In any 
> case, for the use case of writing out RDF, transactions may not even be 
> necessary, but if there is an easy way to add them, then it should be done.
> An example of the above code in action is here: [Git Diff based on ordered 
> turtle-blocks output 
> |https://github.com/SmartDataAnalytics/lodservatory/commit/ec50cd33230a771c557c1ed2751799401ea3fd89]
> The downside of using this kind of order preserving dataset is that 
> essentially it only features a gspo index. Hence, the performance 
> characteristics of this kind of order preserving dataset, which is intended 
> mostly for serialization or presentation, vary greatly from those of the 
> query-optimized implementations.
> In any case, order preserving datasets are a highly useful feature for Jena 
> and I'd gladly contribute a PR for that. My main questions are:
>  * What should the factory methods in DatasetFactory, DatasetGraphFactory 
> etc. be called - createOrderPreservingDataset?
>  * Is the approach using QuadTableFromNestedMaps needed - or can a different 
> implementation of QuadTable be repurposed?
>  * It seems that the abstract class DatasetGraphQuads does not have any 
> implementation, at least in ARQ and the Jena modules I use (according to 
> Eclipse) - so my custom implementation of DatasetGraphQuadsImpl seems to be 
> needed, or is there a similar class lying around in another Jena package?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
