[jira] [Commented] (JENA-1894) Insert-order preserving dataset

Claus Stadler (Jira) Thu, 03 Sep 2020 15:35:16 -0700


    [ 
https://issues.apache.org/jira/browse/JENA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190417#comment-17190417
 ]


Claus Stadler commented on JENA-1894:
-------------------------------------

Hi Andy,

I took a look at PMapQuadTable however it uses the 
[dexx|https://github.com/andrewoma/dexx] collections and I did not see an easy 
way to resue plain Java LinkedHashMap's for this purpose.

I made an attempt at implementing the missing transaction support in my 
QuadTable implementation that used nested (LinkedHash)Maps. Essentially I ended 
up with creating a thread locale that gets initialized with an empty diff on 
txn begin and eventually applies any changes in the diff on commit

[QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/ba2d1f21388879a1dba5194f9f1538bcf83de510/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L22]

Is my approach valid? The goal is to have insert-order sensitivity w.r.t. the 
components in gspo order - so that graphs and subjects are grouped in the order 
they were first encountered.
I realized that to actually preserve the insert order the quads would 
additionally have to be stored in e.g. a LinkedHashSet but the diff-based 
approach should be easily adaptable - provided that it is not fundamentally 
flawed in the first place.

I can't say I have totally understood the separation of the txn code between 
DatasetGraphInMemory and QuadTable. So my impression based on  PMapQuadTable is 
that I don't have to deal with concurrent write transactions on QuadTable 
because this is handled at the Dataset level - I might be wrong though.



> Insert-order preserving dataset
> -------------------------------
>
>                 Key: JENA-1894
>                 URL: https://issues.apache.org/jira/browse/JENA-1894
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.14.0
>            Reporter: Claus Stadler
>            Priority: Major
>
> To the best of my knowledge, there is no backend for datasets that retains 
> insert order.
>  This feature is particularly useful when changing RDF files in a git 
> repository, as it makes for nice commits. An insert-order preserving 
> Triple/QuadTable implementation enables:
>  * Writing (subject-grouped) RDF files or events from an RDF stream out in 
> nearly the same way they were read in - this makes it easier to compare 
> outputs of data transformations
>  * Combining ORDER BY with CONSTRUCT queries:
> {code:java}
> Dataset ds = DatasetFactory.createOrderPreservingDataset();
> QueryExecutionFactory.create("CONSTRUCT WHERE { ?s ?p ?o } ORDER BY ?s ?p 
> ?o", ds);
> RDFDataMgr.write(System.out, ds, RDFFormat.TURTLE_BLOCKS);
> {code}
> I have created an implementation for this some time ago with the main classes 
> of the machinery being:
>  * 
> [QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L26]
>  * In addition, I created a lazy (but adequate?) wrapper for re-using a quad 
> table as a triple table:
>  
> [TripleTableFromQuadTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/TripleTableFromQuadTable.java#L30]
>  * The DatasetGraph wapper:
>  
> [DatasetGraphQuadsImpl.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/DatasetGraphQuadsImpl.java#L32]
> The actual factory code then uses:
> {code:java}
>     public static DatasetGraph createOrderPreservingDatasetGraph() {
>         QuadTable quadTable = new QuadTableFromNestedMaps();
>         TripleTable tripleTable = new TripleTableFromQuadTable(quadTable);
>         DatasetGraph result = new DatasetGraphInMemory(quadTable, 
> tripleTable);
>         return result;
>     }
> {code}
> Note, that DatasetGraphQuadsImpl at present falsly claims that it is 
> transaction aware - because otherwise any SPARQL insert caused an exception 
> (I have not tried with the latest fixes for 3.15.0-SNAPSHOT yet). In any 
> case, for the use cases of writing out RDF transactions may not even be 
> necessary, but if there is an easy way to add them, then it should be done.
> An example of the above code in action is here: [Git Diff based on ordered 
> turtle-blocks output 
> |https://github.com/SmartDataAnalytics/lodservatory/commit/ec50cd33230a771c557c1ed2751799401ea3fd89]
> The downside of using this kind of order preserving dataset is, that 
> essentially it only features an gspo index. Hence, the performance 
> characteristics of this kind of order preserving dataset - which is intended 
> mostly for serialization or presentation - varies greatly form the 
> query-optimized implementations.
> In any case, order preserving datasets are a highly useful feature for Jena 
> and I'd gladly contribute a PR for that. My main questions are:
>  * How to call the factory methods in DatasetFactory, DatasetGraphFactory etc 
> - createOrderPreservingDataset?
>  * In the approach using QuadTableFromNestedMaps needed - or can a different 
> implementation of QuadTable be repurposed?
>  * It seems that the abstract class DatasetGraphQuads does not have any 
> implementation at least in ARQ and the jena modules I use (according to 
> eclipse) - so my custom implementation of DatasetGraphQuadsImpl seems to be 
> needed, or is there a similar class lying around in another jena package?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (JENA-1894) Insert-order preserving dataset

Reply via email to