[
https://issues.apache.org/jira/browse/JENA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195105#comment-17195105
]
Claus Stadler commented on JENA-1894:
-------------------------------------
Hi Andy,
Yes, I have been on quite a coding spree the past few days. I did rough
implementations of these ideas about 8 years ago when I created a source
selection component for query federation (with nested maps of Patricia tries
for prefix matching of IRIs). While working on this issue I realized that it
has by now become much clearer what the designs would actually have (had) to
look like, so I used this opportunity to write them down in code :)
Right now I am working on having requests for distinct projections run on the
top layer of the indexes.
(1) The idea of QuadTableWithHiddenGraphs was exactly to hide the
SpecialMarkerNode from (S,P,O,SpecialMarkerNode). I was thinking that
TupleTableCore and its domain specializations for Triples and Quads could serve
for any contract - i.e. whether SpecialMarkerNode is exposed or hidden (or
disallowed altogether). In my opinion it is not necessary to impose
restrictions on those *TableCore interfaces and rather allow reuse of them as
building blocks as one sees fit.
*However*, now that I have worked towards generalizations on the tuple level, I
tend to think that the hiding should be implemented as an injected constraint
on the tuple level, e.g. by adding the condition gIdx != SpecialMarkerNode to
the TupleQuery object. On that level it is possible to leverage the top
layer of indexes - otherwise all pattern matching would most likely have to run
on the generic find() method and all the benefits the architecture promises
would be lost.
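To illustrate the injected-constraint idea, here is a minimal sketch that uses plain strings in place of Nodes. The names SketchTupleQuery, HiddenGraphSketch, MARKER and hideMarker are made up for this illustration and are not the actual classes; the linear scan in find() merely stands in for an index-aware executor that could inspect the same constraints.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tuple query: per component an optional "must equal" and an
 *  optional "must not equal" constraint (null means unconstrained). */
class SketchTupleQuery {
    final String[] mustEqual = new String[4];    // indexes 0..3 = S, P, O, G
    final String[] mustNotEqual = new String[4];

    boolean matches(String[] tuple) {
        for (int i = 0; i < 4; i++) {
            if (mustEqual[i] != null && !mustEqual[i].equals(tuple[i])) return false;
            if (mustNotEqual[i] != null && mustNotEqual[i].equals(tuple[i])) return false;
        }
        return true;
    }
}

class HiddenGraphSketch {
    static final String MARKER = "urn:x-sketch:marker"; // assumed marker value
    static final int G_IDX = 3;

    /** Inject gIdx != MARKER into the query instead of filtering results afterwards. */
    static SketchTupleQuery hideMarker(SketchTupleQuery q) {
        q.mustNotEqual[G_IDX] = MARKER;
        return q;
    }

    /** Stand-in executor: a real one would pick an index based on the constraints. */
    static List<String[]> find(List<String[]> store, SketchTupleQuery q) {
        List<String[]> out = new ArrayList<>();
        for (String[] t : store) {
            if (q.matches(t)) out.add(t);
        }
        return out;
    }
}
```

Because the constraint travels with the query object, a query that explicitly asks for the marker graph simply matches nothing, which is the "hidden" behaviour.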
Actually I would have liked to leverage guava's Range<> object for specifying
range constraints (of which equals is a special case) - which might be useful
when e.g. using in memory TreeSets or specialized storage. However Range<>
requires a Comparable which neither Node nor NodeValue implements. There is
NodeValue.compareAlways and NodeUtils.compareRDFTerms, but I can see that one
may want to provide custom comparators.
So for now I left this out and I am only considering equality restrictions for
the index lookups.
In the future, the options are either to introduce a custom Range class with a
separate comparator (which guava deliberately avoided) or to add another
indirection that wraps a tuple component as a ComparableTupleComponent.
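The wrapper option could look roughly like the sketch below. ComparableComponent and RangeLookupSketch are hypothetical names, strings stand in for Nodes, and the range lookup uses a plain TreeSet; since the wrapper implements Comparable, it should also be usable as the type argument of guava's Range<>.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;
import java.util.TreeSet;

/** Hypothetical wrapper: pairs a value with a comparator so it becomes Comparable.
 *  equals/hashCode are omitted for brevity; TreeSet only relies on compareTo. */
final class ComparableComponent<T> implements Comparable<ComparableComponent<T>> {
    final T value;
    final Comparator<? super T> comparator;

    ComparableComponent(T value, Comparator<? super T> comparator) {
        this.value = value;
        this.comparator = Objects.requireNonNull(comparator);
    }

    @Override
    public int compareTo(ComparableComponent<T> other) {
        return comparator.compare(value, other.value);
    }
}

class RangeLookupSketch {
    /** Returns all values v with lo <= v <= hi according to the given comparator. */
    static List<String> between(Collection<String> values, Comparator<String> cmp,
                                String lo, String hi) {
        TreeSet<ComparableComponent<String>> index = new TreeSet<>();
        for (String v : values) {
            index.add(new ComparableComponent<>(v, cmp));
        }
        List<String> out = new ArrayList<>();
        for (ComparableComponent<String> c : index.subSet(
                new ComparableComponent<>(lo, cmp), true,
                new ComparableComponent<>(hi, cmp), true)) {
            out.add(c.value);
        }
        return out;
    }
}
```

The cost of this design is one extra object per component, and every wrapper in one collection has to carry the same comparator for the ordering to be consistent.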
(2) So this builder pattern for creating specifications of nested storage -
maps, sets and 'alternatives' - has the use case of secondary indices somewhat
in mind, although I did not think it fully through. For the Java-collection
based in-memory storage framework I created I think it is not that essential,
as the same instance of a quad can be referred to from any collection.
The final component of the index does not have to be the quad or triple being
inserted - in fact the final value is the result of a function application to
an incoming tuple - which can be the identity mapping (returns the tuple
itself).
In general this value function might e.g. perform an (IO-based) dictionary
lookup.
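As a rough illustration of the value-function idea, the sketch below builds an insert-order preserving S -> P -> O index from nested LinkedHashMaps, where the leaf value is computed from the incoming tuple. NestedMapIndexSketch and its methods are hypothetical names, string arrays stand in for triples, and the value function is assumed to never return null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical S -> P -> O -> value index; V is whatever the value function returns. */
class NestedMapIndexSketch<V> {
    private final Map<String, Map<String, Map<String, V>>> spo = new LinkedHashMap<>();
    private final Function<String[], V> valueFn;

    NestedMapIndexSketch(Function<String[], V> valueFn) {
        this.valueFn = valueFn;
    }

    /** Stores valueFn(tuple) at the leaf; an identity function would store the tuple itself. */
    void add(String s, String p, String o) {
        spo.computeIfAbsent(s, k -> new LinkedHashMap<>())
           .computeIfAbsent(p, k -> new LinkedHashMap<>())
           .computeIfAbsent(o, k -> valueFn.apply(new String[] {s, p, o}));
    }

    /** Leaf values for one subject, in the order their P and O keys were first inserted. */
    List<V> findBySubject(String s) {
        List<V> out = new ArrayList<>();
        Map<String, Map<String, V>> po = spo.get(s);
        if (po != null) {
            for (Map<String, V> ov : po.values()) {
                out.addAll(ov.values());
            }
        }
        return out;
    }
}
```

For example, `new NestedMapIndexSketch<String>(t -> String.join(" ", t))` stores a serialized form of each triple, whereas a dictionary-backed function could store an id instead.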
*However*, for IO-bound work I would consider adding e.g. RxJava to the
interfaces, because with this API it is easy to set timeouts on
requests or cancel them externally.
> It treats quads as SPOG rather than GSPO - does that have a specific
> advantage?
The choice is pretty arbitrary - but I think for tuples it's more natural if
quads are seen like triples with an extra component (cf. ntriples vs nquads).
Related offtopic: I have created an RxJava version of among others the
RDFDataMgr called
[RDFDataMgrRx|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/c74c51ce601de95f834d35b7f25dcf9f80c7ea55/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/RDFDataMgrRx.java#L81].
With these reactive streams it is super easy to have sophisticated resource
management like so (pseudo code):
{code:java}
// Read data on another thread and push the abort button if it did not
// complete in time
Disposable abortButton = RDFDataMgrRx.createFlowableTriples(filename, Lang.NTRIPLES)
    .subscribeOn(Schedulers.io())
    .timeout(10, TimeUnit.SECONDS)
    .limit(1000000)
    .subscribe(System.out::println);
Thread.sleep(5000);
abortButton.dispose();
{code}
> Insert-order preserving dataset
> -------------------------------
>
> Key: JENA-1894
> URL: https://issues.apache.org/jira/browse/JENA-1894
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ
> Affects Versions: Jena 3.14.0
> Reporter: Claus Stadler
> Priority: Major
>
> To the best of my knowledge, there is no backend for datasets that retains
> insert order.
> This feature is particularly useful when changing RDF files in a git
> repository, as it makes for nice commits. An insert-order preserving
> Triple/QuadTable implementation enables:
> * Writing (subject-grouped) RDF files or events from an RDF stream out in
> nearly the same way they were read in - this makes it easier to compare
> outputs of data transformations
> * Combining ORDER BY with CONSTRUCT queries:
> {code:java}
> Dataset ds = DatasetFactory.createOrderPreservingDataset();
> QueryExecutionFactory.create("CONSTRUCT WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o", ds);
> RDFDataMgr.write(System.out, ds, RDFFormat.TURTLE_BLOCKS);
> {code}
> I have created an implementation for this some time ago with the main classes
> of the machinery being:
> * [QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L26]
> * In addition, I created a lazy (but adequate?) wrapper for re-using a quad
> table as a triple table:
> [TripleTableFromQuadTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/TripleTableFromQuadTable.java#L30]
> * The DatasetGraph wrapper:
> [DatasetGraphQuadsImpl.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/DatasetGraphQuadsImpl.java#L32]
> The actual factory code then uses:
> {code:java}
> public static DatasetGraph createOrderPreservingDatasetGraph() {
>     QuadTable quadTable = new QuadTableFromNestedMaps();
>     TripleTable tripleTable = new TripleTableFromQuadTable(quadTable);
>     DatasetGraph result = new DatasetGraphInMemory(quadTable, tripleTable);
>     return result;
> }
> {code}
> Note that DatasetGraphQuadsImpl at present falsely claims that it is
> transaction aware - because otherwise any SPARQL insert caused an exception
> (I have not tried with the latest fixes for 3.15.0-SNAPSHOT yet). In any
> case, for the use case of writing out RDF, transactions may not even be
> necessary, but if there is an easy way to add them, then it should be done.
> An example of the above code in action is here:
> [Git Diff based on ordered turtle-blocks output|https://github.com/SmartDataAnalytics/lodservatory/commit/ec50cd33230a771c557c1ed2751799401ea3fd89]
> The downside of using this kind of order preserving dataset is that
> essentially it only features a gspo index. Hence, the performance
> characteristics of this kind of order preserving dataset - which is intended
> mostly for serialization or presentation - vary greatly from the
> query-optimized implementations.
> In any case, order preserving datasets are a highly useful feature for Jena
> and I'd gladly contribute a PR for that. My main questions are:
> * How to name the factory methods in DatasetFactory, DatasetGraphFactory etc
> - createOrderPreservingDataset?
> * Is the approach using QuadTableFromNestedMaps needed - or can a different
> implementation of QuadTable be repurposed?
> * It seems that the abstract class DatasetGraphQuads does not have any
> implementation at least in ARQ and the jena modules I use (according to
> eclipse) - so my custom implementation of DatasetGraphQuadsImpl seems to be
> needed, or is there a similar class lying around in another jena package?