Re: how to implement your own tripe-storage engine with inference features

Andy Seaborne Sat, 15 Jan 2022 04:01:36 -0800



On 13/01/2022 04:06, brain wrote:

Hello Andy and Jena,
       Thanks for your kind reply.
       OK, I will try it first.

       Another question: are there any small examples that show how to
implement StorageRDF,
       or any interface for external storage?
       I also want to try to store and query the data in an RDBMS backend or a
KV-based storage.
       If there are any examples I can follow, I can take baby steps and try to
make it work.
       TDB/TDB2 may be a good example, but it looks a little hard for me.

Look at the class hierarchy for DatasetGraphStorage. There is a really simple implementation in DatasetGraphSimpleDB. It is for testing and verification. It does not even have any indexing; it scans for all "find" operations.
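
To illustrate what "no indexing, scan for every find" means, here is a minimal sketch in Python. It is not Jena code and does not mirror DatasetGraphSimpleDB; the class name ScanStore and its methods are hypothetical, chosen only to show the idea of a pattern match with wildcards over an unindexed collection.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]


class ScanStore:
    """Toy triple store: a plain list with no indexes (hypothetical, not a Jena class)."""

    def __init__(self) -> None:
        self._triples: List[Triple] = []

    def add(self, s: str, p: str, o: str) -> None:
        triple = (s, p, o)
        # Linear duplicate check: an RDF graph is a set of triples.
        if triple not in self._triples:
            self._triples.append(triple)

    def delete(self, s: str, p: str, o: str) -> None:
        if (s, p, o) in self._triples:
            self._triples.remove((s, p, o))

    def find(self, s: Optional[str] = None, p: Optional[str] = None,
             o: Optional[str] = None) -> Iterator[Triple]:
        """None is a wildcard. Every call walks the whole list."""
        pattern = (s, p, o)
        for triple in self._triples:
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                yield triple


store = ScanStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
print(list(store.find(p="knows", o="carol")))  # [('bob', 'knows', 'carol')]
```

Something this small is useful as a reference to verify a faster implementation against, which matches the "testing and verification" role described above.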


Then the question is how fast and how much effort.

StorageRDF (the abstraction of triples and quads) gives a basic level of access, but maybe the storage engine can do joins natively.

The general-purpose OpExecutor (SPARQL algebra execution) will work but does not pass joins to the storage layer. It takes a storage-specific extension of OpExecutor to do that.

OpExecutorTDB2 extends OpExecutor to execute basic graph patterns (BGPs), the blocks of multiple triple patterns.

Optimization of SPARQL is an open-ended area, but a lot of the improvements come from two optimizations: joins in BGPs, and filter placement.

In TDB2, joins are performed with "node ids", not the RDF terms (Nodes) themselves. A NodeId is a fixed-length 64-bit number; Nodes are variable-length strings. In fact, TDB2 does not retrieve the full node details of an RDF term until it needs them. So if variables only link patterns together and do not appear in filters or the final results, their terms never get retrieved.
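
A minimal Python sketch of the idea, under the assumption of a simple in-memory dictionary encoding (this is not how TDB2 lays out its node table; NodeTable and its lookups counter are hypothetical names for illustration). The join compares integers only, and terms are looked up just for the variable that reaches the results.

```python
class NodeTable:
    """Maps RDF terms to small integer ids and back (hypothetical sketch)."""

    def __init__(self):
        self._id_of = {}
        self._term_of = []
        self.lookups = 0  # counts id -> term retrievals

    def intern(self, term):
        nid = self._id_of.get(term)
        if nid is None:
            nid = len(self._term_of)
            self._id_of[term] = nid
            self._term_of.append(term)
        return nid

    def term(self, nid):
        self.lookups += 1
        return self._term_of[nid]


nodes = NodeTable()
data = [
    ("alice", "knows", "bob"),
    ("bob", "name", '"Bob"'),
    ("alice", "knows", "carol"),
    ("carol", "name", '"Carol"'),
]
# Triples are stored as tuples of ids, not strings.
triples = [tuple(nodes.intern(t) for t in triple) for triple in data]

alice, knows, name = (nodes.intern(t) for t in ("alice", "knows", "name"))

# { alice knows ?x . ?x name ?n }   SELECT ?n
# ?x only links the two patterns, so it stays an id throughout.
result_ids = []
for s1, p1, x in triples:
    if s1 == alice and p1 == knows:
        for s2, p2, n in triples:
            if s2 == x and p2 == name:
                result_ids.append(n)

# Only the projected variable ?n is turned back into terms.
names = [nodes.term(n) for n in result_ids]
print(names, nodes.lookups)  # ['"Bob"', '"Carol"'] 2
```

The terms for bob and carol are never fetched, even though both take part in the join.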

Doing joins better includes reordering to execute in a better order, and maybe extending to left joins (OPTIONAL).
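
As a rough illustration of reordering, the Python sketch below greedily picks the next triple pattern with the most positions that are either concrete terms or variables already bound by earlier patterns. This is a simple heuristic chosen for the example, not the reordering TDB2 actually performs; the function names are hypothetical.

```python
def is_var(term):
    return term.startswith("?")


def reorder(patterns):
    """Greedy BGP reordering: most-constrained pattern first (illustrative heuristic)."""
    remaining = list(patterns)
    bound = set()      # variables bound by patterns already chosen
    ordered = []
    while remaining:
        def score(pattern):
            # A position counts if it is a concrete term or an already-bound variable.
            return sum(1 for t in pattern if not is_var(t) or t in bound)

        best = max(remaining, key=score)  # first pattern wins on ties
        remaining.remove(best)
        ordered.append(best)
        bound.update(t for t in best if is_var(t))
    return ordered


bgp = [
    ("?x", "name", "?n"),
    ("?x", "knows", "?y"),
    ("alice", "knows", "?x"),
]
for pattern in reorder(bgp):
    print(pattern)
```

With this input the pattern with two concrete terms, ("alice", "knows", "?x"), runs first, and the remaining patterns then start with ?x already bound instead of scanning every "name" or "knows" triple.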

Filter placement, especially noticeable for the BSBM benchmark, is also significant. It is about pruning work as soon as possible.
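
A small Python sketch of why placement matters, using made-up data (100 people with ages, each knowing two others). Both functions answer { ?p age ?a . ?p knows ?q . FILTER(?a > 90) }; the only difference is whether the filter runs after the join or as soon as ?a is bound. The data and function names are assumptions for illustration.

```python
ages = {f"p{i}": i for i in range(100)}
knows = {f"p{i}": [f"p{(i + 1) % 100}", f"p{(i + 2) % 100}"] for i in range(100)}


def late_filter():
    """Join first, filter last: every joined row is produced, then most are discarded."""
    joined, out = 0, []
    for p, a in ages.items():
        for q in knows[p]:
            joined += 1
            if a > 90:
                out.append((p, q))
    return out, joined


def early_filter():
    """Filter as soon as ?a is bound: the join only runs for surviving rows."""
    joined, out = 0, []
    for p, a in ages.items():
        if a > 90:
            for q in knows[p]:
                joined += 1
                out.append((p, q))
    return out, joined


late_rows, late_work = late_filter()
early_rows, early_work = early_filter()
print(late_rows == early_rows, late_work, early_work)  # True 200 18
```

The answers are identical, but the early placement does the join step 18 times instead of 200.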


RDBMS:

The only general-purpose SQL-related storage systems I know of that still exist work by having support for SPARQL execution inside the SQL engine itself, not layered on top. Jena had SDB, which was layered, but performance for both loading and query just wasn't good and scaling was poor. Too much overhead crossing the RDF-SQL boundary.

If, however, the data has an SQL schema, then R2RML is practical. Now the SQL engine can employ native indexing and optimizations because it "knows" the data shapes.

KV:

There are two cases, depending on whether the keys are sorted (as in, e.g., RocksDB, LMDB and several others).

TDB2 uses sorted key indexes with no values, only keys, for triples and quads.

If the keys are sorted, then storing the triples 3 times, in SPO, POS and OSP order, means there is always an index to match a triple pattern of concrete terms and some wildcards. Two indexes are enough IF you assume that a pattern always has a predicate.
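
The claim can be checked with a small Python sketch in which a sorted list and the standard bisect module stand in for a sorted KV store with range scans. The classes SortedIndex and TripleIndexes are hypothetical and only illustrate the index-selection idea; they say nothing about how TDB2 or any particular KV store is implemented.

```python
import bisect


class SortedIndex:
    """Keys only, kept sorted; 'order' is a permutation such as 'POS'."""

    def __init__(self, order):
        self.order = order
        self.keys = []

    def add(self, triple):
        slots = dict(zip("SPO", triple))
        key = tuple(slots[c] for c in self.order)
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            self.keys.insert(i, key)

    def prefix_scan(self, prefix):
        """Range scan: seek to the prefix, read while keys still match it."""
        i = bisect.bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i][:len(prefix)] == prefix:
            slots = dict(zip(self.order, self.keys[i]))
            yield (slots["S"], slots["P"], slots["O"])
            i += 1


class TripleIndexes:
    def __init__(self):
        self.indexes = [SortedIndex(o) for o in ("SPO", "POS", "OSP")]

    def add(self, s, p, o):
        for index in self.indexes:
            index.add((s, p, o))

    def choose(self, s=None, p=None, o=None):
        """Return (index, prefix) where the prefix holds every concrete term."""
        pattern = {"S": s, "P": p, "O": o}

        def prefix_len(index):
            n = 0
            for c in index.order:
                if pattern[c] is None:
                    break
                n += 1
            return n

        best = max(self.indexes, key=prefix_len)
        n = prefix_len(best)
        return best, tuple(pattern[c] for c in best.order[:n])

    def find(self, s=None, p=None, o=None):
        index, prefix = self.choose(s, p, o)
        return list(index.prefix_scan(prefix))


store = TripleIndexes()
store.add("alice", "knows", "bob")
store.add("alice", "knows", "carol")
store.add("bob", "knows", "carol")

index, prefix = store.choose(s="alice", o="carol")
print(index.order, prefix)                  # OSP ('carol', 'alice')
print(store.find(s="alice", o="carol"))     # [('alice', 'knows', 'carol')]
```

For each of the eight combinations of concrete terms and wildcards, one of the three orders has all the concrete terms as a leading prefix, so no match needs post-filtering.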

There are several read-centric systems that use 6 indexes for a graph of triples. It means any sort order is available and they can always do a merge join.
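
For reference, a merge join over two sorted inputs can be sketched in a few lines of Python. The example assumes both inputs are already sorted on the join variable and contain no duplicates, which is what reading from a suitably ordered index would give; the data and the function name are made up for illustration.

```python
def merge_join(left, right):
    """Join two ascending, duplicate-free sequences on equality in a single pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1      # left value is too small to match anything remaining on the right
        else:
            j += 1
    return out


# { ?x knows bob . ?x type Person }
# Each list is what a scan of a P-O-S ordered index with a fixed (P, O) prefix
# would return: the matching subjects, already in sorted order.
knows_bob = ["alice", "carol", "dave", "erin"]
is_person = ["alice", "bob", "dave", "frank"]

print(merge_join(knows_bob, is_person))  # ['alice', 'dave']
```

Neither input is buffered or hashed; the cost is one pass over each side. With only some of the six orders available, one side may not come back sorted on the join variable, and a different join strategy is needed.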

If the KV store does not provide a way to use it directly to solve a pattern with wildcards, there is going to have to be some structure on top of it to do so.


    Andy
