On 13/01/2022 04:06, brain wrote:
Hello Andy and Jena,
       Thanks for your kind reply.
       Ok, I will try it first.

       Another question: are there any small examples that show how to 
implement StorageRDF,
       or any interface for external storage?
       I also want to try to store and query the data in an RDBMS backend or a 
KV-based storage.
       If there are any examples I can follow, I can take baby steps and try to 
make it.
       TDB/TDB2 may be a good example, but it looks a little hard for me.

Look at the class hierarchy for DatasetGraphStorage. There is a really simple implementation in DatasetGraphSimpleDB. It is for testing and verification. It does not even have any indexing; it scans all the data for every "find" operation.
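To make "no indexing, scan for every find" concrete, here is a minimal sketch in Python. It is not Jena code and not the DatasetGraphSimpleDB implementation; the class name ScanStore and its methods are made up for illustration. None plays the role of a wildcard.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]


class ScanStore:
    """Hypothetical unindexed triple store: every find() is a full scan."""

    def __init__(self) -> None:
        self._triples: List[Triple] = []

    def add(self, s: str, p: str, o: str) -> None:
        t = (s, p, o)
        if t not in self._triples:  # a graph is a set: ignore duplicates
            self._triples.append(t)

    def find(self, s: Optional[str] = None, p: Optional[str] = None,
             o: Optional[str] = None) -> Iterator[Triple]:
        # Look at every stored triple and keep those matching the pattern.
        for t in self._triples:
            if all(want is None or want == got
                   for want, got in zip((s, p, o), t)):
                yield t


store = ScanStore()
store.add("ex:alice", "foaf:knows", "ex:bob")
store.add("ex:alice", "foaf:name", '"Alice"')
store.add("ex:bob", "foaf:name", '"Bob"')

names = list(store.find(p="foaf:name"))
```

The cost of every find is proportional to the size of the data, which is fine for checking correctness and not much else.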

Then the question is how fast and how much effort.

The StorageRDF (the abstraction of triples and quads) gives a basic level of access but maybe the storage engine can do joins natively.

The general purpose OpExecutor (SPARQL algebra execution) will work but does not pass joins to the storage layer. It takes a storage-specific extension of OpExecutor to do that.

OpExecutorTDB2 extends OpExecutor to execute basic graph patterns, the block of multiple triple patterns.

Optimization of SPARQL is an open-ended area but a lot of the improvements come from 2 optimizations: joins in BGPs, and filter placement.

In TDB2, joins are performed with "node ids", not the RDF terms, Nodes, themselves. NodeId is a fixed-length 64-bit number; Nodes are variable-length strings. In fact, until it needs them, TDB2 does not retrieve the full node details of an RDF term. So if the variables are linking patterns together and do not appear in filters or the final results, they never get retrieved.
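A rough sketch of the idea, in Python rather than Jena code: terms are interned to integers once, the join compares integers only, and a term is looked up again only when it is needed for the result. NodeTable, intern and term are hypothetical names, and the sequential integer ids are an assumption of this sketch, not how TDB2 allocates its NodeIds.

```python
class NodeTable:
    """Hypothetical dictionary mapping RDF terms to integer ids and back."""

    def __init__(self):
        self._id_of = {}
        self._term_of = []
        self.lookups = 0  # how many times an id was turned back into a term

    def intern(self, term):
        nid = self._id_of.get(term)
        if nid is None:
            nid = len(self._term_of)
            self._id_of[term] = nid
            self._term_of.append(term)
        return nid

    def term(self, nid):
        self.lookups += 1
        return self._term_of[nid]


nodes = NodeTable()
data = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", '"Bob"'),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:carol", "foaf:name", '"Carol"'),
]
# Triples are stored as tuples of ids only.
triples = [tuple(nodes.intern(x) for x in t) for t in data]

ALICE = nodes.intern("ex:alice")
KNOWS = nodes.intern("foaf:knows")
NAME = nodes.intern("foaf:name")

# BGP:  ex:alice foaf:knows ?x .  ?x foaf:name ?name
# ?x only links the two patterns, so it is compared as an integer
# and never converted back to a term.
name_ids = [
    o2
    for (s1, p1, o1) in triples if s1 == ALICE and p1 == KNOWS
    for (s2, p2, o2) in triples if s2 == o1 and p2 == NAME
]

# Only the projected variable ?name is materialized.
names = [nodes.term(i) for i in name_ids]
```

Here two terms are fetched for the results and none for the join variable.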

Doing joins better includes reordering the triple patterns so they execute in a better order, and maybe extending the approach to left joins (OPTIONAL).

Filter placement, especially noticeable for the BSBM benchmark, is also significant. It prunes work as early as possible.
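A toy illustration of why placement matters, in Python with made-up data (100 people, each knowing one other person). The join helper is a plain hash join written for this sketch, not anything from Jena. Both plans give the same answer; the difference is how many rows flow through the join.

```python
# Hypothetical data: ?p has an age, and ?p knows ?q.
people = [{"p": f"ex:p{i}", "age": i} for i in range(100)]
knows = [{"p": f"ex:p{i}", "q": f"ex:p{(i + 1) % 100}"} for i in range(100)]


def join(left, right, key):
    """Simple hash join of two lists of binding rows on one variable."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def wanted(row):
    # FILTER (?age >= 95): only needs ?age, which the first pattern binds.
    return row["age"] >= 95


# Filter placed last: the join processes every row first.
late_intermediate = join(people, knows, "p")
late = [r for r in late_intermediate if wanted(r)]

# Filter placed right after the pattern that binds ?age.
early_input = [r for r in people if wanted(r)]
early = join(early_input, knows, "p")
```

In this example the late plan pushes 100 rows through the join and the early plan pushes 5.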

RDBMS:

The only general-purpose SQL-related storage systems I know of that still exist work by supporting SPARQL execution inside the SQL engine itself, not layered on top. Jena had SDB, which was layered, but performance for both loading and query just wasn't good and scaling was poor. There was too much overhead crossing the RDF-SQL boundary.

If, however, the data has an SQL schema, then R2RML is practical. Now the SQL engine can employ native indexing and optimizations because it "knows" the data shapes.

KV:

There are two cases, depending on whether the keys are sorted (as they are in RocksDB, LMDB and several others) or not.

TDB2 uses sorted key indexes, with no value, only keys, for triples and quads.

If the keys are sorted, then storing the triples 3 times in SPO, POS and OSP means there is always an index to match a triple pattern of concrete terms and some wildcards. Two indexes are enough IF you assume that a pattern always has a predicate.
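As a sketch of how that works, assuming nothing more than a sorted list of keys per index (a stand-in for a sorted KV store; this is not TDB2 code and the names are hypothetical): pick the index whose key order starts with the concrete terms of the pattern, then do a prefix range scan.

```python
from bisect import bisect_left

# Index name -> positions of (s, p, o) in key order.
ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}


class SortedTripleIndexes:
    """Hypothetical triple store: three sorted key lists, keys only, no values."""

    def __init__(self, triples):
        self.indexes = {
            name: sorted(tuple(t[i] for i in order) for t in set(triples))
            for name, order in ORDERS.items()
        }

    @staticmethod
    def choose(pattern):
        """Pick the index whose key starts with all the concrete terms."""
        bound = {i for i, term in enumerate(pattern) if term is not None}
        for name, order in ORDERS.items():
            if set(order[:len(bound)]) == bound:
                return name
        raise AssertionError("unreachable with SPO, POS and OSP")

    def find(self, s=None, p=None, o=None):
        pattern = (s, p, o)
        name = self.choose(pattern)
        order = ORDERS[name]
        n = sum(term is not None for term in pattern)
        prefix = tuple(pattern[i] for i in order[:n])
        keys = self.indexes[name]
        k = bisect_left(keys, prefix)  # seek to the first key with this prefix
        while k < len(keys) and keys[k][:n] == prefix:
            triple = [None, None, None]
            for slot, value in zip(order, keys[k]):
                triple[slot] = value  # put the terms back in s, p, o order
            yield tuple(triple)
            k += 1


data = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob", "foaf:name", '"Bob"'),
]
idx = SortedTripleIndexes(data)
by_name = list(idx.find(p="foaf:name"))
```

In this sketch the pattern with S and O concrete is the only one that needs OSP, and if P is always concrete then SPO and POS between them cover every remaining pattern.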

There are several read-centric systems that use 6 indexes for a graph of triples. It means any sort order is available and they can always do a merge join.
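A minimal merge join sketch in Python, to show why the sort order matters. The function and the sample data are made up for illustration. For the pattern pair "?x foaf:knows ?y . ?y foaf:name ?n", the first input has to arrive sorted by object (a POS scan gives that) and the second sorted by subject (which needs a PSO order, one that is not among SPO, POS and OSP).

```python
def merge_join(left, right):
    """Join two lists of (key, payload) pairs, both already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for a in left[i:i_end]:
                for b in right[j:j_end]:
                    out.append((lk, a[1], b[1]))
            i, j = i_end, j_end
    return out


# (?y, ?x) from "?x foaf:knows ?y", as a POS scan would order it.
knows_by_object = sorted([
    ("ex:bob", "ex:alice"),
    ("ex:carol", "ex:alice"),
    ("ex:bob", "ex:dave"),
])
# (?y, ?n) from "?y foaf:name ?n", as a PSO scan would order it.
name_by_subject = sorted([
    ("ex:bob", '"Bob"'),
    ("ex:carol", '"Carol"'),
    ("ex:erin", '"Erin"'),
])

rows = merge_join(knows_by_object, name_by_subject)
```

Each input is read once, front to back, with no hash table and no repeated probing of the second pattern.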

If the KV store does not provide a way to use it directly to solve a pattern with wildcards, there is going to have to be some structure on top of it to do so.
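One possible shape for such a structure, sketched in Python with a plain dict standing in for an unsorted KV store that only supports exact-key get and put. This is just one option among many and all the names are hypothetical: keep a posting set per (position, term) and intersect the sets for the concrete terms of the pattern.

```python
class HashKVTripleStore:
    """Hypothetical layer over an unsorted KV store (exact-key get/put only)."""

    def __init__(self):
        self.kv = {}  # stand-in for the external KV store

    def add(self, s, p, o):
        t = (s, p, o)
        # One posting set per (position, term), plus one for "everything".
        for key in (("all",), ("s", s), ("p", p), ("o", o)):
            self.kv.setdefault(key, set()).add(t)

    def find(self, s=None, p=None, o=None):
        keys = [(pos, term)
                for pos, term in (("s", s), ("p", p), ("o", o))
                if term is not None]
        if not keys:
            return set(self.kv.get(("all",), set()))
        postings = [self.kv.get(k, set()) for k in keys]
        postings.sort(key=len)  # start from the smallest posting set
        result = set(postings[0])
        for other in postings[1:]:
            result &= other
        return result


kvstore = HashKVTripleStore()
kvstore.add("ex:alice", "foaf:knows", "ex:bob")
kvstore.add("ex:alice", "foaf:name", '"Alice"')
kvstore.add("ex:bob", "foaf:name", '"Bob"')

alice_triples = kvstore.find(s="ex:alice")
```

The trade-off compared with sorted keys is that results come back in no particular order, so there is no merge join, and posting sets for common terms (a frequent predicate, say) can get large.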

    Andy
