On 13/01/2022 04:06, brain wrote:
Hello Andy and Jena,
Thanks for your kind reply.
Ok, I will try it first.
Another question: are there any small examples that show how to
implement StorageRDF, or any interface for external storage?
I also want to try to store and query the data in an RDBMS backend or a
KV-based storage.
If there are any examples I can follow, I can take baby steps and try to
make it.
TDB/TDB2 may be a good example, but it looks a little hard for me.
Look at the class hierarchy for DatasetGraphStorage. There is a really
simple implementation in DatasetGraphSimpleDB. It is for testing and
verification. It does not even have any indexing. It scans for all
"find" operations.
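To show the idea of "no indexing, scan for every find", here is a minimal Python sketch. It is not Jena code and does not reproduce DatasetGraphSimpleDB; the class name ScanStore and its methods are hypothetical, and None is assumed to mean a wildcard.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]


class ScanStore:
    """Hypothetical unindexed triple store: every find() is a full scan."""

    def __init__(self) -> None:
        self._triples: List[Triple] = []

    def add(self, s: str, p: str, o: str) -> None:
        triple = (s, p, o)
        # Set semantics: a triple is stored at most once.
        if triple not in self._triples:
            self._triples.append(triple)

    def find(self,
             s: Optional[str] = None,
             p: Optional[str] = None,
             o: Optional[str] = None) -> Iterator[Triple]:
        """Yield matching triples; None in a position matches anything."""
        pattern = (s, p, o)
        for triple in self._triples:
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                yield triple


store = ScanStore()
store.add("ex:a", "ex:knows", "ex:b")
store.add("ex:a", "ex:name", "Alice")
store.add("ex:b", "ex:name", "Bob")
store.add("ex:b", "ex:name", "Bob")  # duplicate, ignored
```

Every call costs time proportional to the size of the store, which is fine for testing and verification but not for real data.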
Then the question is how fast it needs to be and how much effort you want to put in.
The StorageRDF (the abstraction of triples and quads) gives a basic
level of access but maybe the storage engine can do joins natively.
The general purpose OpExecutor (SPARQL algebra execution) will work but
does not pass joins to the storage layer. It takes a storage-specific
extension of OpExecutor to do that.
OpExecutorTDB2 extends OpExecutor to execute basic graph patterns (BGPs),
which are blocks of multiple triple patterns.
Optimization of SPARQL is an open-ended area but a lot of the
improvements come from 2 optimizations: joins in BGPs, and filter placement.
In TDB2, joins are performed with "node ids", not the RDF terms, Nodes,
themselves. NodeId is a fixed length 64 bit number; Nodes are variable
length strings. In fact, until it needs them, TDB2 does not retrieve the
full node details of an RDF term. So if the variables are linking
patterns together and do not appear in filters or the final results, they
never get retrieved.
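A rough Python sketch of joining on ids and only fetching terms at the end. This is an illustration of the idea, not how TDB2 is implemented: NodeTable, load and friend_names are hypothetical names, and plain Python ints stand in for fixed-length 64-bit NodeIds. The query is "?person knows ?friend . ?friend name ?name" with only ?name in the results.

```python
class NodeTable:
    """Hypothetical dictionary between RDF terms (strings) and integer ids."""

    def __init__(self):
        self._ids = {}
        self._terms = []
        self.lookups = 0  # number of id -> term retrievals performed

    def id_for(self, term):
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def term_for(self, node_id):
        self.lookups += 1
        return self._terms[node_id]


def load(nodes, term_triples):
    """Encode triples of terms as triples of ids."""
    return [tuple(nodes.id_for(t) for t in triple) for triple in term_triples]


def friend_names(nodes, id_triples, knows, name):
    knows_id, name_id = nodes.id_for(knows), nodes.id_for(name)
    # Hash the "?friend name ?name" side by subject id.
    names_by_subject = {}
    for s, p, o in id_triples:
        if p == name_id:
            names_by_subject.setdefault(s, []).append(o)
    # Join "?person knows ?friend" against it, comparing ids only.
    name_ids = [n
                for (s, p, o) in id_triples if p == knows_id
                for n in names_by_subject.get(o, [])]
    # Only the projected variable is turned back into terms.
    return [nodes.term_for(i) for i in name_ids]


data = [("alice", "knows", "bob"), ("alice", "knows", "dave"),
        ("bob", "name", "Bob"), ("carol", "name", "Carol"),
        ("dave", "name", "Dave")]
nodes = NodeTable()
ids = load(nodes, data)
result = friend_names(nodes, ids, "knows", "name")
```

The join variable ?friend and the unprojected ?person are never converted back to terms; the lookups counter only goes up for the names that are returned.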
Doing joins better includes reordering the triple patterns so they execute
in a better order, and maybe extending to leftjoins (OPTIONAL).
Filter placement is also significant, and is especially noticeable on the
BSBM benchmark. It prunes work as early as possible.
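As a toy illustration of why placement matters (a sketch with made-up data, not the Jena optimizer), the hypothetical function below joins products with offers using a nested loop and applies a price filter either as soon as the price is bound or only after the join. It counts how many inner-loop probes are made in each case.

```python
def run(products, offers, max_price, filter_early):
    """Nested-loop join of products (id, price) with offers (product_id, vendor).

    Returns (rows, work), where work counts inner-loop probes.
    The filter is price < max_price.
    """
    work = 0
    rows = []
    for pid, price in products:
        # Early placement: test the filter as soon as price is bound.
        if filter_early and not price < max_price:
            continue
        for offer_pid, vendor in offers:
            work += 1
            if offer_pid != pid:
                continue
            # Late placement: test the filter only after the join.
            if not filter_early and not price < max_price:
                continue
            rows.append((pid, price, vendor))
    return rows, work


products = [("p1", 5), ("p2", 50), ("p3", 500)]
offers = [("p1", "v1"), ("p2", "v1"), ("p3", "v2"), ("p1", "v2")]

late_rows, late_work = run(products, offers, 10, filter_early=False)
early_rows, early_work = run(products, offers, 10, filter_early=True)
```

Both runs return the same rows, but the early placement skips the inner loop entirely for products that fail the filter.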
RDBMS:
The only general-purpose SQL-related storage systems I know of that still
exist work by supporting SPARQL execution inside the SQL engine itself,
not layered on top. Jena had SDB, which was layered, but performance for
both loading and query just wasn't good and scaling was poor. There was
too much overhead crossing the RDF-SQL boundary.
If however the data has an SQL schema, then R2RML is practical. Now the
SQL engine can employ native indexing and optimizations because it
"knows" the data shapes.
KV:
There are two cases, depending on whether the store keeps its keys sorted
(as RocksDB, LMDB and several others do).
TDB2 uses sorted key indexes, with keys only and no values, for triples
and quads.
If the keys are sorted, then storing the triples 3 times in SPO, POS and
OSP means there is always an index to match a triple pattern of concrete
terms and some wildcards. Two indexes are enough IF you assume that a
pattern always has a predicate.
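Here is a minimal Python sketch of the three-index idea, using sorted lists and bisect in place of a sorted KV store. It is an illustration under those assumptions, not TDB2 code; TripleIndexes, choose and find are hypothetical names, and None means a wildcard.

```python
import bisect

# Position of S, P, O (0, 1, 2) within each index's key.
ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}


def prefix_len(order, pattern):
    """Number of leading key slots that the pattern binds to concrete terms."""
    n = 0
    for i in order:
        if pattern[i] is None:
            break
        n += 1
    return n


class TripleIndexes:
    """Each triple is stored three times, as a key only, in sorted order."""

    def __init__(self, triples):
        unique = set(triples)
        self._idx = {name: sorted(tuple(t[i] for i in order) for t in unique)
                     for name, order in ORDERS.items()}

    @staticmethod
    def choose(pattern):
        """Pick the index whose key starts with the most concrete terms."""
        name = max(ORDERS, key=lambda n: prefix_len(ORDERS[n], pattern))
        return name, prefix_len(ORDERS[name], pattern)

    def find(self, s=None, p=None, o=None):
        pattern = (s, p, o)
        name, n = self.choose(pattern)
        order = ORDERS[name]
        prefix = tuple(pattern[i] for i in order[:n])
        keys = self._idx[name]
        # Seek to the first key >= prefix, then scan while the prefix matches.
        pos = bisect.bisect_left(keys, prefix)
        out = []
        while pos < len(keys) and keys[pos][:n] == prefix:
            triple = [None, None, None]
            for slot, i in enumerate(order):
                triple[i] = keys[pos][slot]
            out.append(tuple(triple))
            pos += 1
        return out


idx = TripleIndexes([("a", "knows", "b"), ("a", "name", "Alice"),
                     ("b", "name", "Bob"), ("c", "knows", "b")])
```

With these three orders, the concrete terms of any pattern always form a prefix of one of the keys, so find never has to filter within its range scan.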
There are several read-centric systems that use 6 indexes for a graph of
triples. It means any sort order is available and they can always do a
merge join.
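For reference, a merge join over two inputs already sorted on the join key looks roughly like this in Python. This is a generic textbook sketch, not taken from any particular system; the function name and the (key, value) row shape are assumptions for illustration.

```python
def merge_join(left, right):
    """Join two lists of (key, value) rows, each sorted by key.

    Returns (key, left_value, right_value) for every matching pair.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
joined = merge_join(left, right)
```

Each input is read once, front to back, which is why having an index in the right sort order for both sides is attractive.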
If the KV store does not provide a way to use it directly to solve a
pattern with wildcards, there is going to have to be some structure on
top of it to do so.
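One possible structure, purely as an assumption for illustration: with a get/put-only store (a Python dict stands in for it here), each triple can be filed under every pattern it matches, so that a pattern lookup becomes a single get. The class name HashKVTripleStore is hypothetical, and the cost is eight entries per triple.

```python
from itertools import product


class HashKVTripleStore:
    """Hypothetical layer over an unsorted KV store that only has get/put."""

    def __init__(self):
        self._kv = {}  # stands in for the external key-value store

    def add(self, s, p, o):
        triple = (s, p, o)
        # File the triple under each of the 8 bound/wildcard combinations.
        for mask in product((True, False), repeat=3):
            key = tuple(term if keep else None
                        for term, keep in zip(triple, mask))
            self._kv.setdefault(key, set()).add(triple)

    def find(self, s=None, p=None, o=None):
        """None is a wildcard; the pattern itself is the lookup key."""
        return set(self._kv.get((s, p, o), set()))


kv = HashKVTripleStore()
kv.add("a", "knows", "b")
kv.add("a", "name", "Alice")
kv.add("b", "name", "Bob")
```

This trades write cost and space for lookup simplicity; it gives no sort order, so it does not help with merge joins.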
Andy