Thank you Andy. It’s so nice to talk with you. I’ll look into DatasetGraphSimpleDB later.
I much appreciate your detailed and professional answer, and I will spend
more time digesting it. I will email again if I have any further questions.
Thank you very much.

Best Regards,
Brain

> On Jan 15, 2022, at 8:01 PM, Andy Seaborne <a...@apache.org> wrote:
>
>
> On 13/01/2022 04:06, brain wrote:
>> Hello Andy and Jena,
>> Thanks for your kind reply.
>> OK, I will try it first.
>> Another question: are there any small examples that show how to
>> implement StorageRDF,
>> or any interface for external storage?
>> I also want to try to store and query the data in an RDBMS backend or a
>> KV-based storage.
>> If there are any examples I can follow, I can take baby steps and try to
>> make it.
>> TDB/TDB2 may be a good example, but it looks a little hard for me.
>
> Look at the class hierarchy for DatasetGraphStorage. There is a really simple
> implementation in DatasetGraphSimpleDB. It is for testing and verification.
> It does not even have any indexing. It scans for all "find" operations.
>
> Then the question is how fast and how much effort.
>
> The StorageRDF (the abstraction of triples and quads) gives a basic level of
> access but maybe the storage engine can do joins natively.
>
> The general purpose OpExecutor (SPARQL algebra execution) will work but does
> not pass joins to the storage layer. It takes a storage-specific extension
> of OpExecutor to do that.
>
> OpExecutorTDB2 extends OpExecutor to execute basic graph patterns, the block
> of multiple triple patterns.
>
> Optimization of SPARQL is an open-ended area but a lot of the improvements
> come from 2 optimizations: joins in BGPs, and filter placement.
>
> In TDB2, joins are performed with "node ids", not the RDF terms, Nodes,
> themselves. NodeId is a fixed length 64 bit number; Nodes are variable length
> strings. In fact, until it needs them, TDB2 does not retrieve the full node
> details of an RDF term.
> So if the variables are linking patterns together and
> do not appear in filters or the final results, they never get retrieved.
>
> Doing joins better includes reordering to execute in a better order, and
> maybe extending to leftjoins (OPTIONAL).
>
> Filter placement, especially noticeable for the BSBM benchmark, is also
> significant. It is pruning work as soon as possible.
>
> RDBMS:
>
> The only general purpose SQL-related storage systems I know of that still
> exist work by supporting SPARQL execution inside the SQL engine itself, not
> layered on top. Jena had SDB which was layered but performance for both
> loading and query just wasn't good and scaling was poor. Too much overhead
> crossing the RDF-SQL boundaries.
>
> If however the data has an SQL schema, then R2RML is practical. Now the SQL
> engine can employ native indexing and optimizations because it "knows" the
> data shapes.
>
> KV:
>
> There are two cases, depending on whether the keys are sorted (e.g. RocksDB,
> LMDB and several others).
>
> TDB2 uses sorted key indexes, and no value, only keys, for triples and
> quads.
>
> If the keys are sorted, then storing the triples 3 times in SPO, POS and OSP
> means there is always an index to match a triple pattern of concrete terms
> and some wildcards. Two indexes are enough IF you assume that a pattern
> always has a predicate.
>
> There are several read-centric systems that use 6 indexes for a graph of
> triples. It means any sort order is available and they can always do a merge
> join.
>
> If the KV store does not provide a way to use it directly as a way to solve a
> pattern with wildcards, there is going to have to be some structure on top of
> it to do so.
>
> Andy
>
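
For reference, the sorted-key case described in the quoted reply can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Jena or TDB2 code: the class name TripleIndexSketch, the tab separator, and the use of an in-memory TreeSet in place of a real sorted KV store (RocksDB, LMDB) are all hypothetical choices. It stores each triple three times (SPO, POS, OSP) as keys with no values, and answers any pattern of concrete terms and wildcards with a prefix range scan on one of the three indexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not Jena code: three sorted, key-only indexes over triples.
class TripleIndexSketch {
    // Separator between terms in a key; assumed never to occur inside a term.
    private static final String SEP = "\t";

    // A TreeSet stands in for a sorted KV store; keys only, no values.
    private final TreeSet<String> spo = new TreeSet<>();
    private final TreeSet<String> pos = new TreeSet<>();
    private final TreeSet<String> osp = new TreeSet<>();

    // Store the triple three times, once per sort order.
    void add(String s, String p, String o) {
        spo.add(s + SEP + p + SEP + o + SEP);
        pos.add(p + SEP + o + SEP + s + SEP);
        osp.add(o + SEP + s + SEP + p + SEP);
    }

    // Range scan: all keys starting with the bound leading terms.
    // The first null term ends the prefix.
    private static List<String[]> scan(TreeSet<String> index, String... terms) {
        StringBuilder prefix = new StringBuilder();
        for (String t : terms) {
            if (t == null) break;
            prefix.append(t).append(SEP);
        }
        String pre = prefix.toString();
        List<String[]> rows = new ArrayList<>();
        for (String key : index.tailSet(pre, true)) {
            if (!key.startsWith(pre)) break; // left the contiguous key range
            rows.add(key.split(SEP));
        }
        return rows;
    }

    // Match a triple pattern; null is a wildcard. Results are {s, p, o} arrays.
    List<String[]> find(String s, String p, String o) {
        List<String[]> out = new ArrayList<>();
        if (s != null && (p != null || o == null)) {
            // S??, SP?, SPO: bound terms are a prefix of the SPO key.
            out.addAll(scan(spo, s, p, o));
        } else if (p != null) {
            // ?P?, ?PO: use POS, then reorder the key back to s, p, o.
            for (String[] k : scan(pos, p, o)) out.add(new String[] {k[2], k[0], k[1]});
        } else if (o != null) {
            // ??O, S?O: use OSP, then reorder the key back to s, p, o.
            for (String[] k : scan(osp, o, s)) out.add(new String[] {k[1], k[2], k[0]});
        } else {
            // ???: no bound terms, so scan everything.
            out.addAll(scan(spo));
        }
        return out;
    }

    public static void main(String[] args) {
        TripleIndexSketch idx = new TripleIndexSketch();
        idx.add(":alice", ":knows", ":bob");
        idx.add(":alice", ":name", "\"Alice\"");
        idx.add(":carol", ":knows", ":bob");
        for (String[] t : idx.find(null, ":knows", ":bob")) {
            System.out.println(String.join(" ", t));
        }
    }
}
```

The sketch keys on the term strings themselves to stay short. As the reply notes, TDB2 instead works with fixed-length NodeIds and only retrieves the full terms when they are needed, so a real implementation would likely add a separate term-to-id table.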