Re: Why DatasetGraphInMemory?

Andy Seaborne Sat, 20 May 2023 06:19:57 -0700

Hi Arne,

On 19/05/2023 21:21, Arne Bernhardt wrote:

Hi,
in a recent response
<https://github.com/apache/jena/issues/1867#issuecomment-1546931793> to an
issue it was said that "Fuseki - uses DatasetGraphInMemory mostly".
For my PR <https://github.com/apache/jena/pull/1865>, I added a JMH
benchmark suite to the project. So it was easy for me to compare the
performance of GraphMem with
"DatasetGraphFactory.createTxnMem().getDefaultGraph()".
DatasetGraphInMemory is much slower in every discipline tested (#add,
#delete, #contains, #find, #stream).
Maybe my approach is too naive?
I understand very well that the underlying Dexx Collections Framework, with
its immutable persistent data structures, makes threading and transaction
handling easy


DatasetGraphInMemory (TIM = Transactions In Memory) has one big advantage.

It supports multiple readers and a single writer (MR+SW) at the same time - truly concurrent. So does TDB2 (TDB1 is sort of a hybrid).

MR+SW has a cost, which is a copy-on-write overhead: a reader-centric design choice that allows the readers to run latch-free.

You can't directly use a regular hash map with concurrent updates. (And no, ConcurrentHashMap does not solve all problems, even for a single data structure. A dataset needs to coordinate changes to multiple data structures into a single transactional unit.)
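
A minimal sketch of the copy-on-write idea, in plain Java and not taken from TIM's actual code. The class name CowTripleSet and the use of strings as stand-ins for triples are assumptions for illustration only. A reader grabs the current immutable snapshot and never takes a lock; the single writer copies, modifies the copy, and publishes it in one step.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration of MR+SW via copy-on-write; not Jena code. */
class CowTripleSet {
    // Readers see whichever immutable snapshot was current when they started.
    private final AtomicReference<Set<String>> current =
            new AtomicReference<>(Collections.<String>emptySet());

    // Serializes writers only; readers never touch this lock.
    private final ReentrantLock writerLock = new ReentrantLock();

    /** "Begin read": return the current snapshot, latch-free. */
    Set<String> snapshot() {
        return current.get();
    }

    /** "Write transaction": copy, modify the copy, publish atomically. */
    void add(String triple) {
        writerLock.lock();
        try {
            // This full copy is the copy-on-write cost. It is O(n) here;
            // persistent data structures reduce it by sharing structure.
            Set<String> copy = new HashSet<>(current.get());
            copy.add(triple);
            current.set(Collections.unmodifiableSet(copy));
        } finally {
            writerLock.unlock();
        }
    }
}
```

A reader that took its snapshot before a write keeps seeing the old state, which is what lets readers and the writer overlap. A real dataset would have to publish several indexes together as one unit, which this single-set sketch does not show.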

GraphMem cannot do MR+SW. For all storage datasets/graphs that do not have built-in support for MR+SW, the best that can be done is MRSW: multiple readers or a single writer.

For MRSW, when a writer starts, the system has to hold up subsequent readers, let existing ones finish, then let the writer run, then release any readers held up. (Variations are possible, such as whether readers or writers get priority.)
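
The MRSW behaviour described above can be sketched with a standard read-write lock. This is plain Java, not how Jena wires up GraphMem; the class name MrswTripleSet and the string stand-ins for triples are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical illustration of MRSW locking; not Jena code. */
class MrswTripleSet {
    // An ordinary, non-concurrent set: safe only because of the lock below.
    private final Set<String> triples = new HashSet<>();

    // fair = true: waiting threads are served roughly in arrival order,
    // which is one of the possible priority variations.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    /** Many readers may hold the read lock at once, but not during a write. */
    boolean contains(String triple) {
        lock.readLock().lock();
        try {
            return triples.contains(triple);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** The writer waits for active readers to finish, then runs alone. */
    void add(String triple) {
        lock.writeLock().lock();
        try {
            triples.add(triple);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

While add holds the write lock, every call to contains from another thread blocks, so a long-running writer stalls all readers for its duration.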


This is bad in a general concurrent environment, e.g. Fuseki.

One writer can "accidentally" lock out the dataset.

Maybe the application isn't doing updates, in which case a memory dataset focused on read throughput is better, especially with better triple density in memory.

Maybe the application is single-threaded or can control threads itself (non-Fuseki).

and that there are no issues with consuming iterators or
streams even after a read transaction has closed.

Continuing to use an iterator after the end of a transaction should not be allowed.

Is it currently supported for consumers to use iterators and streams after
a transaction has been closed?


Consumers that want this must copy the iterator - it's an explicit opt-in.

Does this happen with Dexx? It may do, because Dexx relies on the garbage collector, so some things just happen.

If so, I don't currently see an easy way to
replace DatasetGraphInMemory with a faster implementation. (although
transaction-aware iterators that copy the remaining elements into lists
could be an option).

Copy-iterators are going to be expensive in RAM (a denial-of-service issue) and in speed (a lesser issue, possibly).

Are there other reasons why DatasetGraphInMemory is the preferred dataset
implementation for Fuseki?

MR+SW in an environment where there is no other information about requirements is the safe choice.

If an app wants to trade the issues of MRSW for better performance, it is a choice it needs to make. One case for Fuseki is publishing relatively static data, e.g. reference data with changes coming from a known, well-behaved application.

Both a general-purpose TIM and a higher-density, faster dataset have their places.


    Andy


Cheers,
Arne
