Hi Arne,

On 19/05/2023 21:21, Arne Bernhardt wrote:
Hi,
in a recent response
<https://github.com/apache/jena/issues/1867#issuecomment-1546931793> to an
issue it was said that "Fuseki - uses DatasetGraphInMemory mostly".
For my PR <https://github.com/apache/jena/pull/1865>, I added a JMH
benchmark suite to the project. So it was easy for me to compare the
performance of GraphMem with
"DatasetGraphFactory.createTxnMem().getDefaultGraph()".
DatasetGraphInMemory is much slower in every discipline tested (#add,
#delete, #contains, #find, #stream).
Maybe my approach is too naive?
I understand very well that the underlying Dexx Collections Framework, with
its immutable persistent data structures, makes threading and transaction
handling easy

DatasetGraphInMemory (TIM = Transactions In Memory) has one big advantage.

It supports multiple readers and a single writer (MR+SW) at the same time - truly concurrent. So does TDB2 (TDB1 is a sort of hybrid).

MR+SW has a cost: copy-on-write overhead. It is a reader-centric design choice that lets the readers run latch-free.

You can't directly use a regular hash map with concurrent updates. (And no, ConcurrentHashMap does not solve all problems, even for a single data structure. A dataset needs to coordinate changes to multiple data structures into a single transactional unit.)
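
As a rough illustration of the MR+SW idea, here is a toy sketch using only the JDK. It is not how TIM is implemented, and `CowTripleStore` and its methods are made-up names. Readers pick up an immutable snapshot without taking any lock; the single writer works on a copy and publishes it in one atomic step. A real dataset would publish all of its indexes together in one snapshot object, and persistent structures such as Dexx's share most of their structure instead of copying everything as this sketch does.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

/** Hypothetical toy store (not Jena code): readers see immutable snapshots. */
final class CowTripleStore {
    private final AtomicReference<Set<String>> current =
            new AtomicReference<>(Collections.<String>emptySet());
    // Only one writer at a time; readers never touch this lock.
    private final ReentrantLock writerLock = new ReentrantLock();

    /** Readers take the current snapshot; they are never blocked by the writer. */
    Set<String> snapshot() {
        return current.get();
    }

    /** The writer copies, changes the copy, then publishes it atomically. */
    void write(Consumer<Set<String>> update) {
        writerLock.lock();
        try {
            Set<String> copy = new HashSet<>(current.get()); // the copy-on-write cost
            update.accept(copy);
            current.set(Collections.unmodifiableSet(copy));  // "commit"
        } finally {
            writerLock.unlock();
        }
    }
}
```

A reader that called `snapshot()` before a write keeps seeing the old state until it asks for a new snapshot, which is what lets readers and the writer run at the same time.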

GraphMem cannot do MR+SW. For all storage datasets/graphs that do not have built-in support for MR+SW, the best that can be done is MRSW - multiple readers or a single writer.

For MRSW, when a writer starts, the system has to hold up subsequent readers, let existing ones finish, then let the writer run, then release any readers held up. (Variations are possible, e.g. whether readers or writers get priority.)
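
The MRSW behaviour described above can be sketched with a plain JDK read-write lock around a non-concurrent structure. This is a toy with made-up names (`MrswTripleStore`), not the locking Jena actually uses. The lock is created in fair mode, which is one of the possible priority policies: new readers queue behind a waiting writer.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical MRSW wrapper (not Jena code) around a plain HashSet. */
final class MrswTripleStore {
    private final Set<String> triples = new HashSet<>();
    // Fair mode: a reader arriving while a writer is waiting queues behind it.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    /** Many readers may hold the read lock at once, but only if no writer holds the write lock. */
    boolean contains(String triple) {
        lock.readLock().lock();
        try {
            return triples.contains(triple);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** The writer waits for active readers to finish, then runs alone. */
    void add(String triple) {
        lock.writeLock().lock();
        try {
            triples.add(triple);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

While `add` holds the write lock, every `contains` call waits, so a long-running write stalls all readers.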

This is bad in a general concurrent environment, e.g. Fuseki.

One writer can "accidentally" lock out the dataset.

Maybe the application isn't doing updates, in which case a memory dataset focused on read throughput is better, especially with better triple density in memory.

Maybe the application is single threaded or can control threads itself (non-Fuseki).

and that there are no issues with consuming iterators or
streams even after a read transaction has closed.

Continuing to use an iterator after the end of a transaction should not be allowed.

Is it currently supported for consumers to use iterators and streams after
a transaction has been closed?

Consumers that want this must copy the iterator - it's an explicit opt-in.

Does this happen with Dexx? It may do, because Dexx relies on the garbage collector so some things just happen.
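
To make the opt-in concrete, here is a toy sketch using only the JDK. `ToyReadTxn` and its methods are invented for illustration and are not Jena API. The iterator refuses to be used once the transaction has ended, and a caller that wants results to outlive the transaction has to copy them while it is still open.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical toy read transaction (not Jena code). */
final class ToyReadTxn implements AutoCloseable {
    private final List<String> data;
    private boolean open = true;

    ToyReadTxn(List<String> data) {
        this.data = data;
    }

    /** Iterator that is only valid while the transaction is open. */
    Iterator<String> find() {
        Iterator<String> inner = data.iterator();
        return new Iterator<String>() {
            @Override public boolean hasNext() { check(); return inner.hasNext(); }
            @Override public String next() { check(); return inner.next(); }
        };
    }

    /** Explicit opt-in: copy the results so they can be used after close(). */
    List<String> findCopy() {
        List<String> out = new ArrayList<>();
        find().forEachRemaining(out::add); // costs RAM proportional to the result size
        return out;
    }

    private void check() {
        if (!open) throw new IllegalStateException("transaction has ended");
    }

    @Override public void close() {
        open = false;
    }
}
```

The list returned by `findCopy()` stays usable after `close()`, while an iterator from `find()` that escapes the transaction throws `IllegalStateException` on its next use.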

If so, I don't currently see an easy way to
replace DatasetGraphInMemory with a faster implementation. (although
transaction-aware iterators that copy the remaining elements into lists
could be an option).

Copy-iterators are going to be expensive in RAM (a denial-of-service issue) and in speed (possibly a lesser issue).

Are there other reasons why DatasetGraphInMemory is the preferred dataset
implementation for Fuseki?

MR+SW in an environment where there is no other information about requirements is the safe choice.

If an app wants to trade the issues of MRSW for better performance, that is a choice it needs to make. One case for Fuseki is publishing relatively static data, e.g. reference data where changes come from a known, well-behaved application.

Both a general purpose TIM and a higher density, faster dataset have their places.

    Andy


Cheers,
Arne
