Hi Arne,
On 19/05/2023 21:21, Arne Bernhardt wrote:
Hi,
in a recent response
<https://github.com/apache/jena/issues/1867#issuecomment-1546931793> to an
issue it was said that "Fuseki - uses DatasetGraphInMemory mostly" .
For my PR <https://github.com/apache/jena/pull/1865>, I added a JMH
benchmark suite to the project. So it was easy for me to compare the
performance of GraphMem with
"DatasetGraphFactory.createTxnMem().getDefaultGraph()".
DatasetGraphInMemory is much slower in every discipline tested (#add,
#delete, #contains, #find, #stream).
Maybe my approach is too naive?
I understand very well that the underlying Dexx Collections Framework, with
its immutable persistent data structures, makes threading and transaction
handling easy
DatasetGraphInMemory (TIM = Transactions In Memory) has one big advantage.
It supports multiple-readers and a single-writer (MR+SW) at the same
time - truly concurrent. So does TDB2 (TDB1 is sort of hybrid).
MR+SW has a cost: copy-on-write overhead. It is a reader-centric
design choice that allows the readers to run latch-free.
You can't directly use a regular hash map with concurrent updates. (And
no, ConcurrentHashMap does not solve all problems, even for a single
data structure. A dataset needs to coordinate changes to multiple
data structures into a single transactional unit.)
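To make the coordination point concrete, here is a minimal sketch in plain Java. It is not how TIM is implemented. The class CowStore, its two "indexes" and the string values are all made up for illustration. It copies whole sets on every write, whereas persistent data structures such as Dexx share structure between versions. The point is only that both indexes are published together as one immutable snapshot, so a reader never sees one updated without the other, and readers take no lock.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only: a hypothetical copy-on-write store, not Jena code. */
final class CowStore {
    /** Immutable snapshot holding two indexes that must always agree. */
    static final class Snapshot {
        final Set<String> subjects;
        final Set<String> objects;

        Snapshot(Set<String> subjects, Set<String> objects) {
            // Set.copyOf returns unmodifiable sets, so a snapshot never changes.
            this.subjects = Set.copyOf(subjects);
            this.objects = Set.copyOf(objects);
        }
    }

    private final AtomicReference<Snapshot> current =
            new AtomicReference<>(new Snapshot(Set.of(), Set.of()));

    // Only one writer at a time; readers never touch this lock.
    private final ReentrantLock writerLock = new ReentrantLock();

    /** Readers take the current snapshot without any lock (latch-free). */
    Snapshot read() {
        return current.get();
    }

    /** The writer copies both indexes, updates the copies, then publishes them together. */
    void add(String subject, String object) {
        writerLock.lock();
        try {
            Snapshot old = current.get();
            Set<String> subjects = new HashSet<>(old.subjects);
            Set<String> objects = new HashSet<>(old.objects);
            subjects.add(subject);
            objects.add(object);
            // Single atomic publish: this is the "transactional unit".
            current.set(new Snapshot(subjects, objects));
        } finally {
            writerLock.unlock();
        }
    }
}
```

A reader that called read() before the write keeps working on its old snapshot, which is where the copy-on-write cost comes from.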
GraphMem cannot do MR+SW - for all storage datasets/graphs that do not
have built-in support for MR+SW, the best that can be done is MRSW -
multiple readers or a single writer.
For MRSW, when a writer starts, the system has to hold up subsequent
readers, let existing ones finish, then let the writer run, then release
any readers held up. (variations possible - whether readers or writers
get priority).
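As a rough sketch of the MRSW behaviour just described, again in plain Java rather than Jena code: the class MrswStore and its string "triples" are hypothetical. It uses a fair ReentrantReadWriteLock, where a new reader blocks if a writer is already waiting, which approximates "hold up subsequent readers, let existing ones finish, then let the writer run". A non-fair lock would be one of the variations on priority.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative only: multiple readers OR a single writer around a plain HashSet. */
final class MrswStore {
    // fair = true: a thread asking for the read lock waits if a writer is queued.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private final Set<String> triples = new HashSet<>();

    /** Many threads may hold the read lock at once, but not while a writer holds the lock. */
    boolean contains(String triple) {
        lock.readLock().lock();
        try {
            return triples.contains(triple);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** The write lock is exclusive: it waits for active readers and blocks new ones. */
    void add(String triple) {
        lock.writeLock().lock();
        try {
            triples.add(triple);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

If the code inside add() were slow, every reader would wait for it. That is the lock-out risk.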
This is bad in a general concurrent environment, e.g. Fuseki.
One writer can "accidentally" lock out the dataset.
Maybe the application isn't doing updates, in which case a memory
dataset focused on read throughput is better, especially with better
triple density in memory.
Maybe the application is single threaded or can control threads itself
(non-Fuseki).
and that there are no issues with consuming iterators or
streams even after a read transaction has closed.
Continuing to use an iterator after the end of a transaction should not
be allowed.
Is it currently supported for consumers to use iterators and streams after
a transaction has been closed?
Consumers that want this must copy the iterator - it's an explicit opt-in.
Does this happen with Dexx? It may do, because Dexx relies on the
garbage collector so some things just happen.
If so, I don't currently see an easy way to
replace DatasetGraphInMemory with a faster implementation. (although
transaction-aware iterators that copy the remaining elements into lists
could be an option).
Copy-iterators are going to be expensive in RAM - a denial-of-service
issue - and in speed (a lesser issue, possibly).
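For illustration, the explicit opt-in copy amounts to something like the following plain-Java sketch. IteratorCopy is a made-up helper name, not a Jena class. The consumer drains the iterator into a list while the transaction is still open and then works from the list. The list holds every remaining element, which is the RAM cost.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative only: a hypothetical helper for copying an iterator. */
final class IteratorCopy {
    /** Drain the remaining elements into a new list; call this inside the transaction. */
    static <T> List<T> toList(Iterator<T> it) {
        List<T> copy = new ArrayList<>();
        it.forEachRemaining(copy::add);
        return copy;
    }
}
```

The returned list is independent of the source, so it stays usable after the transaction ends.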
Are there other reasons why DatasetGraphInMemory is the preferred dataset
implementation for Fuseki?
MR+SW in an environment where there is no other information about
requirements is the safe choice.
If an app wants to trade the issues of MRSW for better performance, that
is a choice it needs to make. One case for Fuseki is publishing
relatively static data - e.g. reference data, or changes from a known,
well-behaved application.
Both a general purpose TIM and a higher density, faster dataset have
their places.
Andy
Cheers,
Arne