Hi!

Two issues related to memory usage: import and delete of large graphs.

I am currently running some tests with a 128MB heap and a little over
1M tuples. I know I can throw a lot of memory at the problem, but
sooner or later I will run out.



I've noticed that TDB pulls the complete result set into memory when
calling "DatasetGraphTDB.deleteAny", before looping over the results to
delete them. This is a problem for very large graphs when I try to
delete the entire graph or a large selection.

I figured out a way to make the iterators backed by the indexes/nodes,
so each tuple can now be deleted directly from the iterator. I just hope
I have covered all cases by implementing remove() in RecordRangeIterator
and in TupleTable (which is connected to all the indexes). This was the
"easy" part.
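
In plain Java terms, the difference is roughly this (a simplified sketch
using java.util collections only, not the actual TDB classes; the names
are just for illustration):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    public class DeleteStyles {
        public static void main(String[] args) {
            List<String> store = new ArrayList<>(Arrays.asList("a", "b", "c"));

            // Today (as I read deleteAny): materialise every match first,
            // then loop and delete, so memory grows with the selection size.
            List<String> matches = new ArrayList<>(store);
            for (String match : matches)
                store.remove(match);

            // Proposed: delete directly through the iterator's remove(),
            // holding only one element at a time.
            store.addAll(Arrays.asList("a", "b", "c"));
            for (Iterator<String> it = store.iterator(); it.hasNext(); ) {
                it.next();
                it.remove();
            }
        }
    }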

The difficult part is the transaction and journal handling, which does
not write to the journal until the transaction is about to be committed.
This means that many Block objects are kept in memory in the HashMap
"BlockMgrJournal.writeBlocks".

Trying to fix this by writing to the journal directly runs into another
issue in all the unit tests that open multiple transactions: the journal
is not replayed onto the database files while any transaction is still
open. The reason BlockMgrJournal still works in those tests is that the
writeBlocks HashMap is never cleared after the transaction, so the other
transactions hit that map instead of the backing files.



I also encountered a case during import that led to a corrupt database
that I could not recover. I always got an exception from
"ObjectFileStorage.read" telling me that I had an "Impossibly large
object".

Those cases always started with an OutOfMemoryError during import while
writing to the database files. By lowering the Node2NodeIdCacheSize and
NodeId2NodeCacheSize caches and splitting the import files into smaller
batches/transactions, the import went fine. The reader seems to recover
if it just returns an empty ByteBuffer instead of throwing the
exception, but I guess that would only cover up a bad state. Maybe the
part where the journal is spooled onto the database files could be
optimized to avoid the OutOfMemoryError altogether, and thereby avoid
corrupt databases.
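
For reference, the batching workaround is essentially a loop like this
(a minimal sketch with current Jena package names; the database location
and chunk file names are made up, and the cache-size tuning is done
separately and not shown here):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.TDBFactory;

    public class BatchedImport {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("/path/to/tdb");
            // Input pre-split into smaller files, one write transaction per
            // chunk, so the journal never has to hold the whole import at once.
            String[] chunks = { "part-001.nt", "part-002.nt", "part-003.nt" };

            for (String chunk : chunks) {
                dataset.begin(ReadWrite.WRITE);
                try {
                    RDFDataMgr.read(dataset, chunk);
                    dataset.commit();
                } finally {
                    dataset.end();
                }
            }
        }
    }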


Should I open some issues in Jira?


I can provide some patches for the iterators' remove() functions.


Sincerely,

Knut-Olav Hoven
NRK, Norwegian Broadcasting Corporation
