Hi! Two related issues concerning memory usage: import and deletion of large graphs.
I am currently doing some tests with a 128 MB heap and a little over 1M tuples. I know I can throw a lot of memory at the problem, but sooner or later I will run out.

I've noticed that TDB pulls the complete result set into memory in "DatasetGraphTDB.deleteAny" before looping over it to delete the tuples. That is a problem for very large graphs when I try to delete the entire graph or a large selection. I have found a way to back the iterators by the indexes/nodes so that each tuple can be deleted directly from the iterator; I just hope I have covered all cases by implementing remove() in RecordRangeIterator and in TupleTable (which is connected to all the indexes). That was the "easy" part.

The difficult part is the Transaction and Journal, which do not write to the journal until the transaction is about to be committed. This means that many Block objects are kept in memory in the HashMap "BlockMgrJournal.writeBlocks". Trying to fix this by writing to the journal directly runs into another issue in all the unit tests that open multiple transactions: the journal is not replayed onto the database files while any transaction is still open. The reason BlockMgrJournal works in those tests is that the writeBlocks HashMap is never cleared after the transaction, so the other transactions hit that map instead of the backing files.

I also ran into a case during import that left me with a corrupt database I could not recover; I always got an exception from "ObjectFileStorage.read" telling me that I had an "Impossibly large object". Those cases always started with an OutOfMemoryError during import while writing to the database files. After lowering the Node2NodeIdCacheSize and NodeId2NodeCacheSize caches and splitting the import files into smaller batches/transactions, the import went fine. It seems to recover if I just return an empty ByteBuffer instead of throwing the exception, but I suspect that would only cover up a bad state. Perhaps the step where the journal is spooled onto the database files can be optimized to avoid the OutOfMemoryError, and with it the corrupt databases, altogether.

Should I open some issues in Jira? I can provide patches for the iterators' remove() functions. (Rough sketches of the batching workarounds I am using in the meantime are included below, after my signature.)

Sincerely,

Knut-Olav Hoven
NRK, Norwegian Broadcasting Corporation
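In case it helps anyone hitting the same wall before a proper fix lands, here is a rough, untested sketch of the application-level batching I fall back to for deletes. It is not the RecordRangeIterator patch; it only bounds how much is materialised per transaction by deleting in fixed-size batches, one commit per batch. Package names are the current org.apache.jena ones; the store location and BATCH_SIZE are placeholders.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.jena.atlas.iterator.Iter;
import org.apache.jena.graph.Node;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.Quad;
import org.apache.jena.tdb.TDBFactory;

public class BatchedGraphDelete {
    // Number of quads deleted per transaction; tune to the heap you have.
    static final int BATCH_SIZE = 10_000;

    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("/path/to/tdb");   // placeholder location
        Node graphName = Node.ANY;           // or NodeFactory.createURI(...) for a single graph

        boolean more = true;
        while (more) {
            dataset.begin(ReadWrite.WRITE);
            try {
                DatasetGraph dsg = dataset.asDatasetGraph();
                // Materialise only one bounded batch instead of the whole result set,
                // so neither the heap nor the per-transaction journal grows without limit.
                List<Quad> batch = new ArrayList<>(BATCH_SIZE);
                Iterator<Quad> it = dsg.find(graphName, Node.ANY, Node.ANY, Node.ANY);
                while (it.hasNext() && batch.size() < BATCH_SIZE)
                    batch.add(it.next());
                more = it.hasNext();
                Iter.close(it);               // stop reading before we start deleting
                for (Quad q : batch)
                    dsg.delete(q);
                dataset.commit();
            } finally {
                dataset.end();
            }
        }
    }
}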
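And a similar sketch for imports: a StreamRDF sink that commits every N tuples, so the journal's write-block map never has to hold the whole load at once. Again only a sketch, under the assumption that committing and reopening the write transaction between batches is acceptable for the data being loaded; the file name, location and COMMIT_EVERY value are placeholders.

import org.apache.jena.graph.Triple;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.system.StreamRDFBase;
import org.apache.jena.sparql.core.Quad;
import org.apache.jena.tdb.TDBFactory;

/** Streaming load that commits every COMMIT_EVERY tuples so the
 *  per-transaction journal (and the write-block map behind it) stays small. */
public class BatchedImport extends StreamRDFBase {
    static final long COMMIT_EVERY = 100_000;   // tune to the available heap

    private final Dataset dataset;
    private long count = 0;

    BatchedImport(Dataset dataset) {
        this.dataset = dataset;
        dataset.begin(ReadWrite.WRITE);
    }

    @Override public void triple(Triple triple) {
        dataset.asDatasetGraph().getDefaultGraph().add(triple);
        maybeCommit();
    }

    @Override public void quad(Quad quad) {
        dataset.asDatasetGraph().add(quad);
        maybeCommit();
    }

    private void maybeCommit() {
        if (++count % COMMIT_EVERY == 0) {      // close this transaction and start a fresh one
            dataset.commit();
            dataset.end();
            dataset.begin(ReadWrite.WRITE);
        }
    }

    @Override public void finish() {            // called by the parser at end of input
        dataset.commit();
        dataset.end();
    }

    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("/path/to/tdb");   // placeholder location
        RDFDataMgr.parse(new BatchedImport(dataset), "data.nt");      // any file RIOT can parse
    }
}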