On Thu, Aug 15, 2013 at 8:47 AM, Andy Seaborne <a...@apache.org> wrote:
>
> On 15/08/13 10:21, Knut-Olav Hoven wrote:
>>
>> Hi!
>
> Hi there - thanks for the detailed report.
>
>> Two issues, related to memory usage: import and delete of large graphs.
>>
>> I am currently doing some tests with a 128MB heap and a little over 1M
>> tuples. I know I can throw a lot of memory at the problem, but sooner or
>> later I will run out.
>
> There are some fixed-size caches (as you've discovered) - 128M is likely
> to be too small for them.
>
>> I've noticed that TDB takes the complete result set into memory when
>> calling "DatasetGraphTDB.deleteAny" before looping over all of them to
>> delete them. This is a problem for very large graphs if I try to delete
>> the entire graph or a large selection.
>
> There is supposed to be a specific implementation of deleteAny which is
> like GraphTDB.removeWorker. But there isn't. Actually, I don't see why
> GraphTDB.removeWorker needs to exist if a proper DatasetGraphTDB.deleteAny
> existed.
>
> Recorded as JENA-513.
>
> I'll sort this out by moving GraphTDB.removeWorker to DatasetGraphTDB and
> using it for deleteAny(...) and from GraphTDB.remove.
>
> The GraphTDB.removeWorker code gets batches of 1000 items, deletes them,
> and tries again until there is nothing more matching the delete pattern.
> Deletes are not done by iterator.

So as an alternative, you can use SPARQL Update combined with setting the
ARQ.spillToDiskThreshold parameter to a reasonable value (10,000 maybe?).
This enables stream-to-disk handling of the intermediate bindings for
DELETE/INSERT/WHERE requests (as well as for several of the SPARQL
operators in the WHERE clause, see JENA-119). This should mostly eliminate
the memory limits, except for TDB's BlockMgrJournal.
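For example (just a sketch - the dataset location and graph name are made
up, and the imports assume the current Jena 2.x package layout):

  import com.hp.hpl.jena.query.ARQ;
  import com.hp.hpl.jena.query.Dataset;
  import com.hp.hpl.jena.query.ReadWrite;
  import com.hp.hpl.jena.tdb.TDBFactory;
  import com.hp.hpl.jena.update.UpdateAction;
  import com.hp.hpl.jena.update.UpdateFactory;
  import com.hp.hpl.jena.update.UpdateRequest;

  public class SpillingDeleteExample {
      public static void main(String[] args) {
          // Spill intermediate bindings to disk once more than 10,000
          // rows have accumulated.
          ARQ.getContext().set(ARQ.spillToDiskThreshold, 10000L);

          Dataset dataset = TDBFactory.createDataset("/path/to/tdb"); // illustrative
          dataset.begin(ReadWrite.WRITE);
          try {
              // Clear one (large) graph via DELETE/WHERE rather than deleteAny().
              UpdateRequest request = UpdateFactory.create(
                  "DELETE { GRAPH <http://example/g> { ?s ?p ?o } } " +
                  "WHERE  { GRAPH <http://example/g> { ?s ?p ?o } }");
              UpdateAction.execute(request, dataset);
              dataset.commit();
          } finally {
              dataset.end();
          }
      }
  }

The threshold can also be set on an individual dataset's context instead of
the global ARQ context if you don't want it to apply everywhere.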
> That said, having iterator remove() code for RecordRangeIterator and in
> TupleTable would be excellent regardless of this. When I went looking for
> BTree code originally, I found various possibilities, but all were too
> closely tied to their usage to be reusable. We could pull the B+Tree code
> out into a reusable module.
>
> There are some RecordRangeIterator cases that will not work with
> Iterator.remove() ... for example, when the B+Tree is not on the same
> machine as the TupleIndex client.
>
>> I figured out a way to make the iterators backed by indexes/nodes and can
>> now delete each directly from the iterator. Just hope I have covered all
>> cases by implementing remove() in RecordRangeIterator and in TupleTable
>> (connected to all indexes). This was the "easy" part.
>>
>> The difficult part is the Transaction and Journal, which don't write to
>> the journal until the transaction is just about to be committed. This
>> means that many Block objects are kept in memory in the HashMap
>> "BlockMgrJournal.writeBlocks".
>
> Yes - this is a limitation of the current transaction system. The blocks
> may still be accessed, so they can't be written to the journal and
> forgotten. There could be a cache that knows where the block is in the
> journal and fetches it back (minor, but then the journal is jumbled; if it
> is in numerical block order, the writes for flushing back to disk are
> likely more efficient).
>
> My very long term approach would be to use immutable B+Trees where the
> blocks up the tree to the root are copied when a block first changes. This
> means that transactional data is written once, during the write
> transaction. Commit means switching to the new root for all subsequent
> transactions. Old trees remain. The hard part is that the tree needs to be
> garbage collected. Typically, this is done by a background task writing a
> new copy. cf. CouchDB, BDB-JE (?) and Mulgara (not B+Trees, but the same
> approach), amongst others.
>
> This is a not insignificant rewrite of the B+Tree and BlockMgr code.
>
> If there were a spill cache for BlockMgrJournal, that would be a great
> thing to have. It's a much more direct way to get scalable transactions
> and it works without a DB format change.

Agreed. Unfortunately the *DataBag classes require all data to be written
before any reading occurs, which makes them inappropriate here. Can't we
just use another disk-backed B+Tree as a temporary store instead of the
in-memory HashMap? I've actually been running into this issue because, now
that streaming SPARQL Update support is available, I find I am generating
and streaming so much data in a single transaction that I need to devote a
not-insignificant amount of heap just to storing the pending blocks.
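To make that concrete, even something much simpler than a B+Tree would do
as a first cut: an overflow map that keeps the first N pending blocks on
the heap and appends the rest to a temp file keyed by block id. Purely
illustrative - none of this is TDB code and all of the names are invented:

  import java.io.IOException;
  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;
  import java.nio.file.Path;
  import java.nio.file.StandardOpenOption;
  import java.util.HashMap;
  import java.util.Map;

  /** Hypothetical spill cache for pending blocks: heap up to a threshold, then a temp file. */
  public class SpillingBlockMap {
      private final int memoryThreshold;
      private final Map<Long, ByteBuffer> inMemory = new HashMap<>();
      private final Map<Long, Long> spillOffsets = new HashMap<>();   // block id -> file offset
      private final Map<Long, Integer> spillSizes = new HashMap<>();  // block id -> length
      private final FileChannel spillFile;
      private long nextOffset = 0;

      public SpillingBlockMap(int memoryThreshold, Path spillPath) throws IOException {
          this.memoryThreshold = memoryThreshold;
          this.spillFile = FileChannel.open(spillPath,
                  StandardOpenOption.CREATE, StandardOpenOption.READ,
                  StandardOpenOption.WRITE, StandardOpenOption.DELETE_ON_CLOSE);
      }

      /** Record a pending (dirty) block. */
      public void put(long blockId, ByteBuffer data) throws IOException {
          if (inMemory.size() < memoryThreshold || inMemory.containsKey(blockId)) {
              inMemory.put(blockId, data);
              return;
          }
          // Over the threshold: append to the spill file and remember where it went
          // (a single write call is assumed to be enough for a block-sized buffer).
          int size = data.remaining();
          spillFile.write(data.duplicate(), nextOffset);
          spillOffsets.put(blockId, nextOffset);
          spillSizes.put(blockId, size);
          nextOffset += size;
      }

      /** Fetch a pending block, reading it back from the spill file if needed. */
      public ByteBuffer get(long blockId) throws IOException {
          ByteBuffer b = inMemory.get(blockId);
          if (b != null) return b;
          Long offset = spillOffsets.get(blockId);
          if (offset == null) return null; // not a pending block
          ByteBuffer out = ByteBuffer.allocate(spillSizes.get(blockId));
          spillFile.read(out, offset);
          out.flip();
          return out;
      }
  }

A disk-backed B+Tree keyed by block id would look the same to
BlockMgrJournal; the flat file plus offset map above just avoids dragging
in an index structure for what is essentially an overflow area.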
>> Trying to fix this by just writing to the journal directly results in
>> another issue in all those unit tests that open multiple transactions.
>> The problem is that the journal is not replayed onto the database files
>> if there are any transactions open. The reason BlockMgrJournal works in
>> those tests is that the writeBlocks HashMap is never cleared after a
>> transaction (and the other transactions hit that one instead of the
>> backing files).
>>
>> I also encountered a case during import that led to a corrupt database
>> that I could not recover. I always got an exception from
>> "ObjectFileStorage.read" telling me that I had an "Impossibly large
>> object".
>>
>> Those cases always started with an OutOfMemoryError during import while
>> writing to the database files. By lowering the Node2NodeIdCacheSize and
>> NodeId2NodeCacheSize caches and splitting the import files into smaller
>> batches/transactions it went fine. It seems to recover if I just return
>> an empty ByteBuffer instead of throwing the exception, but that would
>> just cover up a bad state, I guess. Maybe some optimization can be done
>> to the part where the journal is spooled onto the database files, to
>> avoid the OutOfMemoryError issue altogether and so avoid corrupt
>> databases.
>
> Sorry - if "Impossibly large object" happens, the database is
> unrecoverable. The problem happened at write time - it's just detected at
> read time.
>
>> Should I open some issues in Jira?
>
> Please do.
>
>> I can provide some patches for the iterator remove() functions.
>
> Awesome.
>
>> Sincerely,
>>
>> Knut-Olav Hoven
>> NRK, Norwegian Broadcaster Corporation
>
> Andy
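One more note on the import side: splitting the load into several smaller
write transactions, as Knut-Olav did, is also what I'd do for now to keep
the journal and the pending writeBlocks map bounded. Roughly like this (a
sketch only - the file paths are made up and it assumes the input has
already been split into separately loadable chunks):

  import com.hp.hpl.jena.query.Dataset;
  import com.hp.hpl.jena.query.ReadWrite;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.tdb.TDBFactory;
  import org.apache.jena.riot.RDFDataMgr;

  public class BatchedImport {
      public static void main(String[] args) {
          // One write transaction per chunk keeps the amount of pending
          // journal data per commit small.
          String[] chunks = { "/data/chunk-01.nt", "/data/chunk-02.nt" }; // illustrative
          Dataset dataset = TDBFactory.createDataset("/path/to/tdb");     // illustrative

          for (String chunk : chunks) {
              dataset.begin(ReadWrite.WRITE);
              try {
                  Model model = dataset.getDefaultModel();
                  RDFDataMgr.read(model, chunk);  // parse and add this chunk only
                  dataset.commit();
              } finally {
                  dataset.end();
              }
          }
      }
  }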