On 29/11/13 12:31, Minto van der Sluis wrote:
Andy Seaborne wrote on 29-11-2013 9:39:
On 28/11/13 13:17, Minto van der Sluis wrote:
Hi,
I just ran into some peculiar behavior.
For my current project I have to import 633 files, each
containing approx. 20 MB of XML data (13 GB in total). When
importing this data into a single graph I hit an out-of-memory
exception on the 7th file.
Looking at the heap I noticed that after restarting the
application I could load a few more files. So I started
looking for the bundle that consumed all the memory. It
happened to be the Clerezza TDB Storage provider. See the
following image (GC = garbage collection):

Looking more closely I noticed that Apache Jena is able to
close a graph (graph.close()), but Clerezza is not using this
feature and keeps the graph open all the time.
Jena graphs backed by TDB are simply views of the dataset - they
don't have any state associated with them directly. If the
references become inaccessible, GC should clean them up.
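A minimal sketch of what Andy describes, assuming a hypothetical TDB
location and graph name (the package prefix below is the one current at
the time of this thread; newer Jena releases use org.apache.jena.*):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class TdbViewExample {
        public static void main(String[] args) {
            // The dataset holds the TDB state (indexes, caches); models are views.
            Dataset dataset = TDBFactory.createDataset("/path/to/tdb");   // hypothetical location

            // This model is only a view over the dataset; it carries no state of its own.
            Model model = dataset.getNamedModel("http://example.org/graph");  // hypothetical graph name
            System.out.println("Triples in graph: " + model.size());

            // Dropping the reference (or calling close()) is enough for GC;
            // the TDB state itself stays with the dataset.
            model.close();
            dataset.close();
        }
    }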
Hi Andy,
The problem, as far as I can tell, is not in Jena TDB itself. The
Jena TDB bundle is still active/running. Only the Clerezza TDB
Provider bundle is stopped (by me). As my image shows, a normal
GC does not release all of the memory. Only after stopping the
Clerezza TDB Provider is the memory allocated during importing
released. Stopping this particular bundle makes all Jena data
structures inaccessible and eligible for GC, just like the image
shows.
My reasoning is that, since the Clerezza TDB Provider has a map
with weak references to Jena models, these references are never
properly garbage collected. Since I use the same graph all the
time, all the data accumulates, resulting in the out of memory.
Looking at a memory dump, most of the space is occupied by byte
arrays containing the imported data.
I use a nasty hack to prevent this dreaded out of memory: after
every import I restart the Clerezza TDB Provider bundle
programmatically (hail OSGi, for I wouldn't know how to do this
without OSGi). This way I have been able to import more than 300
files in a row (still running).
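The restart code itself isn't shown; a minimal sketch of how such a
programmatic restart typically looks with the OSGi framework API (the
provider bundle's symbolic name below is an assumption - check the
actual id in your container):

    import org.osgi.framework.Bundle;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.BundleException;

    public class ProviderRestarter {
        // Assumed symbolic name of the Clerezza TDB provider bundle.
        private static final String PROVIDER_BUNDLE = "org.apache.clerezza.rdf.jena.tdb.storage";

        public static void restartProvider(BundleContext context) throws BundleException {
            for (Bundle bundle : context.getBundles()) {
                if (PROVIDER_BUNDLE.equals(bundle.getSymbolicName())) {
                    // Stopping the bundle releases its map of graphs, so the
                    // accumulated Jena structures become eligible for GC.
                    bundle.stop();
                    bundle.start();
                    return;
                }
            }
        }
    }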
Regards,
Minto
It does look like something in Clerezza is holding memory. Do note
that TDB has internal caches, so it will grow as well. Datasets are
kept around because they are expensive to re-warm, and the node
table cache is in-heap. Other caches are not in-heap (64-bit mode).
If you want to bulk import, you could load the TDB database
directly, using the bulk loader. Indeed, it can be worthwhile
taking the input, creating an N-Quads file with lots of checking
and validation of the data, then loading the N-Quads. It's annoying
to get part way through a large load and find the data isn't perfect.
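Assuming the source files are RDF that RIOT can parse, a rough sketch
of that convert-and-check step might look like this (file names and
the output location are made up; the resulting file can then be handed
to the TDB bulk loader):

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.system.StreamRDF;
    import org.apache.jena.riot.system.StreamRDFLib;

    public class ConvertAndCheck {
        public static void main(String[] args) throws Exception {
            // Stream every input file into one N-Triples/N-Quads file;
            // parse errors surface per file instead of part way through a TDB load.
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream("all.nq"))) {
                StreamRDF writer = StreamRDFLib.writer(out);
                writer.start();
                for (String file : args) {          // e.g. the 633 input files
                    RDFDataMgr.parse(writer, file); // throws RiotException on bad data
                }
                writer.finish();
            }
            // The result can then be bulk loaded, e.g. with the tdbloader command-line tool.
        }
    }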
Andy