On 29/11/13 12:31, Minto van der Sluis wrote:
Andy Seaborne wrote on 29-11-2013 9:39:
On 28/11/13 13:17, Minto van der Sluis wrote:
Hi,
I just ran into some peculiar behavior.
For my current project I have to import 633 files, each
containing approx. 20 MB of XML data (13 GB in total). When
importing this data into a single graph I hit an out-of-memory
exception on the 7th file.
Looking at the heap I noticed that after restarting the
application I could load a few more files. So I started
looking for the bundle that consumed all the memory. It
happened to be the Clerezza TDB Storage provider. See the
following image (GC = garbage collection):

Looking more closely I noticed that Apache Jena is able to
close a graph (graph.close()), but Clerezza is not using this
feature and keeps the graph open all the time.
Jena graphs backed by TDB are simply views of the dataset - they
don't have any state associated with them directly. If the
references become inaccessible, GC should clean them up.
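A minimal sketch of what Andy describes, assuming a hypothetical TDB
location and graph name (the package prefix below is the one current at
the time of this thread; newer Jena releases use org.apache.jena.*):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class TdbViewExample {
        public static void main(String[] args) {
            // The dataset holds the TDB state (indexes, caches); models are views.
            Dataset dataset = TDBFactory.createDataset("/path/to/tdb");   // hypothetical location

            // This model is only a view over the dataset; it carries no state of its own.
            Model model = dataset.getNamedModel("http://example.org/graph");  // hypothetical graph name
            System.out.println("Triples in graph: " + model.size());

            // Dropping the reference (or calling close()) is enough for GC;
            // the TDB state itself stays with the dataset.
            model.close();
            dataset.close();
        }
    }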
Hi Andy,
The problem, as far as I can tell, is not in Jena TDB itself. The
Jena TDB bundle is still active/running. Only the Clerezza TDB
Provider bundle is stopped (by me). As my image shows, a normal
GC does not release all of the memory. Only after stopping the
Clerezza TDB Provider is the memory allocated during importing
released. Stopping this particular bundle makes all Jena data
structures inaccessible and eligible for GC, just like the image
shows.
My reasoning is that, since the Clerezza TDB Provider has a map
with weak references to Jena models, these references are never
properly garbage collected. Since I use the same graph all the
time, all the data accumulates, resulting in the out of memory.
Looking at a memory dump, most of the space is occupied by byte
arrays containing the imported data.
I use a nasty hack to prevent this dreaded out of memory: after
every import I restart the Clerezza TDB Provider bundle
programmatically (hail OSGi, for I wouldn't know how to do this
without OSGi). This way I have been able to import more than 300
files in a row (still running).
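The restart code itself isn't shown; a minimal sketch of how such a
programmatic restart typically looks with the OSGi framework API (the
provider bundle's symbolic name below is an assumption - check the
actual id in your container):

    import org.osgi.framework.Bundle;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.BundleException;

    public class ProviderRestarter {
        // Assumed symbolic name of the Clerezza TDB provider bundle.
        private static final String PROVIDER_BUNDLE = "org.apache.clerezza.rdf.jena.tdb.storage";

        public static void restartProvider(BundleContext context) throws BundleException {
            for (Bundle bundle : context.getBundles()) {
                if (PROVIDER_BUNDLE.equals(bundle.getSymbolicName())) {
                    // Stopping the bundle releases its map of graphs, so the
                    // accumulated Jena structures become eligible for GC.
                    bundle.stop();
                    bundle.start();
                    return;
                }
            }
        }
    }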
Regards,
Minto
It does look like something in Clerezza is holding memory. Do note
that TDB has internal caches, so it will grow as well. Datasets are
kept around because they are expensive to re-warm, and the node
table cache is in-heap. Other caches are not in-heap (64-bit mode).
If you want to bulk import, you could load the TDB database
directly, using the bulk loader. Indeed, it can be worthwhile
taking the input, creating an N-Quads file with lots of checking
and validation of the data, then loading the N-Quads. It's annoying
to get part way through a large load and find the data isn't perfect.
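Assuming the source files are RDF that RIOT can parse, a rough sketch
of that convert-and-check step might look like this (file names and
the output location are made up; the resulting file can then be handed
to the TDB bulk loader):

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.system.StreamRDF;
    import org.apache.jena.riot.system.StreamRDFLib;

    public class ConvertAndCheck {
        public static void main(String[] args) throws Exception {
            // Stream every input file into one N-Triples/N-Quads file;
            // parse errors surface per file instead of part way through a TDB load.
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream("all.nq"))) {
                StreamRDF writer = StreamRDFLib.writer(out);
                writer.start();
                for (String file : args) {          // e.g. the 633 input files
                    RDFDataMgr.parse(writer, file); // throws RiotException on bad data
                }
                writer.finish();
            }
            // The result can then be bulk loaded, e.g. with the tdbloader command-line tool.
        }
    }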
Andy