Hi Minto,

It would be great if you could check if the problem is still there with the resolution I propose for CLEREZZA-871 (just pushed).
Cheers,
Reto

On Fri, Nov 29, 2013 at 6:15 PM, Andy Seaborne <[email protected]> wrote:
> On 29/11/13 12:31, Minto van der Sluis wrote:
>> Andy Seaborne wrote on 29-11-2013 9:39:
>>> On 28/11/13 13:17, Minto van der Sluis wrote:
>>>> Hi,
>>>>
>>>> I just ran into some peculiar behaviour.
>>>>
>>>> For my current project I have to import 633 files, each containing
>>>> approx. 20 MB of XML data (a total of 13 GB). When importing this data
>>>> into a single graph I hit an out-of-memory exception on the 7th file.
>>>>
>>>> Looking at the heap I noticed that after restarting the application I
>>>> could load a few more files. So I started looking for the bundle that
>>>> consumed all the memory. It happened to be the Clerezza TDB Storage
>>>> Provider. See the following image (GC = garbage collection):
>>>>
>>>> Looking more closely I noticed that Apache Jena is able to close a
>>>> graph (graph.close()), but Clerezza is not using this feature and is
>>>> keeping the graph open all the time.
>>>
>>> Jena graphs backed by TDB are simply views of the dataset - they don't
>>> have any state associated with them directly. If the references become
>>> inaccessible, GC should clean them up.
>>
>> Hi Andy,
>>
>> The problem, as far as I can tell, is not in Jena TDB itself. The Jena
>> TDB bundle is still active/running. Only the Clerezza TDB Provider
>> bundle is stopped (by me). As my image shows, a normal GC does not
>> release all of the memory. Only after stopping the Clerezza TDB
>> Provider is the memory allocated for importing released. Because of
>> stopping this particular bundle, all Jena data structures become
>> inaccessible and eligible for GC, just like the image shows.
>>
>> My reasoning is that, since the Clerezza TDB Provider has a map with
>> weak references to Jena models, these references are never properly
>> garbage collected. Since I use the same graph all the time, all data
>> gets accumulated, resulting in out of memory. Looking at a memory dump,
>> most space is occupied by byte arrays containing the imported data.
>>
>> I use a nasty hack to prevent this dreaded out of memory: after every
>> import I restart the Clerezza TDB Provider bundle programmatically
>> (hail OSGi, for I wouldn't know how to do this without OSGi). This way
>> I have been able to import more than 300 files in a row (still running).
>>
>> Regards,
>>
>> Minto
>
> It does look like something in Clerezza is holding memory. Do note that
> TDB has internal caches, so it will grow as well. Datasets are kept
> around because they are expensive to warm up again, and the node table
> cache is in-heap. Other caches are not in-heap (64-bit mode).
>
> If you want to bulk import, you could load the TDB database directly,
> using the bulk loader. Indeed, it can be worthwhile taking the input,
> creating an N-Quads file with lots of checking and validation of the
> data, then loading the N-Quads. It's annoying to get part way through a
> large load and find the data isn't perfect.
>
>     Andy
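
For reference, a minimal sketch of the pattern Andy describes above: a model obtained from a TDB-backed dataset is only a view, so dropping the reference should make it eligible for GC (Jena TDB 1.x API; the storage location and input file name are just examples):

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.tdb.TDB;
import com.hp.hpl.jena.tdb.TDBFactory;

public class TdbViewSketch {
    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("/tmp/tdb-example");
        // The model is only a view of the TDB dataset; it holds no bulk
        // state of its own.
        Model model = dataset.getNamedModel("http://example.org/graph");
        model.read("file:data.rdf");   // example input file
        TDB.sync(dataset);             // flush pending changes to disk
        model = null;                  // drop the view; GC can reclaim it
        dataset.close();               // release TDB resources when done
    }
}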
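
And a rough sketch of the bundle-restart workaround Minto describes, assuming plain OSGi framework APIs; the provider's bundle symbolic name below is a guess and may differ in your setup:

import org.osgi.framework.Bundle;
import org.osgi.framework.BundleContext;
import org.osgi.framework.BundleException;

public class ProviderRestarter {
    // Assumed symbolic name of the Clerezza TDB storage provider bundle.
    private static final String PROVIDER_BSN = "org.apache.clerezza.rdf.jena.tdb.storage";

    public static void restartProvider(BundleContext context) throws BundleException {
        for (Bundle bundle : context.getBundles()) {
            if (PROVIDER_BSN.equals(bundle.getSymbolicName())) {
                bundle.stop();   // stopping drops the provider's retained graph state
                bundle.start();  // bring the provider back for the next import
                return;
            }
        }
    }
}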
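
Andy's bulk-load suggestion could look roughly like this: parse and validate each input file into a fresh in-memory dataset (assuming the inputs are already RDF, e.g. RDF/XML), append it to a single N-Quads file, and afterwards load that file with the command-line bulk loader (e.g. tdbloader --loc=DB combined.nq). A sketch using Jena RIOT; class and file names are illustrative:

import java.io.FileOutputStream;
import java.io.OutputStream;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.DatasetFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class NQuadsConverter {
    public static void main(String[] args) throws Exception {
        // Append every converted file to one combined N-Quads file.
        try (OutputStream out = new FileOutputStream("combined.nq", true)) {
            for (String file : args) {
                // Parsing into an in-memory dataset validates the syntax of
                // each file before anything touches the TDB database.
                Dataset ds = DatasetFactory.createMem();
                RDFDataMgr.read(ds, file);
                RDFDataMgr.write(out, ds, Lang.NQUADS);
            }
        }
    }
}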
