On Tue, Jul 23, 2013 at 9:24 PM, aj...@virginia.edu <aj...@virginia.edu> wrote:
> Hi, Stanbol folks!
>
> I'm trying to index a largeish (about 260M triples) dataset into a
> Solr-backed EntityHub [1], but not having much success. I'm getting "out of
> heap" errors in the load-to-Jena stage, even with a 4GB heap. The process
> doesn't make it past about 33M triples.
>
> I know that Jena can scale way, way beyond that size, so I'm wondering if
> anyone else has tried datasets of similar size with success? Is it possible
> that there's a memory leak in the generic RDF indexer code?
>
> I've considered trying to break up the dataset, but it's full of blank nodes,
> which makes that trickier, and I'm not at all confident that I could
> successfully merge the resulting Solr indexes to make a coherent EntityHub
> Solr core.
The blank nodes are the reason for the OOM errors, as Jena needs to keep all
blank nodes in memory while parsing the RDF file. I had a similar problem when
importing Musicbrainz with > 250 million BNodes. Because of that I created a
small utility that converts BNodes to URNs. It is called Urify
(org.apache.stanbol.entityhub.indexing.Urify) and is part of the
entityhub.indexing.core module. I have always run it from Eclipse, but you
should also be able to run it with java by putting one of the Entityhub
Indexing Tool runnable jars on the classpath.

The other possibility is to increase the heap memory so that all BNodes fit
into memory. However, NOTE that the Stanbol Entityhub itself does not support
BNodes either. Those nodes would therefore be ignored - or, if enabled,
converted to URNs during the indexing step (see STANBOL-765 [1]).

So my advice would be to use the Urify utility to transcode the RDF dump
before importing the data into Jena TDB.

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-765

> I'd be grateful for any advice or suggestions as to other routes to take
> (other than trying to assemble an even larger heap for the process, which is
> not a very good long-term solution). For example, is there a supported way to
> index into a Clerezza-backed EntityHub, which would let me tackle the problem
> of loading into Jena TDB without using Stanbol gear?
>
> Thanks!
>
> [1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
>
> ---
> A. Soroka
> The University of Virginia Library

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
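PS: In case it helps to see what the conversion boils down to: for an
N-Triples dump it is essentially rewriting the "_:label" tokens to URIs
before the data is loaded. Below is a naive, self-contained sketch of that
idea - it is NOT the Urify code, the "urn:bnode:" prefix and the class name
are just placeholders, and it assumes an uncompressed N-Triples file whose
literal values do not contain whitespace-separated "_:" tokens. For the real
import I would still use Urify.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    /**
     * Naive BNode-to-URN rewrite for N-Triples files (illustration only).
     * Replaces "_:label" tokens in subject and object position with
     * "<urn:bnode:label>" so that downstream tools never see blank nodes.
     */
    public class NaiveUrify {
        public static void main(String[] args) throws Exception {
            try (BufferedReader in = Files.newBufferedReader(
                         Paths.get(args[0]), StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(
                         Paths.get(args[1]), StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // rewrite "_:label" at the start of the line or after whitespace
                    out.write(line.replaceAll("(^|\\s)_:(\\S+)", "$1<urn:bnode:$2>"));
                    out.newLine();
                }
            }
        }
    }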