On Tue, Jul 23, 2013 at 9:24 PM, aj...@virginia.edu <aj...@virginia.edu> wrote:
> Hi, Stanbol folks!
>
> I'm trying to index a largeish (about 260M triples) dataset into a
> Solr-backed EntityHub [1], but not having much success. I'm getting "out of
> heap" errors in the load-to-Jena stage, even with a 4GB heap. The process
> doesn't make it past about 33M triples.
>
> I know that Jena can scale way, way beyond that size, so I'm wondering if
> anyone else has tried datasets of similar size with success? Is it possible
> that there's a memory leak in the generic RDF indexer code?
>
> I've considered trying to break up the dataset, but it's full of blank nodes,
> which makes that trickier, and I'm not at all confident that I could
> successfully merge the resulting Solr indexes to make a coherent EntityHub
> Solr core.
The blank nodes are the reason for the OOM errors, as Jena needs to keep all
blank nodes in memory while parsing the RDF file. I had a similar problem when
importing Musicbrainz with > 250 million BNodes. Because of that I created a
small utility that converts BNodes to URNs. It is called Urify
(org.apache.stanbol.entityhub.indexing.Urify) and is part of the
entityhub.indexing.core module. I have always run it from Eclipse, but you
should also be able to run it with java by putting one of the Entityhub
Indexing Tool runnable jars on the classpath.

The other possibility is to increase the heap memory so that all BNodes fit
into memory. However, NOTE that the Stanbol Entityhub itself does not support
BNodes either. Those nodes would therefore be ignored - or, if enabled,
converted to URNs during the indexing step (see STANBOL-765 [1]).

So my advice would be to use the Urify utility to transcode the RDF dump
before importing the data into Jena TDB.

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-765

> I'd be grateful for any advice or suggestions as to other routes to take
> (other than trying to assemble an even larger heap for the process, which is
> not a very good long-term solution). For example, is there a supported way to
> index into a Clerezza-backed EntityHub, which would let me tackle the problem
> of loading into Jena TDB without using Stanbol gear?
>
> Thanks!
>
> [1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
>
> ---
> A. Soroka
> The University of Virginia Library

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
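PS: In case it helps to see what the conversion boils down to: for an
N-Triples dump it is essentially rewriting the "_:label" tokens to URIs
before the data is loaded. Below is a naive, self-contained sketch of that
idea - it is NOT the Urify code, the "urn:bnode:" prefix and the class name
are just placeholders, and it assumes an uncompressed N-Triples file whose
literal values do not contain whitespace-separated "_:" tokens. For the real
import I would still use Urify.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    /**
     * Naive BNode-to-URN rewrite for N-Triples files (illustration only).
     * Replaces "_:label" tokens in subject and object position with
     * "<urn:bnode:label>" so that downstream tools never see blank nodes.
     */
    public class NaiveUrify {
        public static void main(String[] args) throws Exception {
            try (BufferedReader in = Files.newBufferedReader(
                         Paths.get(args[0]), StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(
                         Paths.get(args[1]), StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // rewrite "_:label" at the start of the line or after whitespace
                    out.write(line.replaceAll("(^|\\s)_:(\\S+)", "$1<urn:bnode:$2>"));
                    out.newLine();
                }
            }
        }
    }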