-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thanks! This is exactly what I needed to hear. I will try out Urify pronto.

Is there a convenient place I can add some documentation about these issues 
with a pointer to Urify? Perhaps in the README for the generic RDF indexer?

- ---
A. Soroka
The University of Virginia Library

On Jul 24, 2013, at 12:24 AM, Rupert Westenthaler wrote:

> On Tue, Jul 23, 2013 at 9:24 PM, aj...@virginia.edu <aj...@virginia.edu> 
> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> 
>> Hi, Stanbol folks!
>> 
>> I'm trying to index a largeish (about 260M triples) dataset into a 
>> Solr-backed EntityHub [1], but not having much success. I'm getting "out of 
>> heap" errors in the load-to-Jena stage, even with a 4GB heap. The process 
>> doesn't make it past about 33M triples.
>> 
>> I know that Jena can scale way, way beyond that size, so I'm wondering if 
>> anyone else has tried datasets of similar size with success? Is it possible 
>> that there's a memory leak in the generic RDF indexer code?
>> 
>> I've considered trying to break up the dataset, but it's full of blank 
>> nodes, which makes that trickier, and I'm not at all confident that I could 
>> successfully merge the resulting Solr indexes to make a coherent EntityHub 
>> Solr core.
> 
> The blank nodes are the reason for the OOM errors, as Jena needs to
> keep all blank nodes in memory when parsing the RDF file. I had a
> similar problem when importing Musicbrainz with > 250 million Bnodes.
> 
> Because of that I created a small utility that converts BNodes to
> URNs. It is called Urify (org.apache.stanbol.entityhub.indexing.Urify)
> and is part of the entityhub.indexing.core module. I have always run
> it from Eclipse, but you should also be able to run it with java by
> putting one of the Entityhub Indexing Tool runnable jars on the
> classpath.
> 
> The other possibility is to increase the heap memory so that all
> Bnodes fit into memory. However, NOTE that the Stanbol Entityhub
> does not support Bnodes either. Therefore those nodes would get
> ignored or, if enabled, be converted to URNs during the indexing
> step (see STANBOL-765 [1]).
> 
> So my advice would be to use the Urify utility to transcode the RDF
> dump before importing the data into Jena TDB.
> 
> best
> Rupert
> 
> [1] https://issues.apache.org/jira/browse/STANBOL-765
> 
>> 
>> I'd be grateful for any advice or suggestions as to other routes to take 
>> (other than trying to assemble an even larger heap for the process, which is 
>> not a very good long-term solution). For example, is there a supported way 
>> to index into a Clerezza-backed EntityHub, which would let me tackle the 
>> problem of loading into Jena TDB without using Stanbol gear?
>> 
>> Thanks!
>> 
>> [1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
>> 
>> - ---
>> A. Soroka
>> The University of Virginia Library
>> 
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
>> Comment: GPGTools - http://gpgtools.org
>> 
>> iQEcBAEBAgAGBQJR7thUAAoJEATpPYSyaoIk93oIAIfS17YRkbGcrDf0x/mlgE3P
>> x/iziR0aT+MXUgeKU0jYE72vp1ixvmypkVnWkZqns5w4rKbd1OothnMHPPTbK6H9
>> EamRaAMylg3vXtdelw4ot9sr0Rd+3kIv63YMUne8VkU2/boXoEB+sDpm+QXlGJmF
>> Fj1Tpq22PIGpi+haYjauYOx2kbOx33OHHZ62IWk5Fa85rTV80M5m/avBnOljnZKS
>> E20HgXK5fjBCTPWyjyr8gl4Ur15eBPD/eetT/7jr+TLMG+SMIB/TdS2kyPNLGa7O
>> w2yiuQeuxHyrVlmHQo6db9gEh2RvrZfhNgcC+EbbCEA6nT502Fa0URKzC+oi50w=
>> =S2QM
>> -----END PGP SIGNATURE-----
> 
> 
> 
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen

-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJR78w5AAoJEATpPYSyaoIkPz0H+gMTW5sYaylcAAuzOTvFCdZm
csP+aqq/w0QwUBe/whSUZGU6Rl55zG+PT1I7ZViVC+tRtIBHyvrQ0t6OqU0Hnb+S
Z8cxx82DDmWDs2euXN0mVVM0/oWkjL6X46TL3bfNqqo5wqbaVeoRZEeFj4T1hnuP
nW1gPwo1Tgi2D4RlBnf1IadFTcTVoWJgiRW50zPnH3mGTgynfDLR3f0+7C8WoZOi
06lX2700oChPS6s46As2ybKkZCIpw6bGKwMqKUtYH+58S38ZXppMRhC9XHZSqPL4
cmlMRAnqHxGAonPgtXrrHXyzhhvRGvKxCAz1H2MyAwueLCZD3KbpWu1f8hb9a0o=
=J3qL
-----END PGP SIGNATURE-----
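
The approach Rupert describes, rewriting blank nodes to URNs before the
dump is loaded, can be sketched as a streaming pass over an N-Triples
file. This is only an illustration of the idea, not the Urify
implementation: the "urn:bnode:" prefix and the function names are
hypothetical, and it assumes blank node labels are plain ASCII
identifiers. Because each line is handled on its own, memory use stays
constant no matter how many blank nodes the dump contains.

```python
import re
import sys

# Hypothetical URN prefix chosen for this sketch; Urify may use another.
URN_PREFIX = "urn:bnode:"

# Match IRIs and literals first so that "_:" inside them is never touched;
# only the third alternative (a blank node label) captures a group.
_TOKEN = re.compile(
    r'<[^>]*>'                     # IRI: left unchanged
    r'|"(?:[^"\\]|\\.)*"'          # literal with escapes: left unchanged
    r'|_:([^\s<>"]*[^\s<>".])'     # blank node label, not ending in "."
)


def urify_line(line, prefix=URN_PREFIX):
    """Return one N-Triples line with blank nodes rewritten as URN IRIs."""
    if line.lstrip().startswith("#"):
        return line  # comment line, nothing to rewrite

    def replace(match):
        label = match.group(1)
        if label is None:
            return match.group(0)  # IRI or literal, keep as is
        return "<" + prefix + label + ">"

    return _TOKEN.sub(replace, line)


def urify(lines, prefix=URN_PREFIX):
    """Lazily rewrite an iterable of N-Triples lines."""
    for line in lines:
        yield urify_line(line, prefix)


if __name__ == "__main__":
    # Usage: zcat dump.nt.gz | python urify_sketch.py > dump-urified.nt
    for out in urify(sys.stdin):
        sys.stdout.write(out)
```

The same label always maps to the same URN, so links between triples
survive the rewrite. If several dumps are converted separately, giving
each its own prefix avoids merging unrelated blank nodes that happen to
share a label.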
