I'll send a PR with a little info in the README about how and when to use Urify.
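Something like this is what I have in mind for the README (the exact
arguments of the Urify main class are a guess on my part -- I'll check the
entityhub.indexing.core source before writing anything up):

    # hypothetical invocation; verify the arguments against the Urify source
    java -cp <entityhub-indexing-tool-runnable-jar> \
         org.apache.stanbol.entityhub.indexing.Urify <rdf-dump.nt>

plus a sentence or two on why BNode-heavy dumps blow the heap in the
load-to-Jena stage, and a pointer to STANBOL-765.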
---
A. Soroka
The University of Virginia Library

On Jul 24, 2013, at 9:43 AM, Rupert Westenthaler wrote:

> On Wed, Jul 24, 2013 at 2:44 PM, aj...@virginia.edu <aj...@virginia.edu> wrote:
>> Thanks! This is exactly what I needed to hear. I will try out Urify pronto.
>>
>> Is there a convenient place I can add some documentation about these issues,
>> with a pointer to Urify? Perhaps in the README for the generic RDF indexer?
>>
>
> I think the README would be a good place to add such information. The
> problem with importing datasets with a lot of BNodes is not specific to
> Stanbol or Jena. AFAIK all RDF frameworks are affected by it.
>
> best
> Rupert
>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>> On Jul 24, 2013, at 12:24 AM, Rupert Westenthaler wrote:
>>
>>> On Tue, Jul 23, 2013 at 9:24 PM, aj...@virginia.edu <aj...@virginia.edu> wrote:
>>>> Hi, Stanbol folks!
>>>>
>>>> I'm trying to index a largeish (about 260M triples) dataset into a
>>>> Solr-backed EntityHub [1], but I'm not having much success. I'm getting
>>>> "out of heap" errors in the load-to-Jena stage, even with a 4GB heap.
>>>> The process doesn't make it past about 33M triples.
>>>>
>>>> I know that Jena can scale way, way beyond that size, so I'm wondering
>>>> whether anyone else has tried datasets of similar size with success. Is
>>>> it possible that there's a memory leak in the generic RDF indexer code?
>>>>
>>>> I've considered trying to break up the dataset, but it's full of blank
>>>> nodes, which makes that trickier, and I'm not at all confident that I
>>>> could successfully merge the resulting Solr indexes to make a coherent
>>>> EntityHub Solr core.
>>>
>>> The blank nodes are the reason for the OOM errors, as Jena needs to
>>> keep all blank nodes in memory while parsing the RDF file. I had a
>>> similar problem when importing MusicBrainz, with more than 250 million
>>> BNodes.
>>>
>>> Because of that I created a small utility that converts BNodes to
>>> URNs. It is called Urify (org.apache.stanbol.entityhub.indexing.Urify)
>>> and is part of the entityhub.indexing.core module. I have always run
>>> it from Eclipse, but you should also be able to run it with java by
>>> putting one of the Entityhub Indexing Tool runnable jars on the
>>> classpath.
>>>
>>> The other possibility is to increase the heap memory so that all
>>> BNodes fit into memory. However, NOTE that the Stanbol Entityhub does
>>> not support BNodes either: those nodes would either be ignored or, if
>>> enabled, be converted to URNs during the indexing step (see
>>> STANBOL-765 [1]).
>>>
>>> So my advice would be to use the Urify utility to transcode the RDF
>>> dump before importing the data into Jena TDB.
>>>
>>> best
>>> Rupert
>>>
>>> [1] https://issues.apache.org/jira/browse/STANBOL-765
>>>
>>>> I'd be grateful for any advice or suggestions as to other routes to
>>>> take (other than trying to assemble an even larger heap for the
>>>> process, which is not a very good long-term solution). For example, is
>>>> there a supported way to index into a Clerezza-backed EntityHub, which
>>>> would let me tackle the problem of loading into Jena TDB without using
>>>> Stanbol gear?
>>>>
>>>> Thanks!
>>>>
>>>> [1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
>>>>
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>
>>> --
>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11             ++43-699-11108907
>>> | A-5500 Bischofshofen
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
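P.S. For completeness, I'll also mention the brute-force alternative in the
README -- just giving the load-to-Jena stage a bigger heap, e.g.

    # only postpones the problem for BNode-heavy dumps
    java -Xmx8g -jar <entityhub-indexing-tool-runnable-jar> ...

-- while noting that the Entityhub would ignore (or, per STANBOL-765,
convert to URNs) the BNodes anyway, so Urify is the better route.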