Hi

On Wed, Jul 24, 2013 at 6:32 PM, aj...@virginia.edu <aj...@virginia.edu> wrote:
> Okay, I've made those changes (and improved the help for Urify a bit) at:
>
> https://github.com/ajs6f/stanbol/tree/UrifyImprovements
>
> but because that's a fork of Stanbol's repo, I don't think I can issue a pull
> request to y'all for it.
>
Do you know how to create a patch? If so, it would be good if you could
create one and attach it to a Jira issue.

> On the main topic, as I understand your explanation, it _should_ be possible
> to load a dataset with massive numbers of blank nodes into Jena without
> swamping the heap, but it would require that Jena persist its store of blank
> nodes to disk while the import is going on, which it doesn't do and which
> would be horribly slow. Is that a correct understanding?
>

At least this is my understanding. For details I would ask this same
question on the Apache Jena mailing list.

best
Rupert

> Thanks very much for your help!
>
> ---
> A. Soroka
> The University of Virginia Library
>
> On Jul 24, 2013, at 9:43 AM, Rupert Westenthaler wrote:
>
>> On Wed, Jul 24, 2013 at 2:44 PM, aj...@virginia.edu <aj...@virginia.edu>
>> wrote:
>>> Thanks! This is exactly what I needed to hear. I will try out Urify pronto.
>>>
>>> Is there a convenient place I can add some documentation about these issues
>>> with a pointer to Urify? Perhaps in the README for the generic RDF indexer?
>>>
>>
>> I think the README would be a good place to add such information. The
>> problem with importing datasets with a lot of Bnodes is specific to
>> neither Stanbol nor Jena. AFAIK all RDF frameworks are affected by
>> that.
>>
>> best
>> Rupert
>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>> On Jul 24, 2013, at 12:24 AM, Rupert Westenthaler wrote:
>>>
>>>> On Tue, Jul 23, 2013 at 9:24 PM, aj...@virginia.edu <aj...@virginia.edu>
>>>> wrote:
>>>>> Hi, Stanbol folks!
>>>>>
>>>>> I'm trying to index a largeish (about 260M triples) dataset into a
>>>>> Solr-backed EntityHub [1], but not having much success. I'm getting "out
>>>>> of heap" errors in the load-to-Jena stage, even with a 4GB heap. The
>>>>> process doesn't make it past about 33M triples.
>>>>>
>>>>> I know that Jena can scale way, way beyond that size, so I'm wondering if
>>>>> anyone else has tried datasets of similar size with success? Is it
>>>>> possible that there's a memory leak in the generic RDF indexer code?
>>>>>
>>>>> I've considered trying to break up the dataset, but it's full of blank
>>>>> nodes, which makes that trickier, and I'm not at all confident that I
>>>>> could successfully merge the resulting Solr indexes to make a coherent
>>>>> EntityHub Solr core.
>>>>
>>>> The blank nodes are the reason for the OOM errors, as Jena needs to
>>>> keep all blank nodes in memory when parsing the RDF file. I had a
>>>> similar problem when importing MusicBrainz with more than 250 million
>>>> Bnodes.
>>>>
>>>> Because of that I created a small utility that converts BNodes to
>>>> URNs. It is called Urify (org.apache.stanbol.entityhub.indexing.Urify)
>>>> and is part of the entityhub.indexing.core module. I have always run
>>>> it from Eclipse, but you should also be able to run it with java by
>>>> putting one of the Entityhub Indexing Tool runnable jars on the
>>>> classpath.
>>>>
>>>> The other possibility is to increase the heap memory so that all
>>>> Bnodes fit into memory. However, NOTE that the Stanbol Entityhub
>>>> does not support Bnodes either. Therefore those nodes would get
>>>> ignored or, if enabled, be converted to URNs during the indexing
>>>> step (see STANBOL-765 [1]).
>>>>
>>>> So my advice would be to use the Urify utility to transcode the RDF
>>>> dump before importing the data into Jena TDB.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> [1] https://issues.apache.org/jira/browse/STANBOL-765
>>>>
>>>>>
>>>>> I'd be grateful for any advice or suggestions as to other routes to take
>>>>> (other than trying to assemble an even larger heap for the process, which
>>>>> is not a very good long-term solution).
>>>>> For example, is there a supported
>>>>> way to index into a Clerezza-backed EntityHub, which would let me tackle
>>>>> the problem of loading into Jena TDB without using Stanbol gear?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> [1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
>>>>>
>>>>> ---
>>>>> A. Soroka
>>>>> The University of Virginia Library
>>>>
>>>> --
>>>> | Rupert Westenthaler rupert.westentha...@gmail.com
>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>> | A-5500 Bischofshofen

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
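
For anyone documenting this in the README: the idea behind rewriting blank nodes as URNs before the import, as discussed in this thread, can be illustrated roughly as follows. This is not the Urify source, just a minimal Python sketch for N-Triples input. The `urn:bnode:` prefix, the function names and the simplified regular expression are all assumptions made for illustration; the URN scheme Urify actually emits may differ.

```python
import re

# Simplified N-Triples line pattern (an assumption of this sketch, not a
# full parser): subject is an IRI or blank node, predicate is an IRI,
# object is an IRI, a blank node, or a literal starting with a quote.
TRIPLE = re.compile(
    r'^\s*(?P<s><[^>]*>|_:\S+)\s+'
    r'(?P<p><[^>]*>)\s+'
    r'(?P<o><[^>]*>|_:\S+|".*?)\s*\.\s*$'
)


def urify_term(term, prefix):
    """Turn a blank node label such as _:b1 into <prefix + b1>."""
    if term.startswith("_:"):
        return "<" + prefix + term[2:] + ">"
    return term  # IRIs and literals are left untouched


def urify_line(line, prefix="urn:bnode:"):
    """Rewrite blank nodes in one N-Triples line; pass other lines through."""
    match = TRIPLE.match(line)
    if match is None:
        # Comments, blank lines, or anything this sketch cannot parse.
        return line
    subject = urify_term(match.group("s"), prefix)
    obj = urify_term(match.group("o"), prefix)
    return f"{subject} {match.group('p')} {obj} .\n"


def urify_stream(src, dst, prefix="urn:bnode:"):
    """Process line by line, so no table of blank nodes is kept in memory."""
    for line in src:
        dst.write(urify_line(line, prefix))
```

Because each line is handled on its own, memory use does not grow with the number of blank nodes. For a gzipped dump, `src` and `dst` could be opened with `gzip.open(path, "rt", encoding="utf-8")` and `gzip.open(path, "wt", encoding="utf-8")`. Blank node labels are only unique within one file, so a prefix that is distinct per source file would be needed if several dumps end up in the same store.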