Okay, I've made those changes (and improved the help for Urify a bit) at:
https://github.com/ajs6f/stanbol/tree/UrifyImprovements

but because that's a fork of Stanbol's repo, I don't think I can issue a
pull request to y'all for it.

On the main topic, as I understand your explanation, it _should_ be
possible to load a dataset with massive numbers of blank nodes into Jena
without swamping the heap, but it would require that Jena persist its
store of blank nodes to disk while the import is going on, which it
doesn't do and which would be horribly slow. Is that a correct
understanding?

Thanks very much for your help! (A rough sketch of the BNode-to-URN
rewriting idea is appended below the quoted thread.)

---
A. Soroka
The University of Virginia Library

On Jul 24, 2013, at 9:43 AM, Rupert Westenthaler wrote:

> On Wed, Jul 24, 2013 at 2:44 PM, aj...@virginia.edu <aj...@virginia.edu> wrote:
>> Thanks! This is exactly what I needed to hear. I will try out Urify pronto.
>>
>> Is there a convenient place I can add some documentation about these issues
>> with a pointer to Urify? Perhaps in the README for the generic RDF indexer?
>>
>
> I think the README would be a good place to add such information. The
> problem with importing datasets with a lot of Bnodes is not specific to
> Stanbol or Jena; AFAIK all RDF frameworks are affected by it.
>
> best
> Rupert
>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>> On Jul 24, 2013, at 12:24 AM, Rupert Westenthaler wrote:
>>
>>> On Tue, Jul 23, 2013 at 9:24 PM, aj...@virginia.edu <aj...@virginia.edu> wrote:
>>>> Hi, Stanbol folks!
>>>>
>>>> I'm trying to index a largeish (about 260M triples) dataset into a
>>>> Solr-backed EntityHub [1], but not having much success. I'm getting "out
>>>> of heap" errors in the load-to-Jena stage, even with a 4GB heap. The
>>>> process doesn't make it past about 33M triples.
>>>>
>>>> I know that Jena can scale way, way beyond that size, so I'm wondering if
>>>> anyone else has tried datasets of similar size with success? Is it
>>>> possible that there's a memory leak in the generic RDF indexer code?
>>>>
>>>> I've considered trying to break up the dataset, but it's full of blank
>>>> nodes, which makes that trickier, and I'm not at all confident that I
>>>> could successfully merge the resulting Solr indexes to make a coherent
>>>> EntityHub Solr core.
>>>
>>> The blank nodes are the reason for the OOM errors, as Jena needs to
>>> keep all blank nodes in memory while parsing the RDF file. I had a
>>> similar problem when importing Musicbrainz with > 250 million Bnodes.
>>>
>>> Because of that I created a small utility that converts BNodes to
>>> URNs. It is called Urify (org.apache.stanbol.entityhub.indexing.Urify)
>>> and is part of the entityhub.indexing.core module. I have always run
>>> it from Eclipse, but you should also be able to run it with java by
>>> putting one of the Entityhub Indexing Tool runnable jars on the
>>> classpath.
>>>
>>> The other possibility is to increase the heap memory so that all
>>> Bnodes do fit into memory. However, NOTE that the Stanbol Entityhub
>>> does not support Bnodes either.
>>> Therefore those nodes would get ignored - or, if enabled, be converted
>>> to URNs during the indexing step (see STANBOL-765 [1]).
>>>
>>> So my advice would be to use the Urify utility to transcode the RDF
>>> dump before importing the data into Jena TDB.
>>>
>>> best
>>> Rupert
>>>
>>> [1] https://issues.apache.org/jira/browse/STANBOL-765
>>>
>>>>
>>>> I'd be grateful for any advice or suggestions as to other routes to
>>>> take (other than trying to assemble an even larger heap for the
>>>> process, which is not a very good long-term solution). For example, is
>>>> there a supported way to index into a Clerezza-backed EntityHub, which
>>>> would let me tackle the problem of loading into Jena TDB without using
>>>> Stanbol gear?
>>>>
>>>> Thanks!
>>>>
>>>> [1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
>>>>
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>
>>> --
>>> | Rupert Westenthaler rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11 ++43-699-11108907
>>> | A-5500 Bischofshofen
>
> --
> | Rupert Westenthaler rupert.westentha...@gmail.com
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
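
P.S. In case it is useful for the README: below is a rough sketch of the
BNode-to-URN rewriting idea using Jena's streaming RIOT API. To be clear,
this is not the actual Urify code - the class name, the "urn:bnode:"
prefix, and the file handling are just illustrative assumptions - but it
shows the general technique of rewriting blank nodes to URIs in one
streaming pass, without ever building an in-memory model (whether that
fully sidesteps the parser's own blank-node bookkeeping I haven't
verified).

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFLib;
import org.apache.jena.riot.system.StreamRDFWrapper;

/** Illustrative sketch (not Urify itself): rewrite blank nodes as URNs. */
public class BNodesToUrns {

    /**
     * Replace a blank node with a URN built from its parser-assigned label.
     * Labels are stable within a single parse, so the same blank node in
     * the input always maps to the same URN in the output.
     */
    static Node urify(Node n) {
        return n.isBlank()
                ? NodeFactory.createURI("urn:bnode:" + n.getBlankNodeLabel())
                : n;
    }

    public static void main(String[] args) throws Exception {
        String input = args[0];   // RDF dump to convert (gzip is fine)
        String output = args[1];  // rewritten N-Triples, free of blank nodes

        try (OutputStream out = new FileOutputStream(output)) {
            // Writer that emits each triple as soon as it arrives
            StreamRDF writer = StreamRDFLib.writer(out);
            // Wrapper that rewrites blank nodes before passing triples on
            StreamRDF skolemizer = new StreamRDFWrapper(writer) {
                @Override
                public void triple(Triple t) {
                    super.triple(Triple.create(urify(t.getSubject()),
                                               t.getPredicate(),
                                               urify(t.getObject())));
                }
            };
            // Stream-parse the input; triples flow through the wrapper to
            // the writer one at a time
            RDFDataMgr.parse(skolemizer, input);
        }
    }
}

The resulting file can then be loaded into Jena TDB (or handed to the
indexer) as usual, since it contains only URIs and literals.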