Re: large dataset into EntityHub

Rupert Westenthaler Wed, 24 Jul 2013 21:31:21 -0700

Hi

On Wed, Jul 24, 2013 at 6:32 PM, [email protected] <[email protected]> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Okay, I've made those changes (and improved the help for Urify a bit) at:
>
> https://github.com/ajs6f/stanbol/tree/UrifyImprovements
>
> but because that's a fork of Stanbol's repo, I don't think I can issue a pull 
> request to y'all for it.
>


Do you know how to create a patch? If so, it would be good if you could
create one and attach it to a Jira issue.

> On the main topic, as I understand your explanation, it _should_ be possible 
> to load a dataset with massive numbers of blank nodes into Jena without 
> swamping the heap, but it would require that Jena persist its store of blank 
> nodes to disk while the import is going on, which it doesn't do and which 
> would be horribly slow. Is that a correct understanding?
>

At least that is my understanding. For details, I would ask the same
question on the Apache Jena mailing list.
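
To illustrate why rewriting Bnodes as URNs sidesteps the memory problem,
here is a minimal sketch in Python. It is NOT the actual Urify code: the
function names, the "urn:bnode:" scheme and the namespace parameter are
all assumptions made up for this example, and it only handles line-based
N-Triples input. The point is that each line is rewritten on its own, so
no table of Bnode labels has to be kept.

```python
import re
from typing import Iterable, Iterator

# One alternation per N-Triples token type we care about. IRIs and
# literals are matched first so that "_:" inside them is left untouched.
_TOKEN = re.compile(r'''
    (?P<iri><[^>]*>)
  | (?P<lit>"(?:[^"\\]|\\.)*")
  | (?P<bnode>_:[A-Za-z0-9_](?:[A-Za-z0-9_.\-]*[A-Za-z0-9_\-])?)
''', re.VERBOSE)


def urify_line(line: str, namespace: str = "dataset") -> str:
    """Rewrite every blank node label in one N-Triples line as a URN.

    The URN layout (urn:bnode:<namespace>:<label>) is a hypothetical
    scheme chosen for this sketch, not the one Urify uses.
    """
    if line.lstrip().startswith("#"):
        return line  # comment line, nothing to rewrite

    def swap(match: "re.Match[str]") -> str:
        label = match.group("bnode")
        if label is None:
            return match.group(0)  # IRI or literal: keep as-is
        return f"<urn:bnode:{namespace}:{label[2:]}>"

    return _TOKEN.sub(swap, line)


def urify(lines: Iterable[str], namespace: str = "dataset") -> Iterator[str]:
    """Stream lines through urify_line without buffering the input."""
    for line in lines:
        yield urify_line(line, namespace)
```

Blank node labels are scoped to the file, so the same label always maps
to the same URN and links between statements are preserved. One could
feed the (gzip-decompressed) dump through urify() line by line and write
the result to a new file before importing it.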

best
Rupert

> Thanks very much for your help!
>
> - ---
> A. Soroka
> The University of Virginia Library
>
> On Jul 24, 2013, at 9:43 AM, Rupert Westenthaler wrote:
>
>> On Wed, Jul 24, 2013 at 2:44 PM, [email protected] <[email protected]> 
>> wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> Thanks! This is exactly what I needed to hear. I will try out Urify pronto.
>>>
>>> Is there a convenient place I can add some documentation about these issues 
>>> with a pointer to Urify? Perhaps in the README for the generic RDF indexer?
>>>
>>
>> I think the README would be a good place to add such information. The
>> problem with importing datasets with a lot of Bnodes is not specific
>> to Stanbol or Jena. AFAIK all RDF frameworks are affected by
>> that.
>>
>> best
>> Rupert
>>
>>> - ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>> On Jul 24, 2013, at 12:24 AM, Rupert Westenthaler wrote:
>>>
>>>> On Tue, Jul 23, 2013 at 9:24 PM, [email protected] <[email protected]> 
>>>> wrote:
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>>
>>>>> Hi, Stanbol folks!
>>>>>
>>>>> I'm trying to index a largeish (about 260M triples) dataset into a 
>>>>> Solr-backed EntityHub [1], but not having much success. I'm getting "out 
>>>>> of heap" errors in the load-to-Jena stage, even with a 4GB heap. The 
>>>>> process doesn't make it past about 33M triples.
>>>>>
>>>>> I know that Jena can scale way, way beyond that size, so I'm wondering if 
>>>>> anyone else has tried datasets of similar size with success? Is it 
>>>>> possible that there's a memory leak in the generic RDF indexer code?
>>>>>
>>>>> I've considered trying to break up the dataset, but it's full of blank 
>>>>> nodes, which makes that trickier, and I'm not at all confident that I 
>>>>> could successfully merge the resulting Solr indexes to make a coherent 
>>>>> EntityHub Solr core.
>>>>
>>>> The blank nodes are the reason for the OOM errors, as Jena needs to
>>>> keep all blank nodes in memory when parsing the RDF file. I had a
>>>> similar problem when importing Musicbrainz with > 250 million Bnodes.
>>>>
>>>> Because of that I created a small utility that converts BNodes to
>>>> URNs. It is called Urify (org.apache.stanbol.entityhub.indexing.Urify)
>>>> and is part of the entityhub.indexing.core module. I have always run
>>>> it from Eclipse, but you should also be able to run it with java by
>>>> putting one of the Entityhub Indexing Tool runnable jars on the
>>>> classpath.
>>>>
>>>> The other possibility is to increase the heap memory so that all
>>>> Bnodes fit into memory. However, NOTE that the Stanbol Entityhub
>>>> also does not support Bnodes. Therefore those nodes would get
>>>> ignored - or if enabled - be converted to URNs during the indexing
>>>> step (see STANBOL-765 [1]).
>>>>
>>>> So my advice would be to use the Urify utility to transcode the RDF
>>>> dump before importing the data into Jena TDB.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> [1] https://issues.apache.org/jira/browse/STANBOL-765
>>>>
>>>>>
>>>>> I'd be grateful for any advice or suggestions as to other routes to take 
>>>>> (other than trying to assemble an even larger heap for the process, which 
>>>>> is not a very good long-term solution). For example, is there a supported 
>>>>> way to index into a Clerezza-backed EntityHub, which would let me tackle 
>>>>> the problem of loading into Jena TDB without using Stanbol gear?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> [1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
>>>>>
>>>>> - ---
>>>>> A. Soroka
>>>>> The University of Virginia Library
>>>>>
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
>>>>> Comment: GPGTools - http://gpgtools.org
>>>>>
>>>>> iQEcBAEBAgAGBQJR7thUAAoJEATpPYSyaoIk93oIAIfS17YRkbGcrDf0x/mlgE3P
>>>>> x/iziR0aT+MXUgeKU0jYE72vp1ixvmypkVnWkZqns5w4rKbd1OothnMHPPTbK6H9
>>>>> EamRaAMylg3vXtdelw4ot9sr0Rd+3kIv63YMUne8VkU2/boXoEB+sDpm+QXlGJmF
>>>>> Fj1Tpq22PIGpi+haYjauYOx2kbOx33OHHZ62IWk5Fa85rTV80M5m/avBnOljnZKS
>>>>> E20HgXK5fjBCTPWyjyr8gl4Ur15eBPD/eetT/7jr+TLMG+SMIB/TdS2kyPNLGa7O
>>>>> w2yiuQeuxHyrVlmHQo6db9gEh2RvrZfhNgcC+EbbCEA6nT502Fa0URKzC+oi50w=
>>>>> =S2QM
>>>>> -----END PGP SIGNATURE-----
>>>>
>>>>
>>>>
>>>> --
>>>> | Rupert Westenthaler             [email protected]
>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>> | A-5500 Bischofshofen
>>>
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
>>> Comment: GPGTools - http://gpgtools.org
>>>
>>> iQEcBAEBAgAGBQJR78w5AAoJEATpPYSyaoIkPz0H+gMTW5sYaylcAAuzOTvFCdZm
>>> csP+aqq/w0QwUBe/whSUZGU6Rl55zG+PT1I7ZViVC+tRtIBHyvrQ0t6OqU0Hnb+S
>>> Z8cxx82DDmWDs2euXN0mVVM0/oWkjL6X46TL3bfNqqo5wqbaVeoRZEeFj4T1hnuP
>>> nW1gPwo1Tgi2D4RlBnf1IadFTcTVoWJgiRW50zPnH3mGTgynfDLR3f0+7C8WoZOi
>>> 06lX2700oChPS6s46As2ybKkZCIpw6bGKwMqKUtYH+58S38ZXppMRhC9XHZSqPL4
>>> cmlMRAnqHxGAonPgtXrrHXyzhhvRGvKxCAz1H2MyAwueLCZD3KbpWu1f8hb9a0o=
>>> =J3qL
>>> -----END PGP SIGNATURE-----
>>
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
> Comment: GPGTools - http://gpgtools.org
>
> iQEcBAEBAgAGBQJR8AGoAAoJEATpPYSyaoIk3Q4IAJ+ySsMM6sbaB5Rt9d5Fky8I
> gSIQB2697hZJRYGYLXQl9RLqyC8UxRPCYS1u5RypthySto7GPKA22jwR8hCoCRlF
> xAKbJmxWGpK0hoLyIc21oGhg0mF1Co2dDFPSD0L1z92/+iS6gXyDjYdgoZ3iQKcT
> k5N0d/BmzQTAKXVCLYaBIxXodP4UtBu/XUO32gWg+ghSU8TKbfOCTzGncD5YzGVD
> 5lPZWMfO1JunSPk1ZkJOsB0pWoSFVOKP5yfcfJ2ygT4xH3m3WrI8iwYiH7Iw+tnp
> P07Rs99Mm4/doIx+Jzrcxeob2dOTBIZxIQ5Dh7MZkXoQQs1QtiNOEG3Prqa4Iwo=
> =u9de
> -----END PGP SIGNATURE-----



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
