Hello Samuele, [warning: I may be way off-road]
> I am starting to play a bit with the EuroVoc
>
> <http://eurovoc.europa.eu/>
>
> ontology in order to integrate it into OpenAIRE Orphan Record
> Repository, for automatic keyword extraction for EU documents.
>
> This ontology is *big*! and multilingual. I can't even load it with
> RDFLIB on my laptop (4GB of RAM).
[...]

Blame XML bloat (again).

For dictionaries and the like, that is, a large corpus of data that rarely changes (in other words, data that is not transactional), why don't you use specialised software?

Enter http://dict.org: a protocol (http://www.dict.org/rfc2229.txt) and a canonical implementation for dictionaries, blazingly fast, veteran and well known (see for example http://packages.debian.org/dictd and http://packages.debian.org/dict), plus several other implementations (http://www.dict.org/w/software/start), some of them in Python (even curl is a dict client).

Creating and indexing a dict server of about half a million entries using the standard dict.org utilities takes less than a minute, and searches are resolved in milliseconds, whether the answer is positive, negative or approximate. For example:

    $ time dict -h localhost 00075743be0748c4965848c62c2f5a70
    1 definition found

    From unknown [md5sums]:

      00075743be0748c4965848c62c2f5a70
        00075743be0748c4965848c62c2f5a70 /mnt/VOLUM-I/3-12/ddd/veterinaria/revhigsanvet/tif/revhigsanvet_a1915m11t5n8/revhigsanvet_a1915m11t5n8_21.tif
        00075743be0748c4965848c62c2f5a70 /mnt/VOLUM-Ib/3-12/ddd/veterinaria/revhigsanvet/tif/revhigsanvet_a1915m11t5n8/revhigsanvet_a1915m11t5n8_21.tif

    real    0m0.004s
    user    0m0.000s
    sys     0m0.000s

    $ time dict -h localhost 00075743be0748c4
    No definitions found for "00075743be0748c4"

    real    0m0.004s
    user    0m0.000s
    sys     0m0.000s

    $ time dict -h localhost 00075743be0748c4965848c62c2f5a7
    No definitions found for "00075743be0748c4965848c62c2f5a7", perhaps you mean:
    md5sums:  00075743be0748c4965848c62c2f5a70

    real    0m0.006s
    user    0m0.000s
    sys     0m0.000s

    $ dict -h localhost -I
    dictd 1.10.11/rf on Linux 2.6.26-2-amd64
    On nuix.uab.es: up 21+03:06:13, 813 forks (1.6/hour)

    Database      Headwords         Index          Data  Uncompressed
    md5sums          580225         23 MB         32 MB        165 MB

    $ dict -h dict.org -I
    dictd 1.9.15/rf on Linux 2.6.30-bpo.1-686
    On miranda.org: up 51+17:54:33, 16914217 forks (13619.5/hour)

    Database      Headwords         Index          Data  Uncompressed
    gcide            203645       3859 kB         12 MB         38 MB
    wn               154563       3089 kB       8744 kB         26 MB
    moby-thes         30263        528 kB         10 MB         28 MB
    elements            130          2 kB         14 kB         45 kB
    vera               9203        103 kB        160 kB        558 kB
    jargon             2374         42 kB        621 kB       1430 kB
    [...]

Part of this speed comes from the fact that the input file used to build the dictionary is sorted, so the server can do binary searches on an mmapped file.

As the protocol is inherently client-server, the same ontology (dictionary) can be (re-)used among different Invenio instances.

It is not a toy. I haven't been able to produce any noticeable resource usage on my instance, even when querying it massively.

You can follow part of my experiments here: http://news.gmane.org/gmane.network.protocols.dict.user

Sorry, I had to say it,

Ferran
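
P.S. To illustrate the "sorted input plus binary search on an mmapped file" idea, here is a minimal sketch in Python. It is not dictd's code and does not use dictd's real index format: the one-line-per-entry "headword<TAB>value" layout and the names write_sorted_index and lookup are assumptions made up for this example. It only shows why a lookup touches a handful of pages instead of loading the whole file.

```python
import mmap
import os
import tempfile


def write_sorted_index(path, entries):
    """Write 'headword<TAB>value' lines sorted by headword bytes.

    Hypothetical format for illustration only (not dictd's index format).
    Headwords and values must not contain tabs or newlines.
    """
    with open(path, "wb") as f:
        for key in sorted(entries, key=lambda k: k.encode("utf-8")):
            f.write(key.encode("utf-8") + b"\t" + entries[key].encode("utf-8") + b"\n")


def lookup(mm, key):
    """Binary-search the mmapped, sorted file; return the value or None."""
    target = key.encode("utf-8")
    lo, hi = 0, len(mm)  # byte offsets; lo is always the start of a line
    while lo < hi:
        mid = (lo + hi) // 2
        # Find the start of the line that contains byte offset mid.
        nl = mm.rfind(b"\n", lo, mid)
        start = lo if nl == -1 else nl + 1
        end = mm.find(b"\n", start)
        if end == -1:
            end = len(mm)
        headword, _, value = mm[start:end].partition(b"\t")
        if headword == target:
            return value.decode("utf-8")
        if headword < target:
            lo = end + 1  # continue in the lines after this one
        else:
            hi = start    # continue in the lines before this one
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.index")
        write_sorted_index(path, {"gamma": "third", "alpha": "first", "beta": "second"})
        with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            print(lookup(mm, "beta"))
            print(lookup(mm, "delta"))
```

Each lookup needs about log2(N) line reads, so roughly 20 for half a million entries, and the kernel only pages in the parts of the file that are actually touched.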
