Hello Samuele,

[warning: I may be way off-road]

> I am starting to play a bit with the EuroVoc 
>
> <http://eurovoc.europa.eu/>
>
> ontology in order to integrate it into OpenAIRE Orphan Record
> Repository, for automatic keyword extraction for EU documents.
>
> This ontology is *big*! and multilingual. I can't even load it with
> RDFLIB on my laptop (4GB of RAM).
[...]

Blame XML bloat (again).

For dictionaries and the like, that is, a large corpus of data that
rarely changes (in other words, one that is not transactional), why
don't you use specialised software?

Enter http://dict.org: a protocol (http://www.dict.org/rfc2229.txt) and
a canonical implementation for dictionaries that is blazingly fast,
mature and well known (see for example http://packages.debian.org/dictd
and http://packages.debian.org/dict), plus several other implementations
(http://www.dict.org/w/software/start), some of them in Python (even
curl is a dict client).
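
To give an idea of how little is needed on the client side, here is a
minimal sketch of a DICT client in Python, using only the standard
library.  It is my own illustration, not one of the implementations
listed above: the function names are made up, it assumes a server
listening on the standard port 2628, it only handles the DEFINE command,
and it does not escape double quotes inside the word.

```python
import socket


def parse_definitions(lines):
    """Collect the definition texts from the reply lines of a DEFINE command."""
    definitions, current = [], None
    for line in lines:
        if current is None:
            if line.startswith("151 "):        # header of one definition
                current = []
            elif line[:3] in ("250", "552"):   # 250 = done, 552 = no match
                break
        elif line == ".":                      # a lone dot ends the text
            definitions.append("\n".join(current))
            current = None
        else:
            # RFC 2229 doubles a leading dot inside the text; undo that
            current.append(line[1:] if line.startswith("..") else line)
    return definitions


def define(word, host="localhost", port=2628, database="*"):
    """Ask a DICT server for `word`; return a list of definition strings."""
    with socket.create_connection((host, port), timeout=10) as sock:
        reader = sock.makefile("r", encoding="utf-8", newline="")
        reader.readline()                      # skip the 220 banner
        # Send DEFINE and QUIT together; the server closes after QUIT,
        # so we can simply read until end of file.
        request = f'DEFINE {database} "{word}"\r\nQUIT\r\n'
        sock.sendall(request.encode("utf-8"))
        lines = [line.rstrip("\r\n") for line in reader]
    return parse_definitions(lines)
```

With a dictd running locally, define("00075743be0748c4965848c62c2f5a70")
would return the list of matching entries, or an empty list when nothing
matches.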

Creating and indexing a dict server of about half a million entries using
the standard dict.org utilities takes less than a minute, and searches
are resolved in milliseconds, whether the answer is positive, negative or
approximate.  For example:

 $ time dict -h localhost 00075743be0748c4965848c62c2f5a70
 1 definition found

 From unknown [md5sums]:

  00075743be0748c4965848c62c2f5a70
     00075743be0748c4965848c62c2f5a70  /mnt/VOLUM-I/3-12/ddd/veterinaria/revhigsanvet/tif/revhigsanvet_a1915m11t5n8/revhigsanvet_a1915m11t5n8_21.tif
     00075743be0748c4965848c62c2f5a70  /mnt/VOLUM-Ib/3-12/ddd/veterinaria/revhigsanvet/tif/revhigsanvet_a1915m11t5n8/revhigsanvet_a1915m11t5n8_21.tif

 real   0m0.004s
 user   0m0.000s
 sys    0m0.000s

 $ time dict -h localhost 00075743be0748c4
 No definitions found for "00075743be0748c4"

 real   0m0.004s
 user   0m0.000s
 sys    0m0.000s

 $ time dict -h localhost 00075743be0748c4965848c62c2f5a7
 No definitions found for "00075743be0748c4965848c62c2f5a7", perhaps you mean:
 md5sums:  00075743be0748c4965848c62c2f5a70

 real   0m0.006s
 user   0m0.000s
 sys    0m0.000s

 $ dict -h localhost -I
  dictd 1.10.11/rf on Linux 2.6.26-2-amd64
  On nuix.uab.es: up 21+03:06:13, 813 forks (1.6/hour)
  
  Database      Headwords         Index     Data  Uncompressed
  md5sums      580225              23 MB    32 MB        165 MB


 $ dict -h dict.org -I
  dictd 1.9.15/rf on Linux 2.6.30-bpo.1-686
  On miranda.org: up 51+17:54:33, 16914217 forks (13619.5/hour)
  
  Database      Headwords         Index          Data  Uncompressed
  gcide          203645       3859 kB         12 MB         38 MB
  wn             154563       3089 kB       8744 kB         26 MB
  moby-thes       30263        528 kB         10 MB         28 MB
  elements          130          2 kB         14 kB         45 kB
  vera             9203        103 kB        160 kB        558 kB
  jargon           2374         42 kB        621 kB       1430 kB
  [...]


Part of the reason for this speed is that the input file used to create
the dictionary is sorted, so the server can do binary searches on an
mmapped file.
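
The idea can be sketched in a few lines of Python.  This is only an
illustration of binary search over an mmapped, sorted text file; it is
not dictd's code and does not use dictd's real index format.  I assume a
hypothetical file of "key<TAB>value" lines, sorted by raw byte order (for
example with LC_ALL=C sort), and the function name is made up.

```python
import mmap


def lookup(path, key):
    """Binary-search a sorted file of 'key<TAB>value' lines.

    Returns the value as a string, or None if the key is absent.
    The file must be non-empty (mmap refuses empty files).
    """
    target = key.encode("utf-8")
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm)      # lo and hi always sit on line boundaries
        while lo < hi:
            mid = (lo + hi) // 2
            # Back up from mid to the start of the line that contains it
            newline = mm.rfind(b"\n", lo, mid)
            start = lo if newline < 0 else newline + 1
            end = mm.find(b"\n", start)
            if end < 0:          # last line without a trailing newline
                end = len(mm)
            found, _, value = mm[start:end].partition(b"\t")
            if found == target:
                return value.decode("utf-8")
            if found < target:
                lo = end + 1     # continue in the lines after this one
            else:
                hi = start       # continue in the lines before this one
    return None
```

Each lookup touches only a handful of pages of the file, and the kernel
keeps the hot ones cached, so nothing has to be parsed or loaded into
memory up front.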

As the protocol is inherently client-server, the same ontology
(dictionary) can be (re-)used by different Invenio instances.  It is
not a toy: I haven't been able to produce any noticeable resource usage
on my instance, even when querying it massively.  You can follow part of
my experiments here:
http://news.gmane.org/gmane.network.protocols.dict.user

Sorry, I had to say it,

Ferran
