Hi Ferran!

On Fri, 17/12/2010 at 13:59 +0100, Ferran Jorba wrote:
> Blame XML bloat (again).
>
> For dictionaries and such, that is, a large corpus of data that doesn't
> change much, in other words, that is not transactional, why don't
> you use specialised software?

Well, in principle ontologies are more than dictionaries: they are not a
mere list of terms, but also contain relationships among them (synonyms,
antonyms, hierarchies...). Moreover, they are curated and they evolve over
time, since they usually contain a restricted set of officially recognized
terms that make sense in a certain context.

> Enter http://dict.org, a protocol (http://www.dict.org/rfc2229.txt) and
> a canonical implementation for dictionaries, blazingly fast, veteran and
> well known (see for example http://packages.debian.org/dictd and
> http://packages.debian.org/dict), plus several other implementations
> (http://www.dict.org/w/software/start), among others in Python (even
> curl is also a dict client).
>
> Creating and indexing a dict server of about half a million entries using
> the standard dict.org utilities takes less than a minute, and the
> searches are resolved in milliseconds, for positive, negative or
> approximate answers, for example:
>
> $ time dict -h localhost 00075743be0748c4965848c62c2f5a70
> 1 definition found
>
> From unknown [md5sums]:
>
>   00075743be0748c4965848c62c2f5a70
>     00075743be0748c4965848c62c2f5a70
> /mnt/VOLUM-I/3-12/ddd/veterinaria/revhigsanvet/tif/revhigsanvet_a1915m11t5n8/revhigsanvet_a1915m11t5n8_21.tif
>     00075743be0748c4965848c62c2f5a70
> /mnt/VOLUM-Ib/3-12/ddd/veterinaria/revhigsanvet/tif/revhigsanvet_a1915m11t5n8/revhigsanvet_a1915m11t5n8_21.tif
>
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
>
> $ time dict -h localhost 00075743be0748c4
> No definitions found for "00075743be0748c4"
>
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
>
> $ time dict -h localhost 00075743be0748c4965848c62c2f5a7
> No definitions found for "00075743be0748c4965848c62c2f5a7", perhaps you mean:
> md5sums:  00075743be0748c4965848c62c2f5a70
>
> real    0m0.006s
> user    0m0.000s
> sys     0m0.000s
>
> $ dict -h localhost -I
> dictd 1.10.11/rf on Linux 2.6.26-2-amd64
> On nuix.uab.es: up 21+03:06:13, 813 forks (1.6/hour)
>
> Database     Headwords     Index      Data  Uncompressed
> md5sums         580225     23 MB     32 MB        165 MB
>
>
> $ dict -h dict.org -I
> dictd 1.9.15/rf on Linux 2.6.30-bpo.1-686
> On miranda.org: up 51+17:54:33, 16914217 forks (13619.5/hour)
>
> Database     Headwords     Index      Data  Uncompressed
> gcide           203645   3859 kB     12 MB         38 MB
> wn              154563   3089 kB   8744 kB         26 MB
> moby-thes        30263    528 kB     10 MB         28 MB
> elements           130      2 kB     14 kB         45 kB
> vera              9203    103 kB    160 kB        558 kB
> jargon            2374     42 kB    621 kB       1430 kB
> [...]
>
>
> Part of this speed comes from the fact that the input file for creating
> the dictionary is sorted, so lookups are binary searches on an mmapped
> file.
>
> As the protocol is inherently client-server, the same ontology
> (dictionary) can be (re-)used among different Invenio instances. It is
> not a toy. I haven't been able to put any noticeable load on my
> instance, even when querying it massively. You can follow part of my
> experiments here:
> http://news.gmane.org/gmane.network.protocols.dict.user
>
> Sorry, I had to say it,

Actually, this is a very cool hack :-) Indeed, it might be very useful for
the bibclassify use case, where the goal is (AFAIK) to find the most
represented keywords in a text. One could take the terms of these
ontologies and create dictionaries for dict... I don't know, though,
whether this can be used in semen (Roman?), but that is another topic :-)

Cheers!
        Sam

-- 
Samuele Kaplun
Invenio Developer ** <http://invenio-software.org/>
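
For illustration, here is a rough sketch of the "sorted file plus binary search on an mmapped file" idea quoted above, applied to a list of ontology terms. It is not dictd's code and does not use dictd's index format: the file layout (one lowercased term per line), the function names build_sorted_file and lookup, and the sample terms are all assumptions made up for this example.

```python
import mmap
import os
import tempfile


def build_sorted_file(terms, path):
    """Write unique, lowercased terms one per line, sorted bytewise."""
    lines = sorted({t.strip().lower().encode("utf-8") for t in terms if t.strip()})
    with open(path, "wb") as handle:
        for line in lines:
            handle.write(line + b"\n")


def lookup(mm, term):
    """Return True if term is a line of the sorted, newline-terminated buffer."""
    key = term.strip().lower().encode("utf-8")
    lo, hi = 0, len(mm)  # lo is always a line start; hi is a line start or EOF
    while lo < hi:
        mid = (lo + hi) // 2
        start = mm.rfind(b"\n", 0, mid) + 1  # start of the line containing mid
        end = mm.find(b"\n", start)          # its terminating newline
        line = mm[start:end]
        if line == key:
            return True
        if line < key:
            lo = end + 1   # continue in the lines after this one
        else:
            hi = start     # continue in the lines before this one
    return False


if __name__ == "__main__":
    # Made-up terms standing in for the labels of an ontology.
    terms = ["Higgs boson", "quark", "gluon", "neutrino", "lepton"]
    path = os.path.join(tempfile.mkdtemp(), "terms.txt")
    build_sorted_file(terms, path)
    with open(path, "rb") as handle:
        # mmap refuses empty files, so the term list must not be empty.
        mm = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
        print(lookup(mm, "Higgs boson"), lookup(mm, "photon"))
        mm.close()
```

Each lookup touches only O(log n) lines, and the operating system pages in just the parts of the file that are actually read, which is presumably why positive and negative answers take about the same time in the timings above.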
