Hi Ferran!

On Fri, 17/12/2010 at 13:59 +0100, Ferran Jorba wrote:
> Blame XML bloat (again).
> 
> For dictionaries and such, that is, a large corpus of data that doesn't
> change so much, in other words, that it is not transactional, why don't
> you use specialised software?

well, in principle ontologies are more than dictionaries, in the sense
that they are not just a flat list of terms: they also contain
relationships among them (synonyms, antonyms, hierarchies...). Moreover,
they are curated and they evolve over time, as they usually contain a
restricted set of officially recognized terms that are meaningful in a
given context.
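
(To make that concrete: a term in such an ontology is less a dictionary
entry than a small record of relations. The terms and field names below
are purely illustrative, not any real taxonomy format:)

  # Illustrative only: what distinguishes an ontology entry from a
  # plain word list is the relations attached to each term.
  ontology = {
      "muon": {
          "broader":  ["lepton"],          # is-a hierarchy
          "synonyms": ["mu lepton"],       # alternative labels
          "related":  ["muon decay"],
      },
      "lepton": {
          "broader":  ["elementary particle"],
          "synonyms": [],
          "related":  ["quark"],
      },
  }

  def labels(term):
      """All labels under which a concept can be looked up."""
      return [term] + ontology.get(term, {}).get("synonyms", [])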

> Enter http://dict.org, a protocol (http://www.dict.org/rfc2229.txt) and
> a canonical implementation for dictionaries, blazingly fast, veteran and
> well known (see for example http://packages.debian.org/dictd and
> http://packages.debian.org/dict), plus several other implementations
> (http://www.dict.org/w/software/start), among others in Python (even
> curl is a dict client).
> 
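(Side note: the protocol is simple enough to speak from a few lines of
Python without any client library. A rough sketch, assuming a dictd
listening on localhost:2628 and skipping most error handling:)

  import socket

  def define(word, database="md5sums", host="localhost", port=2628):
      """Send a single RFC 2229 DEFINE and return the definitions."""
      definitions = []
      with socket.create_connection((host, port)) as sock:
          f = sock.makefile("rw", encoding="utf-8", newline="")
          f.readline()                              # 220 banner
          f.write("DEFINE %s %s\r\n" % (database, word))
          f.flush()
          if f.readline().startswith("552"):        # no match
              return definitions
          # after "150 n definitions", each definition is announced by
          # a 151 line and terminated by a line holding a single "."
          line = f.readline()
          while line and not line.startswith("250"):
              if line.startswith("151"):
                  body = []
                  for text in f:
                      text = text.rstrip("\r\n")
                      if text == ".":
                          break
                      if text.startswith(".."):     # undo dot-stuffing
                          text = text[1:]
                      body.append(text)
                  definitions.append("\n".join(body))
              line = f.readline()
          f.write("QUIT\r\n")
          f.flush()
      return definitions
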
> Creating and indexing a dict server of about half a million entries using
> the standard dict.org utilities takes less than a minute, and the
> searches are resolved in milliseconds, for positive, negative, or
> approximate answers, for example:
> 
>  $ time dict -h localhost 00075743be0748c4965848c62c2f5a70
>  1 definition found
> 
>  From unknown [md5sums]:
> 
>   00075743be0748c4965848c62c2f5a70
>      00075743be0748c4965848c62c2f5a70  
> /mnt/VOLUM-I/3-12/ddd/veterinaria/revhigsanvet/tif/revhigsanvet_a1915m11t5n8/revhigsanvet_a1915m11t5n8_21.tif
>      00075743be0748c4965848c62c2f5a70  
> /mnt/VOLUM-Ib/3-12/ddd/veterinaria/revhigsanvet/tif/revhigsanvet_a1915m11t5n8/revhigsanvet_a1915m11t5n8_21.tif
> 
>  real 0m0.004s
>  user 0m0.000s
>  sys  0m0.000s
> 
>  $ time dict -h localhost 00075743be0748c4
>  No definitions found for "00075743be0748c4"
> 
>  real 0m0.004s
>  user 0m0.000s
>  sys  0m0.000s
> 
>  $ time dict -h localhost 00075743be0748c4965848c62c2f5a7
>  No definitions found for "00075743be0748c4965848c62c2f5a7", perhaps you mean:
>  md5sums:  00075743be0748c4965848c62c2f5a70
> 
>  real 0m0.006s
>  user 0m0.000s
>  sys  0m0.000s
> 
>  $ dict -h localhost -I
>   dictd 1.10.11/rf on Linux 2.6.26-2-amd64
>   On nuix.uab.es: up 21+03:06:13, 813 forks (1.6/hour)
>   
>   Database      Headwords         Index     Data  Uncompressed
>   md5sums      580225              23 MB    32 MB        165 MB
> 
> 
>  $ dict -h dict.org -I
>   dictd 1.9.15/rf on Linux 2.6.30-bpo.1-686
>   On miranda.org: up 51+17:54:33, 16914217 forks (13619.5/hour)
>   
>   Database      Headwords         Index          Data  Uncompressed
>   gcide          203645       3859 kB         12 MB         38 MB
>   wn             154563       3089 kB       8744 kB         26 MB
>   moby-thes       30263        528 kB         10 MB         28 MB
>   elements          130          2 kB         14 kB         45 kB
>   vera             9203        103 kB        160 kB        558 kB
>   jargon           2374         42 kB        621 kB       1430 kB
>   [...]
> 
> 
> Part of this speed comes from the input file for the dictionary being
> sorted at creation time, so that lookups are binary searches on a
> mmapped index file.
> 
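(That trick is easy to reproduce. Here is a rough Python sketch of the
same idea -- binary search over a sorted, tab-delimited, mmapped index
-- though this is not dictd's actual code:)

  import mmap

  def index_lookup(path, key):
      """Return the first index line starting with 'key<TAB>', else None."""
      needle = key.encode() + b"\t"
      with open(path, "rb") as f:
          with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
              lo, hi = 0, len(m)
              while lo < hi:
                  mid = (lo + hi) // 2
                  # back up to the start of the line containing 'mid'
                  start = m.rfind(b"\n", 0, mid) + 1
                  end = m.find(b"\n", start)
                  if end == -1:
                      end = len(m)
                  line = m[start:end]
                  if line.startswith(needle):
                      return line.decode()
                  elif line < needle:       # key sorts after this line
                      lo = end + 1
                  else:                     # key sorts before this line
                      hi = start
              return None

Only O(log n) pages of the index are ever touched, which is why even a
half-million-entry dictionary answers in milliseconds.
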
> As the protocol is inherently client-server, the same ontology
> (dictionary) can be (re-)used among different Invenio instances.  It is
> not a toy: I haven't managed to put any noticeable load on my instance,
> even when querying it massively.  You can follow part of my experiments
> here:
> http://news.gmane.org/gmane.network.protocols.dict.user
> 
> Sorry, I had to say it,

actually this is a very cool hack :-) It might be very useful for the
use case of bibclassify, where the goal is (AFAIK) to find the most
represented keywords in a text. One might take the terms of these
ontologies and create dictionaries for dict... I don't know though
whether this can be used in semen (Roman?), but that is another
topic :-)
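
Something along these lines, say -- purely a sketch, with a made-up
"taxonomy" database name, certainly not bibclassify's actual algorithm,
and assuming dict's zero/non-zero exit status distinguishes hits from
misses:

  import re
  import subprocess
  from collections import Counter

  def keyword_counts(text, database="taxonomy", host="localhost", top=10):
      """Count occurrences of words that exist in the given dict database."""
      counts = Counter()
      words = re.findall(r"[\w-]+", text.lower())
      for word in set(words):              # one lookup per distinct word
          hit = subprocess.run(
              ["dict", "-h", host, "-d", database, word],
              capture_output=True,
          )
          if hit.returncode == 0:          # dict exits 0 on a match
              counts[word] = words.count(word)
      return counts.most_common(top)

Multi-word ontology terms would of course need n-gram lookups rather
than single words, but the millisecond negative answers shown above are
exactly what would make such a brute-force scan affordable.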

Cheers!
Sam

-- 
Samuele Kaplun
Invenio Developer ** <http://invenio-software.org/>
