Hi Sam,

Even if you manage to store/load it into bibclassify, you will
probably wait forever -- my recent tests of seman vs bibclassify on a
corpus of 1830 docs show that bibclassify processes them in 186 mins
(~6s per doc) and seman in 15 mins. And that is with the HEP taxonomy,
which has only 40K entries! How many patterns are there inside
EuroVoc?
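
For comparison, the per-document numbers work out roughly like this
(a trivial back-of-the-envelope check, nothing more):

    # Rough per-document throughput from the 1830-doc test above.
    docs = 1830
    bibclassify = 186 * 60.0 / docs   # ~6.1 s per doc
    seman = 15 * 60.0 / docs          # ~0.5 s per doc
    print(bibclassify / seman)        # seman is ~12x faster here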

Seman is prepared to handle multi-language dictionaries and to use
only part of the taxonomy, and it can handle big taxonomies (how big
I don't know; I had no speed problems running it with a 200K-entry
dictionary from WordNet) --- but I don't yet have real numbers
comparing the accuracy of the extraction mechanism; we are working on
that with Juan. If those numbers are good (i.e. if seman can extract a
good portion of what bibclassify does), then you will be better off
with seman than with bibclassify. But that would first require
converting the SKOS into the seman format.
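
Just to sketch what that conversion could look like -- assuming seman
can take a flat "concept URI <tab> language <tab> label" dictionary,
which is my simplification of its format; the rdflib/SKOS part is
standard (and you would run this once, on a machine with enough RAM
to hold the graph):

    # Hedged sketch: dump SKOS prefLabels into a flat dictionary file.
    # The output layout stands in for the real seman format.
    import codecs
    from rdflib import Graph, Namespace, RDF

    SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

    g = Graph()
    g.parse("eurovoc.rdf")  # hypothetical local copy of EuroVoc

    out = codecs.open("eurovoc_seman.dic", "w", "utf-8")
    for concept in g.subjects(RDF.type, SKOS.Concept):
        for label in g.objects(concept, SKOS.prefLabel):
            lang = label.language or "und"
            out.write(u"%s\t%s\t%s\n" % (concept, lang, label))
    out.close()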

And about having the cache in a database: I have done it with seman
before and the performance was terrible -- the I/O really limits you,
by several orders of magnitude, like waiting 25 min instead of 5 s.
Really too bad (but I was using sqlite, so maybe it would be fine
with mysql; for seman that would be easy to test, since it uses
sqlalchemy and can therefore connect to any major database -- but we
would first have to develop something proprietary for that).
Bibclassify would be hit less in this respect, because it does not
access the cache so often, but on the other hand it accesses ALL
objects in the cache for every document -- so I cannot say whether it
would be worse or better, but very likely it will be very slow
anyway.
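
If you want to try the mysql variant, on the seman side it should be
little more than a different connection URL (the credentials and
database name below are placeholders, of course):

    # Hedged sketch: swapping the cache backend in SQLAlchemy is just
    # a different connection URL; everything downstream is unchanged.
    from sqlalchemy import create_engine

    # engine = create_engine("sqlite:///seman_cache.db")          # what I tested
    engine = create_engine("mysql://user:pass@localhost/seman")    # to try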

Cheers,

  roman




On Fri, Dec 17, 2010 at 11:48 AM, Samuele Kaplun <[email protected]> wrote:
> Hi,
>
> I am starting to play a bit with the EuroVoc
>
> <http://eurovoc.europa.eu/>
>
> ontology in order to integrate it into OpenAIRE Orphan Record
> Repository, for automatic keyword extraction for EU documents.
>
> This ontology is *big*! And it is multilingual. I can't even load it
> with RDFLIB on my laptop (4GB of RAM).
>
> I am currently trying to open it on a 24GB machine: it has already
> filled up 8GB and is still loading!
>
> I was wondering if it makes sense at all to try to store a huge RDF/SKOS
> into a database table (see:
>
> <http://code.google.com/p/rdflib/wiki/SQL_Backend>
>
> ) to improve performance. Would this be useless WRT the cache that
> BibClassify is building? Maybe it would help before the cache has been
> created?
>
> Cheers,
> Sam
>
> P.s. I noticed in the BibClassify code that the cache is created
> using cPickle with protocol version 1. Just out of curiosity, why
> wasn't protocol version 2 (or -1) used? Have you experienced any
> performance degradation with higher protocol versions?
> --
> Samuele Kaplun
> Invenio Developer ** <http://invenio-software.org/>
>
>
