Hi Sam,

Even if you manage to store/load it into bibclassify, you will probably wait forever -- my recent tests of seman vs bibclassify on a corpus of 1830 docs show that bibclassify processes them in 186 min (~6 s per doc) and seman in 15 min. And this is the HEP taxonomy, with only 40K entries! How many patterns are there inside EuroVoc?

Seman is prepared to handle multi-language dictionaries and to use only part of a taxonomy, and it can handle big taxonomies (how big I don't know, but I had no speed problems running it with a 200K-entry dictionary from WordNet). I don't yet have real numbers comparing the accuracy of the extraction mechanism, but we are working on that with Juan. If those numbers are good (i.e. if seman can extract a good portion of what bibclassify does), then you will be better off with seman than with bibclassify. But that would first require converting the SKOS file into the seman format.
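For the conversion, something along these lines could be a starting point -- an untested sketch with rdflib that dumps one "label <TAB> concept URI" pair per line. I am only assuming seman can be fed such a flat list; you would have to check its real input format:

    from rdflib import Graph, Namespace

    SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

    g = Graph()
    g.parse("eurovoc.rdf", format="xml")  # the EuroVoc SKOS/RDF dump

    # emit one "preferred label <TAB> concept URI" line per English label;
    # this flat output format is an assumption, not seman's documented input
    for concept, label in g.subject_objects(SKOS.prefLabel):
        if label.language == "en":
            print("%s\t%s" % (label, concept))

Note that this still makes rdflib hold the whole graph in memory, so it would need your big machine anyway.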
And about having the cache in a database: I have done it with seman before and the performance was terrible. The I/O really limits you, by several orders of magnitude -- like waiting 25 min against 5 s, really too bad. (But I was using sqlite, so maybe it will be fine with mysql. For seman that would be easy to test: it is using sqlalchemy, so it can connect to any major database -- but we would have to develop a proprietary ...) bibclassify would be hit less in this respect, because it is not accessing the cache so often; but on the other hand, it accesses ALL objects in the cache for every document -- so I cannot say if it would be worse or better, but very likely it will be very slow anyway.
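If you want to test the mysql idea, with sqlalchemy only the connection URL changes -- a quick sketch with made-up paths and credentials (seman's actual models are not shown here):

    from sqlalchemy import create_engine

    # what I tried -- sqlite, which turned out badly I/O-bound
    engine = create_engine("sqlite:////tmp/seman_cache.db")

    # same code path with mysql, only the URL differs
    # (needs the MySQLdb driver installed):
    # engine = create_engine("mysql://user:password@localhost/seman_cache")

    connection = engine.connect()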
Cheers,
  roman

On Fri, Dec 17, 2010 at 11:48 AM, Samuele Kaplun <[email protected]> wrote:
> Hi,
>
> I am starting to play a bit with the EuroVoc
>
> <http://eurovoc.europa.eu/>
>
> ontology in order to integrate it into the OpenAIRE Orphan Record
> Repository, for automatic keyword extraction from EU documents.
>
> This ontology is *big*! And multilingual. I can't even load it with
> RDFLIB on my laptop (4 GB of RAM).
>
> I am currently trying to open it on a 24 GB machine: it has already
> filled up 8 GB and is still loading!
>
> I was wondering if it makes sense at all to try to store a huge RDF/SKOS
> file in a database table (see:
>
> <http://code.google.com/p/rdflib/wiki/SQL_Backend>
>
> ) to improve performance. Would this be useless WRT the cache that
> BibClassify is building? Maybe it would help before the cache has been
> created?
>
> Cheers,
> Sam
>
> P.S. I noticed in the BibClassify code that the cache is created using
> cPickle with protocol version 1. Just out of curiosity, why hasn't
> protocol version 2 (or -1) been used? Have you experienced some
> degradation of performance with higher protocol versions?
> --
> Samuele Kaplun
> Invenio Developer <http://invenio-software.org/>
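For reference on the P.S.: switching pickle protocols is a one-line change to the dump call -- a minimal sketch, not the actual bibclassify cache code:

    import cPickle

    data = {"taxonomy": "...cached patterns..."}  # hypothetical payload

    f = open("bibclassify.cache", "wb")
    # protocol 1: the old binary format the cache currently uses
    # protocol 2: new-style binary, usually smaller and faster to load
    # protocol -1: shorthand for the highest protocol available
    cPickle.dump(data, f, 2)
    f.close()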