Hi Viktor, and welcome to the Apache Stanbol community.
On 20/11/13 18:02, Viktor Gal wrote:
Hi,
I started using Stanbol about a week ago and I must say it's a great
tool! Kudos to all the developers!
I'm now trying to import and index the latest Freebase data set, and one
thing came to mind: maybe it would be great to add other indexer engine
interfaces to Stanbol that can handle large corpora, like http://terrier.org/
With the current indexer, you are going to need a well-equipped
machine (preferably with SSD disks and/or several GBs of RAM) for
building the site. Rupert can give you more details but, AFAIK, first of
all, you would need a lot of RAM for the entity scoring step. After
that, all the triples are first stored in a JenaTDB-based triple store
(which implies a heavy load of disk I/O operations) in order to allow
some pre-processing (like LDPath-based entity filtering) before finally
indexing the entities in a Yard. So, the bottleneck is storing the
triples in JenaTDB, not indexing the entities in a Yard (at least with
a SolrYard).
In principle, site building (indexing) is not a task that you need
to do very often, so, in my honest opinion, I don't know if it is
worth having a distributed process for it. After indexing, the current
yards seem to perform very well for searching. Also, with recent
versions of Solr or SolrCloud, it is possible to distribute the index.
As Terrier is MapReduce-based (i.e. Hadoop), it'd be great to have a
MapReduce-based RDF storage; this way we could easily calculate, for
example, real PageRank values on the Freebase data set by using Mahout's
PageRank implementation.
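For illustration only, here is a minimal single-machine Python sketch of the PageRank computation mentioned above, using plain power iteration over (source, target) links such as those obtained from the subject and object of RDF triples. It is not Mahout's or Hadoop's API; the function name `pagerank` and its parameters are made up for this example, and a real Freebase run would need the distributed version discussed here.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over an iterable of (source, target) pairs."""
    nodes = set()
    out_links = {}
    for src, dst in edges:
        nodes.add(src)
        nodes.add(dst)
        out_links.setdefault(src, []).append(dst)

    n = len(nodes)
    if n == 0:
        return {}

    # Start from a uniform distribution.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node passes an equal share of its rank to every target.
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank

    return rank
```

Each iteration is one pass over the links, which is the part a MapReduce job would distribute: the map step emits rank shares per target and the reduce step sums them per node.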
Does anybody know a good MapReduce-based RDF storage? I've seen some
people talking about HBase...
That would be very nice in my opinion, although I'm still not sure about
two things: how a distributed triple store would work, and whether it
would really solve the storage problem. So far, we have experimented with
graph databases like Neo4J, which provide RDF-store capabilities through
the Blueprints Sail implementation [1]. TitanDB and OrientDB are examples
of distributed graph databases that also have Blueprints implementations,
but we haven't tried them yet.
Regarding the JenaTDB bottleneck, I have been working on a
workaround for indexing the entities in a Yard without passing through
the triple store, something like streaming indexing: from the dump
directly to the Yard. This means you cannot do pre-processing such as
LDPath filtering or transformations, but if you don't need it, the
indexing time is significantly reduced. I should have committed it today,
but I'm currently having issues with my Maven version when building
Stanbol, so I will commit it as soon as I solve them. It would be nice
if someone else could test it.
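To make the streaming idea concrete, here is a minimal Python sketch. It is not the Stanbol code (which is Java) and only illustrates the principle. It assumes the dump is in N-Triples format with all triples of an entity on consecutive lines; `store` is a hypothetical callback standing in for a Yard, and the function names are made up for this example.

```python
from itertools import groupby


def parse_triple(line):
    """Split a simple N-Triples line into (subject, predicate, object)."""
    line = line.strip()
    if line.endswith("."):
        line = line[:-1].rstrip()
    # maxsplit=2 keeps literals that contain spaces in one piece.
    subject, predicate, obj = line.split(None, 2)
    return subject, predicate, obj


def stream_entities(lines):
    """Yield (subject, {predicate: [objects]}) without loading the dump."""
    triples = (
        parse_triple(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    )
    # Assumption: triples of the same subject are contiguous in the dump.
    for subject, group in groupby(triples, key=lambda t: t[0]):
        doc = {}
        for _, predicate, obj in group:
            doc.setdefault(predicate, []).append(obj)
        yield subject, doc


def index_dump(lines, store, batch_size=1000):
    """Send entities to `store` in batches; return the number indexed."""
    batch = []
    count = 0
    for entity in stream_entities(lines):
        batch.append(entity)
        if len(batch) >= batch_size:
            store(batch)
            count += len(batch)
            batch = []
    if batch:
        store(batch)
        count += len(batch)
    return count
```

Only one entity and one batch are held in memory at a time, which is why no intermediate triple store is needed. The price is the one already mentioned: anything that needs to look at other entities, such as LDPath filtering, is not possible in this mode.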
Regards,
Rafa
[1] - https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation
Of course, this would require some work in both Terrier and Mahout, but
then again, for data sets like Freebase this would make a lot of things
faster/easier (if one has the cluster for it).
Happy to see comments on this!
cheers,
viktor