Hi Viktor, and welcome to the Apache Stanbol community.

On 20/11/13 18:02, Viktor Gal wrote:
Hi,

i've just started to use stanbol about a week ago and i must say it's a great 
tool! kudos to all the developers!

i'm now trying to import and index the latest freebase data set and one thing 
came into my mind that maybe it would be great to add other indexer engine 
interfaces to stanbol, that can handle large corpora like http://terrier.org/
With the current indexer, you need a well-equipped machine (preferably with SSD disks and/or several GBs of RAM) to build the site. Rupert can give you more details but, AFAIK, you first need a lot of RAM for the entity scoring step. After that, all the triples are stored in a JenaTDB-based triple store (which implies a huge load of disk I/O operations) to allow some pre-processing (like LDPath-based entity filtering) before the entities are finally indexed in a Yard. So the bottleneck is storing the triples in JenaTDB, not indexing the entities in a Yard (at least with a SolrYard).

In general, site building (indexing) is not a task that you need to do very often, so, in my honest opinion, I am not sure it is worth having a distributed process for it. After indexing, the current Yards seem to perform very well for searching. Also, with recent versions of Solr or SolrCloud, it is possible to distribute the index.

as terrier is mapreduce based (i.e. hadoop) it'd be great to have a mapred 
based RDF storage and this way we could easily calculate for example real 
PageRank values on the freebase data set by using mahout's pagerank 
implementation.
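
For reference, the PageRank computation mentioned above can be sketched on a single machine as a plain power iteration. This is only an illustration in Python, not Mahout's implementation and not a MapReduce job; the function name `pagerank`, the `links` adjacency dictionary, and the default damping factor of 0.85 are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [linked nodes]}.

    Illustrative sketch only: single machine, in memory, hypothetical names.
    """
    # Collect every node, including ones that only appear as link targets.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node passes an equal share of its rank to every link target.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share

        rank = new_rank

    return rank


if __name__ == "__main__":
    toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for node, score in sorted(pagerank(toy_graph).items()):
        print(node, round(score, 4))
```

On a data set the size of freebase the link structure would not fit in one dictionary, which is where a distributed implementation would come in.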

anybody maybe knows a good mapred based RDF storage? i've seen some people 
talking about HBase...
That would be very nice in my opinion, although I'm still not sure about two things: how a distributed triple store would work, and whether it would really solve the storage problem. So far, we have experimented with graph databases like Neo4J, which provide RDF store capabilities through the Blueprints Sail implementation [1]. TitanDB and OrientDB are examples of distributed graph databases that also have Blueprints implementations, but we haven't tried them yet.

Regarding the JenaTDB bottleneck, I have been working on a workaround for indexing the entities in a Yard without passing through the triple store, a kind of streaming indexing: from the dump directly to the Yard. It means you cannot do pre-processing such as LDPath filtering or transformations, but if you don't need that, the indexing time is significantly reduced. I should have committed it today, but I'm currently having issues with my Maven version when building Stanbol, so I will commit it as soon as I solve them. It would be nice if someone else could test it.
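
To illustrate the streaming idea, here is a minimal sketch in Python. It is not the actual Stanbol code, only the general shape of the approach. It assumes the dump is in a simple N-Triples form and that all triples of a subject appear contiguously; `parse_triples`, `stream_entities`, `index_dump` and the `store` callback (standing in for the Yard) are hypothetical names.

```python
import re
from itertools import groupby

# Simplified N-Triples line: <subject> <predicate> object .
# Assumption: subjects and predicates are IRIs (no blank nodes).
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')


def parse_triples(lines):
    """Yield (subject, predicate, object) for each well-formed line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match:
            yield match.group(1), match.group(2), match.group(3)


def stream_entities(lines):
    """Yield (subject, {predicate: [objects]}) one entity at a time.

    Only the triples of the current subject are held in memory, which
    works only if each subject's triples are contiguous in the dump.
    """
    triples = parse_triples(lines)
    for subject, group in groupby(triples, key=lambda t: t[0]):
        doc = {}
        for _, predicate, obj in group:
            doc.setdefault(predicate, []).append(obj)
        yield subject, doc


def index_dump(lines, store):
    """Pass every entity to store(subject, doc); return the entity count."""
    count = 0
    for subject, doc in stream_entities(lines):
        store(subject, doc)
        count += 1
    return count


if __name__ == "__main__":
    sample = [
        '<http://ex.org/a> <http://ex.org/name> "Alice" .',
        '<http://ex.org/a> <http://ex.org/knows> <http://ex.org/b> .',
        '<http://ex.org/b> <http://ex.org/name> "Bob" .',
    ]
    yard = {}  # a plain dict standing in for the Yard
    print(index_dump(sample, yard.__setitem__), "entities indexed")
```

Because nothing is kept between entities, there is no place to run filters or transformations that need the whole graph, which is the trade-off described above.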

Regards,
Rafa

[1] - https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation

of course this would require some work both in terrier and mahout, but then 
again for data sets like freebase this would make a lot of things faster/easier 
(if one has the cluster for it).

happy to see comments on this!

cheers,
viktor

