Re: distributed indexing

Rafa Haro Wed, 20 Nov 2013 10:23:28 -0800

Hi Viktor, and welcome to the Apache Stanbol community.


On 20/11/13 18:02, Viktor Gal wrote:

Hi,

i've just started to use stanbol about a week ago and i must say it's a great 
tool! kudos to all the developers!

i'm now trying to import and index the latest freebase data set and one thing 
came into my mind that maybe it would be great to add other indexer engine 
interfaces to stanbol, that can handle large corpora like http://terrier.org/

With the current indexer, you are going to need a well-equipped machine (preferably with SSD disks and/or several GBs of RAM) for building the site. Rupert can give you more details but, AFAIK, first of all, you would need a lot of RAM for the entity scoring step. After that, all the triples are first stored in a JenaTDB based triple store (which implies a heavy load of disk I/O operations) in order to allow some pre-processing (like LDPath based entity filtering) before finally indexing the entities in a Yard. So, the bottleneck occurs while storing the triples in JenaTDB and not while indexing the entities in a Yard (at least with a SolrYard).

Site building (indexing) is not a task that you usually need to do very often. Therefore, in my honest opinion, I am not sure it is worth having a distributed process for it, and after indexing, the current yards seem to perform very well for searching. Also, with the latest versions of Solr or SolrCloud, it is possible to distribute the index.


as terrier is mapreduce based (i.e. hadoop) it'd be great to have a mapred 
based RDF storage and this way we could easily calculate for example real 
PageRank values on the freebase data set by using mahout's pagerank 
implementation.
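
To make the PageRank idea concrete, here is a minimal single-machine sketch of the power-iteration method on a tiny link graph. It is not Mahout's implementation and it is not MapReduce based; the function name `pagerank`, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration. In a MapReduce setting, each loop iteration would roughly correspond to one map/reduce pass over the edges.

```python
from typing import Dict, Iterable, List, Tuple


def pagerank(edges: Iterable[Tuple[str, str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over (source, target) edges (illustrative only)."""
    nodes = set()
    out_links: Dict[str, List[str]] = {}
    for src, dst in edges:
        nodes.add(src)
        nodes.add(dst)
        out_links.setdefault(src, []).append(dst)

    n = len(nodes)
    if n == 0:
        return {}

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Every node passes an equal share of its rank to each target.
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank

    return rank
```

For an RDF data set, the edges would be the (subject, object) pairs of triples whose object is another entity; that mapping is also an assumption of this sketch.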

anybody maybe knows a good mapred based RDF storage? i've seen some people 
talking about HBase...

That would be very nice in my opinion, although I'm still not sure about two things: how a distributed triple store would work and whether it would really solve the storage problem. So far, we have experimented with graph databases like Neo4J, which provide RDF store capabilities through the Blueprints Sail Implementation [1]. TitanDB and OrientDB are examples of distributed graph databases that also have Blueprints implementations, but we haven't tried them yet.

Regarding the JenaTDB bottleneck problem, I have been working on a workaround for indexing the entities in a Yard without passing through the triple store, something like streaming indexing: from the dump directly to the Yard. It implies that you are not going to be able to do pre-processing such as LDPath filtering or transformations, but if you don't need it, the indexing time is significantly reduced. I should have committed it today, but I'm currently having issues with my Maven version for building Stanbol, so I will do it as soon as I solve them. It would be nice if someone else could test it.
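
The streaming idea can be sketched roughly as follows. This is only an illustration in Python, not the Stanbol code that is still to be committed: it assumes the dump is in N-Triples format with all triples of an entity on consecutive lines, it only handles IRI subjects and predicates, and the names `parse_triples`, `stream_entities`, `index_dump`, and the `store` callback (standing in for a Yard) are hypothetical.

```python
import re
from itertools import groupby
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Simplified N-Triples pattern: <subject> <predicate> object .
# Assumption: blank-node subjects and escaped '>' characters are not handled.
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')

Triple = Tuple[str, str, str]
Fields = Dict[str, List[str]]


def parse_triples(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield (subject, predicate, object) for each parsable line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match:
            yield match.group(1), match.group(2), match.group(3)


def stream_entities(lines: Iterable[str]) -> Iterator[Tuple[str, Fields]]:
    """Group consecutive triples by subject; only one entity is held in memory."""
    for subject, group in groupby(parse_triples(lines), key=lambda t: t[0]):
        fields: Fields = {}
        for _, predicate, obj in group:
            fields.setdefault(predicate, []).append(obj)
        yield subject, fields


def index_dump(lines: Iterable[str],
               store: Callable[[str, Fields], None]) -> int:
    """Send each entity straight to `store`, with no triple store in between."""
    count = 0
    for subject, fields in stream_entities(lines):
        store(subject, fields)
        count += 1
    return count
```

Passing an open file object as `lines` keeps memory usage bounded by the largest single entity. Because no triple store is involved, a step that needs to follow links between entities (as LDPath does) has nothing to query, which is the trade-off mentioned above. If the dump is not grouped by subject, the same entity would be stored more than once, so such a dump would need sorting first.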


Regards,
Rafa

[1] - https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation


of course this would require some work both in terrier and mahout, but then 
again for data sets like freebase this would make a lot of things faster/easier 
(if one has the cluster for it).

happy to see comments on this!

cheers,
viktor
