Re: distributed indexing

Viktor Gal Wed, 20 Nov 2013 10:53:31 -0800

Hi Rafa

On 20/11/2013, at 7:23 pm, Rafa Haro <[email protected]> wrote:


> Hi Viktor and welcome to the Apache Stanbol community
> 
> On 20/11/13 18:02, Viktor Gal wrote:
>> Hi,
>> 
>> i've just started to use stanbol about a week ago and i must say it's a 
>> great tool! kudos to all the developers!
>> 
>> i'm now trying to import and index the latest freebase data set and one 
>> thing came into my mind that maybe it would be great to add other indexer 
>> engine interfaces to stanbol, that can handle large corpora like 
>> http://terrier.org/
> With the current indexer, you are going to need a highly equipped machine 
> (preferably with SSD disks and/or several GBs of RAM) for building the site. 
> Rupert can give you more details but, AFAIK, first of all, you would need a 
> lot of RAM for the entity scoring step. After that, all the triples are first 
> stored in a JenaTDB based triple store (which implies a huge load of I/O disk 
> operations) in order to allow some pre-processing (like LDPath based entity 
> filtering) before finally indexing the entities in a Yard. So, the 
> computation problem occurs while storing the triples in JenaTDB and not while 
> indexing the entities in a Yard (at least with a SolrYard).

heheh yeah i've been through this. luckily i had an SSD around for the task, as 
initially i started with a simple 7200RPM hdd, and it would have taken ages... 
this way it took only 1.5 days ;P

> Site building (indexing) is not a task that you usually need to do very 
> often, so, in my honest opinion, I'm not sure it is worth having a 
> distributed process for it. After indexing, the current yards seem to 
> perform very well for searching. Also, with recent versions of Solr or 
> SolrCloud, it is possible to distribute the index.

the idea actually came up when i was talking with Rupert about generating the 
incoming links file for freebase. He told me that it would be much better if we 
could actually calculate PageRank for the pages instead of using the current 
./fbranking.sh shell script.
That's when i thought about using mahout to do it, as it wouldn't be feasible 
with other libraries for such a data set.
and since that would require storing the RDF data on HDFS anyhow, it occurred to 
me that in that case we could actually use some sort of HDFS based storage 
(HBase?) to store the RDF data and then of course we could even use a hadoop 
based indexer, like terrier.
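
to make the PageRank idea concrete, here is a minimal single-machine sketch of power-iteration PageRank over a list of (source, target) links. it is only an illustration: it is not mahout's implementation and it would not scale to freebase. the function name `pagerank`, the damping factor 0.85, the iteration count and the toy edge list are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) link pairs.

    Hypothetical helper for illustration only; not taken from Stanbol or Mahout.
    """
    nodes = set()
    out_links = {}
    for src, dst in edges:
        nodes.add(src)
        nodes.add(dst)
        out_links.setdefault(src, set()).add(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Every node passes an equal share of its rank to each link target.
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank

    return rank


# Toy graph (assumed data): "b" has two incoming links, "c" has none.
toy_edges = [("a", "b"), ("c", "b"), ("b", "a")]
toy_ranks = pagerank(toy_edges)
```

unlike a plain incoming-link count, the score of a page here also depends on the scores of the pages linking to it, which is why the computation is iterative.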

about the JenaTDB bottleneck: if the raw RDF data resided on HDFS, and there 
were an HDFS based triple store, then one could use map-reduce to load it in 
parallel, i.e. split up the data set among the hadoop nodes and let each node 
load its part into the distributed triple store.
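
a rough sketch of that split-and-load idea, on a single machine and without hadoop: N-Triples lines are partitioned by a hash of the subject, and each partition is loaded by its own worker. the in-memory dict is only a stand-in for a real distributed triple store, and the names `partition_by_subject`, `load_partition` and `parallel_load` are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_by_subject(ntriples_lines, num_partitions):
    """Assign each N-Triples line to a partition by hashing its subject."""
    partitions = [[] for _ in range(num_partitions)]
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject = line.split(None, 1)[0]
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(subject.encode("utf-8")) % num_partitions
        partitions[index].append(line)
    return partitions


def load_partition(lines):
    """Stand-in for a loader: group the rest of each triple by subject."""
    store = {}
    for line in lines:
        subject, rest = line.split(None, 1)
        store.setdefault(subject, []).append(rest)
    return store


def parallel_load(ntriples_lines, num_partitions=4):
    """Partition the input and load every partition with its own worker."""
    partitions = partition_by_subject(ntriples_lines, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(load_partition, partitions))
```

partitioning by subject keeps all triples of one entity on the same worker, which would matter for any per-entity processing done later.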

cheers,
viktor

>> as terrier is mapreduce based (i.e. hadoop) it'd be great to have a mapred 
>> based RDF storage and this way we could easily calculate for example real 
>> PageRank values on the freebase data set by using mahout's pagerank 
>> implementation.
>> 
>> anybody maybe knows a good mapred based RDF storage? i've seen some people 
>> talking about HBase...
> That would be very nice in my opinion, although I'm still not sure about two 
> things: how a distributed triple store would work, and whether it would really 
> solve the storage problem. So far, we have experimented with graph databases 
> like Neo4J providing RDF-store capabilities through Blueprints Sail 
> Implementation [1]. TitanDB and OrientDB are examples of distributed graph 
> databases also with Blueprints implementations, but we haven't tried them yet.
> 
> Regarding the JenaTDB bottleneck problem, I have been working on a workaround 
> for indexing the entities in a Yard without passing through the triple store, 
> something like Streaming indexing: from the dump directly to the Yard. It 
> implies that you are not going to be able to do some kind of pre-processing 
> like LDPath filtering or transformations, but if you don't need it, the 
> indexing time is significantly reduced. I should have committed it today but 
> currently I'm having issues with my Maven version for building Stanbol so, as 
> soon as I solve them I will do it. It would be nice if someone else can test 
> it.
> 
> Regards,
> Rafa
> 
> [1] - https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation
>> 
>> of course this would require some work both in terrier and mahout, but then 
>> again for data sets like freebase this would make a lot of things 
>> faster/easier (if one has the cluster for it).
>> 
>> happy to see comments on this!
>> 
>> cheers,
>> viktor
>> 
>
