[
https://issues.apache.org/jira/browse/STANBOL-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018791#comment-14018791
]
Rafa Haro commented on STANBOL-1125:
------------------------------------
Hi Soroka,
I worked on this some months ago. In fact, I coded most of the necessary stuff
as an extension of the current indexing tool. As you have pointed out, I
assumed that the dump was ordered by subject, so I was temporarily storing the
entities in memory until the subject changed, which is when I considered the
entity completely crawled. I never committed it because the sorted-by-subject
constraint sounded like a very tough one.
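For illustration, a minimal Java sketch of that buffering loop, assuming the
dump is an N-Triples file sorted by subject. The class name, the simplified
line parsing and the indexEntity() callback are placeholders of mine, not code
from the indexing tool:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class SortedDumpIndexer {

    public static void index(Reader dump) throws IOException {
        BufferedReader reader = new BufferedReader(dump);
        String currentSubject = null;
        List<String> buffered = new ArrayList<>();

        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            // In N-Triples the subject is the first whitespace-delimited token.
            int sp = line.indexOf(' ');
            if (sp < 0) {
                continue; // malformed line; real parsing would be stricter
            }
            String subject = line.substring(0, sp);
            if (currentSubject != null && !subject.equals(currentSubject)) {
                // Subject changed: the previous entity is completely crawled.
                indexEntity(currentSubject, buffered);
                buffered.clear();
            }
            currentSubject = subject;
            buffered.add(line);
        }
        if (currentSubject != null) {
            indexEntity(currentSubject, buffered); // flush the last entity
        }
    }

    // Hypothetical sink: in the real tool this would map the buffered triples
    // to a SolrInputDocument and push it to the SolrYard.
    private static void indexEntity(String subject, List<String> triples) {
        System.out.println(subject + " -> " + triples.size() + " triples");
    }
}

The dump is processed in a single streaming pass, so memory usage stays bounded
by the size of one entity instead of the whole dump.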
Now I'm actually thinking about another possible approach that would avoid that
constraint altogether: using Solr Atomic Updates
(https://wiki.apache.org/solr/Atomic_Updates), at least with the SolrYard. I
would need to take a look at how the schema is managed for the SolrYard,
because the problem with atomic updates is that all the fields must be stored
to prevent losing information.
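As a rough sketch of what an atomic update looks like through SolrJ (the core
URL, unique key and field name below are placeholders, and this assumes a
recent SolrJ client rather than the actual SolrYard code):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {

    public static void main(String[] args) throws IOException, SolrServerException {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/freebase").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "fb:m.02mjmr"); // entity URI as the unique key

        // A Map value tells Solr to apply the named operation ("add") to the
        // existing document instead of replacing the whole document.
        Map<String, Object> addLabel = new HashMap<>();
        addLabel.put("add", "Barack Obama");
        doc.addField("rdfs_label", addLabel);

        client.add(doc);
        client.commit();
        client.close();
    }
}

Each triple of an unsorted dump could then be turned into such an "add"
operation against its entity's document, at the cost noted above: every field
in the schema has to be stored, otherwise the merge drops the unstored values.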
> Create a lightweight EntityHub Indexing Tool for Freebase
> ---------------------------------------------------------
>
> Key: STANBOL-1125
> URL: https://issues.apache.org/jira/browse/STANBOL-1125
> Project: Stanbol
> Issue Type: Improvement
> Components: Entityhub
> Reporter: Rafa Haro
>
> Due to the enormous size of the dumps, the current Freebase indexing tool in
> Stanbol can barely run on machines without several gigabytes of RAM and/or SSD
> disks. The Jena TDB importer has been identified as the bottleneck of the
> indexing process. Using an RDF database is mandatory in order to, for
> instance, use LDPath programs at indexing time.
> The idea is to develop a lightweight indexing tool that streams data from the
> dumps and pushes it directly to Solr. Despite losing some functionality, this
> would make it possible for any user to generate Freebase EntityHub indexes
> from any dump.