[
https://issues.apache.org/jira/browse/STANBOL-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018817#comment-14018817
]
A. Soroka commented on STANBOL-1125:
------------------------------------
I'm not sure that the constraint of sorted-by-subject is too much to ask,
especially because with a format like N-Triples, it can be accomplished with
simple tools like POSIX "sort", but maybe that's just my bias. I do like to use
simple tools early in a processing chain. In any event, while a Solr-specific
indexer would doubtless be very useful (and I would gladly use it!) I would
ideally like to be able to use a Clerezza Yard as well. Perhaps different
strategies are appropriate for streaming into different indexing destinations…
Is there any policy for the indexing tools on this question? In other words,
does Stanbol expect to support all Yard implementations as indexing
destinations for all indexing tools, or just for the basic tool, with
"special-purpose" tools supporting various Yard impls as feasible?
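
To illustrate why the sorted-by-subject constraint helps with streaming: once an N-Triples dump has been sorted (for example with something like `LC_ALL=C sort dump.nt > sorted.nt`, where the file names are hypothetical), all triples for one subject sit on adjacent lines, so a reader can assemble one entity at a time and never hold more than that entity in memory. The sketch below is only an illustration of that idea in Python, not code from Stanbol; the helper names `subject_of` and `entities` are made up for this example, and it assumes one well-formed triple per line.

```python
import io
from itertools import groupby


def subject_of(line: str) -> str:
    """Return the subject term: everything before the first whitespace.

    N-Triples subjects are IRIs or blank nodes, neither of which may
    contain whitespace, so a simple split is enough for this sketch.
    """
    return line.split(None, 1)[0]


def entities(lines):
    """Yield (subject, [triple lines]) from subject-sorted N-Triples input.

    Only the triples of the current subject are held in memory.
    Assumes the input is already sorted so equal subjects are adjacent.
    """
    stripped = (ln.strip() for ln in lines)
    # Skip blank lines and comment lines.
    triples = (ln for ln in stripped if ln and not ln.startswith("#"))
    for subject, group in groupby(triples, key=subject_of):
        yield subject, list(group)


# Tiny in-memory stand-in for a sorted dump (example.org data is made up).
sample = io.StringIO(
    '<http://example.org/a> <http://example.org/p> "1" .\n'
    '<http://example.org/a> <http://example.org/q> "2" .\n'
    '<http://example.org/b> <http://example.org/p> "3" .\n'
)
result = [(subject, len(triples)) for subject, triples in entities(sample)]
print(result)
```

In a real pipeline, each yielded group would be handed to whatever indexing destination is in use (a Solr-specific writer, a Yard, and so on); that part is deliberately left out here because it depends on the answer to the policy question above.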
> Create a lightweight EntityHub Indexing Tool for Freebase
> ---------------------------------------------------------
>
> Key: STANBOL-1125
> URL: https://issues.apache.org/jira/browse/STANBOL-1125
> Project: Stanbol
> Issue Type: Improvement
> Components: Entityhub
> Reporter: Rafa Haro
>
> Due to the enormous size of the dumps, the current Freebase indexing tool in
> Stanbol can barely work on machines without several gigabytes of RAM and/or
> SSD disks. The JenaTDB importer has been identified as the bottleneck of the
> indexing process. Using an RDF database is mandatory in order to, for
> instance, use LDPath programs at indexing time.
> The idea is to develop a lightweight indexing tool that streams data from the
> dumps and pushes it directly to Solr. Despite losing some functionality, this
> would make it possible for any user to generate Freebase EntityHub indexes
> from any dump.