[ 
https://issues.apache.org/jira/browse/STANBOL-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018093#comment-14018093
 ] 

A. Soroka commented on STANBOL-1125:
------------------------------------

Freebase is not the only source of data that would benefit from this kind of 
tool. I've been indexing dumps of the U.S. Library of Congress' Linked Data 
([#note1]) and found the resource requirements onerous, and I'm going to have 
to set up some kind of special workflow to index the Virtual International 
Authority File ([#note2]). Many of the data sources in which I am interested 
require little or no processing of the kind we would write in LDPath programs; 
they are to be indexed virtually "plain". For those data sources that do 
require some processing of simple kinds (e.g. translating predicates), it might 
be possible to allow that as a step on a single Representation before storing 
that Representation into a Yard, instead of as a transduction over the whole 
store of RDF.
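
As a rough illustration of that per-Representation step, here is a minimal 
Python sketch of predicate translation. The dictionary standing in for a 
Representation, the name `translate_predicates`, and the mapping argument are 
all hypothetical and only meant to show the shape of the idea; this is not the 
Entityhub API.

```python
from typing import Dict, List

# Hypothetical stand-in for an Entityhub Representation:
# a mapping from predicate URI to the list of values for that predicate.
Representation = Dict[str, List[str]]


def translate_predicates(rep: Representation,
                         mapping: Dict[str, str]) -> Representation:
    """Return a copy of `rep` with predicates renamed according to `mapping`.

    Predicates absent from `mapping` are kept unchanged. If two source
    predicates map to the same target, their values are merged.
    """
    out: Representation = {}
    for predicate, values in rep.items():
        target = mapping.get(predicate, predicate)
        out.setdefault(target, []).extend(values)
    return out
```

Because the function only ever sees one Representation, it needs no access to 
the rest of the RDF data and could run just before the store into a Yard.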

I would be interested in working on this problem. It seems to me at first 
glance that, with suitable restrictions on the inputs (perhaps N-Triples files 
sorted by subject URI?), a fast streaming solution could be developed that 
needs very little memory.
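
To make the idea concrete, here is a minimal Python sketch, assuming the input 
really is N-Triples with all statements for a subject on adjacent lines (for 
example, after sorting the file by subject). The function names are 
hypothetical, and a real tool would hand each group to an indexer rather than 
return raw lines.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def subject_of(line: str) -> str:
    """Return the subject term of an N-Triples statement.

    The subject is the first whitespace-delimited token, either an IRI
    in angle brackets or a blank node label.
    """
    return line.split(None, 1)[0]


def stream_by_subject(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (subject, statements) pairs from subject-sorted N-Triples.

    Only the statements of the current subject are held in memory.
    If the input is not sorted, a subject may be yielded more than once.
    """
    stripped = (ln.strip() for ln in lines)
    # Skip blank lines and comment lines.
    statements = (ln for ln in stripped if ln and not ln.startswith("#"))
    for subject, group in groupby(statements, key=subject_of):
        yield subject, list(group)
```

Passing an open file object as `lines` keeps the whole pipeline lazy, so the 
space cost is bounded by the largest single subject rather than by the size of 
the dump.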

({anchor:note1}1) Controlled vocabularies for subject headings and name 
authorities widely used in library metadata, available at:

http://id.loc.gov/download/

({anchor:note2}2) Also a name authority system, but federated from the linked 
data of several different national libraries, available at:

http://viaf.org/viaf/data/

> Create a lightweight EntityHub Indexing Tool for Freebase
> ---------------------------------------------------------
>
>                 Key: STANBOL-1125
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1125
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Entityhub
>            Reporter: Rafa Haro
>
> Due to the enormous size of the dumps, the current Freebase indexing tool in 
> Stanbol can barely run on machines without several gigabytes of RAM and/or 
> SSD disks. The Jena TDB importer has been identified as the bottleneck of 
> the indexing process. Using an RDF database is mandatory in order to, for 
> instance, use LDPath programs at indexing time.
> The idea is to develop a lightweight indexing tool that streams data from 
> the dumps and pushes it directly to Solr. Despite losing some functionality, 
> this would make it possible for any user to generate Freebase EntityHub 
> indexes from any dump.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
