Hi all

Thanks Rupert.

Regarding the algorithm used by [5] to create the relations, I'm going to
update the documentation and let you all know.

Regards


On Mon, Jul 29, 2013 at 3:28 PM, Rupert Westenthaler <
rupert.westentha...@gmail.com> wrote:

> Hi Antonio, all
>
> As Antonio's mentor, I will try to give some additional information
> about the progress and results of this GSoC project.
>
> On Mon, Jul 29, 2013 at 10:15 AM, Antonio Perez <ape...@zaizi.com> wrote:
> > Hi all
> >
> > Since this week is the midterm evaluation of the GSoC projects, I want to
> > give you a status update on this project.
> >
> > I began my project trying to index Freebase data using the Freebase
> > indexer in Stanbol, but this process was too expensive on a normal
> > computer (about 8 GB RAM and a non-SSD hard disk).
> >
> > I was able to create a Referenced Site with Freebase data using Rupert's
> > index (generated using an SSD hard disk).
> >
>
> The very same issue also affected dileepaj (the other GSoC project) a
> lot. The reason is that importing RDF triples into a triple store is
> very expensive IO-wise. For every RDF triple the following steps need
> to be performed (see the toy sketch after the list):
>
> * an existence check for every imported RDF resource (subject,
> predicate, object)
> * an existence check for the RDF triple itself
> * creation of the non-existing nodes and the triple, including an
> update of the indexes
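>
> A minimal toy sketch (in Java) of that per-triple bookkeeping - this
> is not the actual TDB implementation, just the node table and SPO
> index every triple store conceptually has to maintain:
>
>   import java.util.HashMap;
>   import java.util.HashSet;
>   import java.util.Map;
>   import java.util.Set;
>
>   public class ToyTripleStore {
>       private final Map<String, Long> nodeTable = new HashMap<String, Long>();
>       private final Set<String> spoIndex = new HashSet<String>();
>       private long nextId = 0;
>
>       /** Existence check for a resource; creates the node if missing. */
>       private long nodeId(String resource) {
>           Long id = nodeTable.get(resource);
>           if (id == null) {              // lookup (the expensive part
>               id = nextId++;             // once the table exceeds RAM)
>               nodeTable.put(resource, id);
>           }
>           return id;
>       }
>
>       /** The three per-triple steps listed above. */
>       public void add(String s, String p, String o) {
>           String key = nodeId(s) + " " + nodeId(p) + " " + nodeId(o);
>           if (!spoIndex.contains(key)) { // triple existence check
>               spoIndex.add(key);         // index update
>           }
>       }
>   }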
>
> Jena TDB uses memory-mapped files for holding this information. That
> means it scales well as long as those lookup tables fit into memory
> (the files can be mapped to RAM). If you exceed that limit, a lot of
> disk IO is generated. With an SSD you can still expect an import rate
> of about 1-5k triples/sec, but with a normal disk the import speed is
> far too low for large datasets.
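>
> For reference, loading a dump into TDB looks roughly like this (the
> file name is a placeholder; this assumes the Jena TDB API as of 2013):
>
>   import com.hp.hpl.jena.query.Dataset;
>   import com.hp.hpl.jena.rdf.model.Model;
>   import com.hp.hpl.jena.tdb.TDB;
>   import com.hp.hpl.jena.tdb.TDBFactory;
>
>   public class TdbImport {
>       public static void main(String[] args) {
>           // creates/opens a TDB store backed by memory-mapped files
>           Dataset dataset = TDBFactory.createDataset("tdb-store");
>           Model model = dataset.getDefaultModel();
>           // every triple read here triggers the lookups listed above
>           model.read("file:freebase-dump.nt", "N-TRIPLES");
>           TDB.sync(dataset);
>           dataset.close();
>       }
>   }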
>
> As Antonio did not have access to a server with enough resources (even
> the biggest server you can rent on Amazon is not sufficient) we decided
> NOT to continue this part within the GSoC project. However, as this is
> a problem that also affects other users, we decided that we would still
> like to have a workaround for this limitation.
>
> > Currently, Rafa Haro is working on the Jena TDB part of the indexer in
> > order to speed up the process of indexing Freebase data.
>
> Because of that, Rafa volunteered to create an IndexingSource for the
> Entityhub Indexing Tool that directly "streams" triples without adding
> them to a triple store beforehand. This indexing source will have some
> limitations:
>
> * Triples need to be sorted by SPO - but this is true for all Turtle
> and N3 formatted RDF dumps
> * This indexing source cannot support LDPath, which limits to some
> extent the processing that can be done on entities
>
> However, this implementation will have very low RAM requirements, and
> disk IO will be limited to what Solr needs to store the indexed
> entities (see the sketch below for why sorted input makes this
> possible).
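>
> To illustrate why SPO-sorted input allows streaming: all triples of an
> entity are adjacent, so the source can collect one entity at a time
> and hand it to the indexer without ever storing the whole dataset. A
> minimal Java sketch (names are illustrative, not the actual
> IndexingSource implementation):
>
>   import java.io.BufferedReader;
>   import java.io.FileReader;
>   import java.io.IOException;
>   import java.util.ArrayList;
>   import java.util.List;
>
>   public class SortedTripleStreaming {
>       public static void main(String[] args) throws IOException {
>           BufferedReader in = new BufferedReader(new FileReader("dump-sorted.nt"));
>           String currentSubject = null;
>           List<String> entityTriples = new ArrayList<String>();
>           String line;
>           while ((line = in.readLine()) != null) {
>               if (line.isEmpty()) continue;
>               // in N-Triples the subject ends at the first whitespace
>               String subject = line.substring(0, line.indexOf(' '));
>               if (!subject.equals(currentSubject)) {
>                   if (currentSubject != null) {
>                       index(currentSubject, entityTriples); // e.g. to Solr
>                   }
>                   currentSubject = subject;
>                   entityTriples.clear();
>               }
>               entityTriples.add(line);
>           }
>           if (currentSubject != null) {
>               index(currentSubject, entityTriples);
>           }
>           in.close();
>       }
>
>       static void index(String subject, List<String> triples) {
>           System.out.println(subject + ": " + triples.size() + " triples");
>       }
>   }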
>
> Excluding this task from the GSoC project and assigning it to Rafa
> allowed us to continue without much delay. A big thanks to Rafa
> (currently on vacation) for taking this over!
>
> >
> > The next task was to parse the Wikilinks extended dataset [1] and store
> > it in a Jena TDB database, so that the contained information can be used
> > for tasks like disambiguation.
> > Moreover, a service has been created (along with the parser tool) to
> > query the data and retrieve information about Wikilinks items. The code
> > and more information about this library can be found at [2].
> >
>
> We were also discussing implementing a more generic TrainingSet
> service based on this work. This service would allow users to:
>
> * publish manually annotated content
> * consume annotations for training NLP components or for building
> knowledge bases for disambiguation algorithms.
>
> This service would allow managing existing annotation sets (such as
> Wikilinks) as well as publishing user-made annotations (e.g. created
> by confirming/rejecting suggestions provided by Stanbol).
>
> This service would also replace the TrainingSet service currently part
> of the TopicEngine.
>
> The implementation of such a service will not be part of this GSoC
> project, as it is outside the original goals stated in the proposal.
> Rafa and I will work, with some help from Antonio, on an initial
> design when we are back from vacation (2nd half of August). As soon as
> the Jira issue is created I will link it to this thread; a rough
> sketch of the idea follows below.
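>
> Purely hypothetical sketch of what such a service interface could look
> like (all names are placeholders; the actual design will be worked out
> in the Jira issue):
>
>   /** One annotation: a mention in a text linked to an entity. */
>   class Annotation {
>       public String contentUri;  // the annotated content item
>       public String mention;     // surface form in the text
>       public String entityUri;   // linked entity (e.g. a Freebase topic)
>       public boolean confirmed;  // user confirmed vs. rejected suggestion
>   }
>
>   /** Hypothetical TrainingSet service covering the two use cases above. */
>   public interface TrainingSetService {
>       /** Publish manually created/confirmed annotations. */
>       void addAnnotations(Iterable<Annotation> annotations);
>       /** Consume annotations, e.g. to train NLP components or to build
>        *  knowledge bases for disambiguation algorithms. */
>       Iterable<Annotation> getAnnotations(String entityUri);
>   }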
>
> > Ideally, when the new Freebase indexer is finished and tested, I would
> > like to integrate Freebase data and Wikilinks data in the same referenced
> > site, because the Wikilinks extended dataset contains references to
> > Freebase entities, so it is relatively easy to link both datasets. But
> > for now, we can use the Wikilinks information for other tasks.
>
> Here the idea is to build a dataset for Freebase that can be used with
> the existing Solr MLT based enhancement engine (the result of last
> year's GSoC project). However, for me this does not have a high
> priority, as I would like Antonio to keep his focus on the graph-based
> disambiguation.
>
> >
> > In order to finish the work for the midterm, I have developed a tool to
> > import Freebase data from the BaseKBLime data dump [3] into a graph
> > database (currently Neo4j, accessed via the Tinkerpop Blueprints
> > interfaces [4]).
> > Moreover, a simple algorithm to "weight" the graph is applied during the
> > import process.
> > The code and more information about this tool can be found at [5].
> >
> > With this information, I have a knowledge base that can be used to
> > develop new graph-based disambiguation algorithms.
> >
>
> FYI: BaseKBLime is a cleaned-up version of the Freebase RDF dump.
> Tinkerpop Blueprints defines a generic Java API for property graphs.
> Property graphs allow key/value properties on both the nodes and the
> edges of a graph. In this case we use those properties to store
> information about the origin and the weights of relations between
> concepts (see the example below).
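>
> For those unfamiliar with Blueprints, a small self-contained example
> of a property graph (the property names are illustrative, not
> necessarily those used by [5]):
>
>   import com.tinkerpop.blueprints.Edge;
>   import com.tinkerpop.blueprints.Graph;
>   import com.tinkerpop.blueprints.Vertex;
>   import com.tinkerpop.blueprints.impls.tg.TinkerGraph;
>
>   public class PropertyGraphExample {
>       public static void main(String[] args) {
>           // in-memory reference implementation; Neo4jGraph implements
>           // the same Graph interface, so the code is identical for Neo4j
>           Graph graph = new TinkerGraph();
>
>           Vertex obama = graph.addVertex(null);
>           obama.setProperty("mid", "/m/02mjmr");  // Barack Obama
>           Vertex usa = graph.addVertex(null);
>           usa.setProperty("mid", "/m/09c7w0");    // United States
>
>           // key/value properties on the edge hold origin and weight
>           Edge e = graph.addEdge(null, obama, usa, "related");
>           e.setProperty("origin", "freebase");
>           e.setProperty("weight", 0.8d);
>
>           graph.shutdown();
>       }
>   }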
>
> Note that while Blueprints is ASL-compatible (BSD), the currently used
> database, Neo4j, is not. However, there are a lot of other database
> options that are compatible with the Apache License.
>
> @Antonio: can you please extend the documentation of [5] with a real
> example for the "Algorithm used to create the relations", especially
> for "mediated relations"? The current examples with aaa.bbb.ccc are
> really hard to follow. A real example based on a Freebase topic would
> make the idea easier to understand.
>
> > That is the work done so far for the midterm.
> >
> > The expected work for the second part is to develop a disambiguation
> > algorithm using the generated graph. To do this, I am looking at two
> > papers ([6] and [7]) for ideas on developing a new algorithm.
> >
>
> As mentioned, the current plan for the 2nd part of the GSoC project is
> to focus on graph-algorithm-based disambiguation(s) and to implement
> one of them on top of the Tinkerpop API.
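>
> As a starting point, here is a naive sketch of one family of such
> algorithms: score every candidate entity of a mention by the summed
> weights of its edges to the candidates of the other mentions in the
> text, and pick the best-scoring candidate. This is not the algorithm
> from [6] or [7], just an illustration of how the weighted graph can
> be used:
>
>   import com.tinkerpop.blueprints.Direction;
>   import com.tinkerpop.blueprints.Edge;
>   import com.tinkerpop.blueprints.Vertex;
>   import java.util.List;
>   import java.util.Set;
>
>   public class GraphDisambiguation {
>       /** Picks the candidate best connected to the context candidates. */
>       public static Vertex disambiguate(List<Vertex> candidates,
>                                         Set<Vertex> contextCandidates) {
>           Vertex best = null;
>           double bestScore = -1d;
>           for (Vertex candidate : candidates) {
>               double score = 0d;
>               for (Edge e : candidate.getEdges(Direction.BOTH)) {
>                   Vertex other = e.getVertex(Direction.IN).equals(candidate)
>                           ? e.getVertex(Direction.OUT)
>                           : e.getVertex(Direction.IN);
>                   if (contextCandidates.contains(other)) {
>                       Number w = (Number) e.getProperty("weight");
>                       score += (w == null) ? 0d : w.doubleValue();
>                   }
>               }
>               if (score > bestScore) {
>                   bestScore = score;
>                   best = candidate;
>               }
>           }
>           return best;
>       }
>   }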
>
> >
> > That's all, folks. Please feel free to comment; comments are more than
> > welcome.
>
> Thanks, Antonio, for all the work you have put into your GSoC project.
> Your topic is very hard and you have had to overcome or work around a
> lot of problems so far. I am sure that this work can be a major
> contribution to the Stanbol community. To make this happen we need to
> inform the community more frequently, especially about topics like:
>
> * the streaming RDF indexing source
> * a link to the Wikilinks RDF file as created by [2]
> * the TrainingSet service
> * the graph as imported by [5] (similar to how the Entityhub Indexing
> Tool creates an index that can be loaded into Stanbol)
> * the disambiguation engine(s) to be implemented based on the graph
> generated by [5]
>
> Looking forward to an exciting 2nd part of GSoC 2013.
> best
> Rupert
>
> >
> > Best regards
> >
>
>
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
