Hello Reto and Rupert,

I was looking at the same components, and two more things that should be explained more clearly are the sites (Referenced/Managed) and the Yards (Solr/Clerezza) that can be used for linking and interlinking. Below I write down my understanding of the components Stanbol currently provides for the linking and interlinking tasks, so that you can confirm or correct it, followed by some questions.
In Working with Custom Vocabularies <http://stanbol.apache.org/docs/trunk/customvocabulary.html> it is said that a Referenced or a Managed site can be used for linking. Both must be based on a Solr yard so that keyword search is possible. It may be obvious, but it should be stressed that they must hold RDF datasets if one wants to look up entities by keyword. New RDF triples can be added to a Managed site (and get indexed), but not to a Referenced site: once it has been built with the proper tool, as explained on the page cited above, it cannot be updated in the same way.

To use these sites for linking and interlinking, an enhancement engine (EntityhubLinkingEngine or NamedEntityTaggingEngine) must be configured in the Felix console with the identifier of the site in which to search for the entities (URIs) to link to. The configuration panels mention only Referenced Sites, but Managed Sites based on a Solr Yard should work as well (?).

The first type of engine compares the tokens in the text that reaches the engine, possibly through a chain that has a tokenizer before the linking engine, with the values of the rdfs:label property of the target RDF data indexed in the site, in order to find entities (the subject URIs of the rdfs:label property). The results of the comparison are ranked, added to the ContentItem metadata, and finally sent to the client.

The second type of linking engine (NamedEntityTagging) uses the result of a NER process, so it can only be used in a chain where a NER engine comes before it. Currently it can only be configured to work with persons, organizations and places, because only models for these entity types are available in Stanbol. The NER engines look for entities of those types in the text and are configured to use well-known URIs for the types mentioned, for example http://dbpedia.org/ontology/Person for persons.
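As a rough illustration of the two lookup styles described above, here is a toy sketch in Python. It is not Stanbol code: the in-memory SITE dictionary, the function names and the example.org URIs are all made up for illustration, and the matching is plain case-insensitive label equality rather than the Solr-backed, ranked search a real site performs.

```python
# Toy stand-in for a site: entity URI -> (rdfs:label value, rdf:type).
# Everything under example.org is hypothetical.
DBO = "http://dbpedia.org/ontology/"
SITE = {
    "http://example.org/entity/paris": ("Paris", DBO + "Place"),
    "http://example.org/entity/paris_hilton": ("Paris Hilton", DBO + "Person"),
    "http://example.org/entity/hilton": ("Hilton", DBO + "Organisation"),
}


def link_tokens(text, max_ngram=3):
    """First style: compare token n-grams from the text with every label."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    hits = []
    # Try longer token runs first, then fall back to single tokens.
    for n in range(max_ngram, 0, -1):
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            for uri, (label, _rdf_type) in SITE.items():
                if label.lower() == span.lower():
                    hits.append((span, uri))
    return hits


def link_named_entity(mention, ner_type):
    """Second style: look up one NER mention, restricted to its type."""
    return [
        uri
        for uri, (label, rdf_type) in SITE.items()
        if rdf_type == ner_type and label.lower() == mention.lower()
    ]
```

With the text "Paris Hilton visited Paris.", link_tokens reports the two-token label as well as the single tokens "Paris" and "Hilton", whereas link_named_entity("Paris Hilton", DBO + "Person") only returns the person entity.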
The result of the NER process is put in the ContentItem's metadata and used by the next engine for interlinking, which compares only the rdfs:label values of entities of those types (e.g. http://dbpedia.org/ontology/Person).

A second issue with this architecture, after the one about using Managed sites instead of Referenced ones in the linking engines' configuration panels, concerns interlinking with the RDF data extracted from documents and stored in the content graph, or in other graphs based on the Clerezza Yard. The only way to use these graphs seems to be to copy the graph and store the data in a Solr yard used by a Managed/Referenced site.

While the documentation about Managed and Referenced sites is quite good, even if it lacks some details, the same cannot be said about the Entityhub. It is not clear whether it is just an interface to all the sites (managed and referenced) or something more.

To sum up, the main points are:

1) Is it possible to use a managed site instead of a referenced one in the configuration panels of both types of linking engine?
2) What is the best way to do interlinking with RDF data in a graph within Stanbol with the current components? Only the one I mentioned, or are there other options?
3) Can anyone provide some details about the Entityhub itself (not the managed or referenced sites)?

Best Regards
Luigi

2013/5/20 Rupert Westenthaler <[email protected]>

> On Mon, May 20, 2013 at 3:07 PM, Reto Bachmann-Gmür <[email protected]>
> wrote:
> > Thanks Rupert for these clarifications.
> >
> > One thing that still isn't clear. You say that the EntityLinking engines
> > operate on a single token, while named entity tagging works on phrases.
> > What does this mean? I see that EntityLinking detects multiple word
> > entities. What are the cases EntityLinking cannot handle?
>
> Yes EntityLinking tries to match several tokens with labels of
> entities within the controlled vocabulary, but it still considers
> single tokens as a potential "match".
>
> In contrast NamedEntityLinking would not allow a link for "Peter" if
> "Peter Mustermann" was recognized as a named entity. Also, "Peter
> Mustermann jun." would only be suggested for "Peter Mustermann" in
> that case, even if the text actually mentions "Peter Mustermann
> jun."
>
> best
> Rupert
>
> >
> > Cheers,
> > Reto
> >
> >
> > On Mon, May 20, 2013 at 2:05 PM, Rupert Westenthaler <
> > [email protected]> wrote:
> >
> >> On Mon, May 20, 2013 at 12:34 PM, Reto Bachmann-Gmür <[email protected]>
> >> wrote:
> >> > Named Entity Tagging Engine: This creates entity references
> >> > exclusively for substrings identified to denote a person, people or
> >> > place by the named entity recognizer.
> >>
> >> Correct. This Engine can use type restrictions based on the types
> >> detected by NER when linking against the Vocabularies. In addition it
> >> also searches for Entities matching the "phrase" detected as Named
> >> Entities. The EntityLinking engine operates on single Tokens.
> >>
> >> >
> >> > Entityhub Linking Engine: This creates the entity references using the
> >> > results of NLP processing. Only some lexical categories are processed,
> >> > these are determined by the parameter in "Processed Languages" as well
> >> > as with the "Link ProperNouns only".
> >> >
> >>
> >> The Entityhub Linking Engine is a configuration of the
> >> EntityLinkingEngine that uses the Entityhub to search for Entities in
> >> the controlled vocabulary. It does not implement any linking
> >> functionality itself.
> >>
> >>
> >> > Keyword Linking Engine: "An engine that extracts keywords present
> >> > within a Controlled Vocabulary mentioned within parsed ContentItem".
> >> > I assumed this would just link any matching word sequences without
> >> > requiring any NLP (except word tokenization). However the config pane
> >> > says that the parameter "Min Token length" is ignored in case a POS
> >> > (Part of Speech) tagger is available for the language of the parsed
> >> > content. So is this using NLP as well?
> >> >
> >>
> >> This engine is deprecated. It's the predecessor of the Entity Linking
> >> Engine.
> >>
> >>
> >> > So these are the 3 Engines I find in the configuration. Then there's
> >> > also the EntityLinkingEngine according to
> >> >
> >> > https://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking
> >> >
> >>
> >> This implements the Entity Linking process. To use it one needs to
> >> provide implementations of the extension points (EntitySearcher and
> >> LabelTokenizer).
> >>
> >> > Confusingly https://stanbol.apache.org/docs/trunk/customvocabulary.html
> >> > distinguishes between Named Entity Linking for which it refers to the
> >> > Named Entity Tagging Engine and Keyword Linking for which it doesn't
> >> > refer to the "Keyword Linking Engine" but to "Entityhub linking
> >> > engine" (the document has some issues: STANBOL-1075).
> >>
> >> "Keyword Linking" should no longer be used. "Named Entity Linking" and
> >> "Entity Linking" are the preferred terms.
> >>
> >> You are right. The "Working with Custom Vocabularies" does have some
> >> inconsistencies in the last part. "2. Keyword Linking" should be "2.
> >> Entity Linking" and also the 2nd heading "Configuring Named Entity
> >> Linking" should read "Configuring Entity Linking" instead.
> >>
> >> best
> >> Rupert
> >>
> >>
> >> --
> >> | Rupert Westenthaler [email protected]
> >> | Bodenlehenstraße 11 ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
> >
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
>
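
The "Peter Mustermann" example from the quoted thread can be played through with a toy sketch in Python. This is only a simplified model under stated assumptions, not how the Stanbol engines are implemented: LABELS is a hypothetical three-entry vocabulary, and both function names and the "all mention tokens appear in the label" rule are invented here to mimic the described behaviour.

```python
# Hypothetical controlled vocabulary, reduced to its labels.
LABELS = ["Peter", "Peter Mustermann", "Peter Mustermann jun."]


def token_based_spans(text):
    """EntityLinking-like: any run of tokens equal to a label is a
    potential match, including a single token such as "Peter"."""
    tokens = text.split()
    spans = set()
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            span = " ".join(tokens[i:j])
            if span in LABELS:
                spans.add(span)
    return spans


def named_entity_suggestions(mention):
    """NamedEntityLinking-like: only the phrase found by NER is linked,
    and labels containing all of its tokens are suggested for it."""
    wanted = mention.lower().split()
    return [
        label
        for label in LABELS
        if all(tok in label.lower().split() for tok in wanted)
    ]
```

For the text "Peter Mustermann jun. gave a talk", token_based_spans finds all three labels as matching spans. If NER only recognized "Peter Mustermann", named_entity_suggestions("Peter Mustermann") suggests "Peter Mustermann" and "Peter Mustermann jun." for that one mention and never links "Peter" on its own.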
