Re: entity linking vs classification

Maatari Daniel Okouya Thu, 25 Sep 2014 05:08:53 -0700

I’m ok with everything you said and appreciate you taking the time. I think you 
also properly capture my concern.

I think at this point, good examples of the different usage scenarios would be 
very useful. 

I think so far there are two of them here: 

1.a- Using the result of the enhancer for semantic SEO. 

1.b - Using the result of the enhancer for semantic search from within your own 
custom semantic search solution i.e. your own triple store and the website that 
goes with it.

2 - Using the result of the enhancer for semantic search from within the 
Stanbol framework. 

I think if you could just explain how to use the result of the enhancer in 
each of these cases, it would make things clear. 

What I guess so far is that in: 

1.a - You will need to construct your own triples out of the triples returned 
by the enhancer. In other words, you build your own schema.org-based 
description of the resource by going through all the triples returned by the 
enhancer and carefully selecting those you think are relevant for your 
description.

1.b - This is where it is tricky. From what I understood of what you said, 
using, let's say, Schema.org + FOAF + maybe your own domain ontology (i.e. 
building the description with those as in 1.a) and then using it for semantic 
search might not be the best choice. This leads and connects to my last point.

2 - The question here is: would you say that it is better, and sufficient, to 
use the descriptions returned by the enhancer as is, without transformation, 
to actually perform semantic search on? If not, what information would you 
keep and in which form (which triples, with which vocabulary or vocabularies)?

That is pretty much the point I wanted to clarify. Brief concrete examples of 
these scenarios would really help. 
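
To make scenario 1.a a bit more concrete, here is a minimal sketch in Python of 
what "selecting the relevant triples" could look like. It is only an 
illustration under assumptions: the enhancer output is represented as plain 
(subject, predicate, object) tuples written by hand, the fise:EntityAnnotation, 
fise:entity-reference and fise:confidence names are the Stanbol enhancement 
terms as I understand them, and to_schema_about, the 0.5 threshold and the 
example.org URI are made up for the example.

```python
# Sketch only: the triples below are hand-written stand-ins for enhancer output.
FISE = "http://fise.iks-project.eu/ontology/"  # assumed enhancement namespace
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_schema_about(triples, resource_uri, min_confidence=0.5):
    """Keep confident entity annotations and re-express them as schema:about."""
    # Group predicate/object pairs by enhancement subject.
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {}).setdefault(p, []).append(o)

    result = []
    for props in by_subject.values():
        if FISE + "EntityAnnotation" not in props.get(RDF_TYPE, []):
            continue  # not an entity suggestion
        confidence = float(props.get(FISE + "confidence", ["0"])[0])
        if confidence < min_confidence:
            continue  # hypothetical cut-off; a human review step could go here
        for entity in props.get(FISE + "entity-reference", []):
            triple = (resource_uri, SCHEMA + "about", entity)
            if triple not in result:
                result.append(triple)
    return result


enhancements = [
    ("urn:enhancement:1", RDF_TYPE, FISE + "EntityAnnotation"),
    ("urn:enhancement:1", FISE + "entity-reference",
     "http://dbpedia.org/resource/Bob_Marley"),
    ("urn:enhancement:1", FISE + "confidence", "0.92"),
    ("urn:enhancement:2", RDF_TYPE, FISE + "EntityAnnotation"),
    ("urn:enhancement:2", FISE + "entity-reference",
     "http://dbpedia.org/resource/Bob_Dylan"),
    ("urn:enhancement:2", FISE + "confidence", "0.12"),
]

about = to_schema_about(enhancements, "http://example.org/resource")
for s, p, o in about:
    print(f"<{s}> <{p}> <{o}> .")  # N-Triples style output
```

Only the high-confidence suggestion survives here; where the resulting triples 
are stored (your own triple store, the page markup) is left open on purpose.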

Actually these are the two use cases that I have to deal with in my 
organization (non-profit). 

(i) On one side we want to increase our ability to disseminate publications 
(work material) to the outside world (Semantic SEO). For that we need some 
assistance in describing our content resources. This is where Stanbol would 
shine. 

(i)(1) Suggesting topics to tag the content. Obviously these topic entities 
must either be entities taken directly from authority websites such as 
DBpedia, NYTimes, etc., or our own; in the latter case we will have to link 
them with owl:sameAs to external entities of our choice. Here I’m a bit 
uncertain as to when to use SKOS concepts or real entities, meaning knowledge 
graph vs. taxonomy. We could have an internal knowledge graph that we link, or 
do the same with a taxonomy. A combination could be used as well, by 
separating the vocabulary and/or properties that we would use for SKOS 
concepts and real entities, and using foaf:focus to relate the concept to the 
entity.
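
As a rough illustration of that combination, the sketch below writes out the 
three links in N-Triples form: a SKOS concept, a foaf:focus link from the 
concept to our own entity, and an owl:sameAs link from that entity to an 
external one. The example.org namespace, the describe_topic helper and the 
slug scheme are hypothetical; only the SKOS, FOAF and OWL terms are real 
vocabulary.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
FOAF = "http://xmlns.com/foaf/0.1/"
OWL = "http://www.w3.org/2002/07/owl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/"  # hypothetical in-house namespace


def nt(s, p, o):
    """Format one triple of URIs as an N-Triples line."""
    return f"<{s}> <{p}> <{o}> ."


def describe_topic(slug, external_uri):
    """Relate an in-house concept, the in-house entity, and an external entity."""
    concept = BASE + "concept/" + slug  # taxonomy side (SKOS)
    entity = BASE + "entity/" + slug    # knowledge-graph side (the real thing)
    return [
        nt(concept, RDF_TYPE, SKOS + "Concept"),
        nt(concept, FOAF + "focus", entity),       # concept -> thing it is about
        nt(entity, OWL + "sameAs", external_uri),  # our entity -> authority entity
    ]


lines = describe_topic("bob-marley", "http://dbpedia.org/resource/Bob_Marley")
print("\n".join(lines))
```

With this split, dc:subject could point at the concept while schema:about or 
foaf:topic points at the entity, which is the separation I had in mind.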

(i)(2) I’m not sure of the second case anymore, which is not topic assignment 
but simply entity linking based on the entities' presence in the content. I’m 
not sure if this would still be relevant, based on what you guys have said so 
far (it depends on the answer to my second point above). It sounds like entity 
linking is strongly related to the way the Stanbol ecosystem envisions 
semantic search. 
Maybe it could serve the purpose of tagging the text of your content in an HTML 
file (again some triple transformation for the right description would be 
necessary IMHO). However I doubt that it would be useful in the context of a 
PDF, for instance. We want to describe the resource as a whole here. It is 
publication material that we expose. The only way I see it being useful for a 
PDF is in the context of semantic search, which leads me to my last point.

(ii)(1) On the other side we want to provide people with the ability to find 
information within our website such that the search would be more tailored, 
domain specific, etc. than using Google. This is where semantic search comes 
into play. Now, with what I understood from reading your answers, I’m not sure 
exactly how to use the Stanbol results. I’m sure the topics can strongly help. 
However, how about entity linking? If I understood well, no transformation of 
the description is needed. You keep the information as is and you build your 
search on it, because you have a semantic description of what is inside the 
content and where inside of it?
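
To check my understanding of "what is inside the content and where", here is a 
toy sketch of a search index built directly on entity mentions. It does not use 
Stanbol at all: the documents, the label-to-URI table and the helper names are 
invented, and a plain string search stands in for the entity linking step, so 
it only illustrates the shape of the data, not how Stanbol does it.

```python
from collections import defaultdict

# Hypothetical label -> entity URI table standing in for a linked dataset.
KNOWN_ENTITIES = {
    "Lewis Hamilton": "http://dbpedia.org/resource/Lewis_Hamilton",
    "Singapore": "http://dbpedia.org/resource/Singapore",
}

DOCUMENTS = {
    "doc1": "Lewis Hamilton not thinking about title after winning Singapore GP",
    "doc2": "Singapore announces new public transport plan",
}


def annotate(doc_id, text):
    """Return (doc, entity, start, end) mentions found by naive label matching."""
    mentions = []
    for label, uri in KNOWN_ENTITIES.items():
        start = text.find(label)
        if start != -1:
            mentions.append((doc_id, uri, start, start + len(label)))
    return mentions


def build_entity_index(documents):
    """Map each entity URI to the places (doc, start, end) where it is mentioned."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for doc, uri, start, end in annotate(doc_id, text):
            index[uri].append((doc, start, end))
    return index


def search(index, entity_uri):
    """Return the sorted ids of the documents mentioning the entity."""
    return sorted({doc for doc, _, _ in index.get(entity_uri, [])})


index = build_entity_index(DOCUMENTS)
print(search(index, "http://dbpedia.org/resource/Singapore"))
print(search(index, "http://dbpedia.org/resource/Lewis_Hamilton"))
```

The point is that the query is made against entity URIs rather than strings, 
and the offsets are kept so that a hit can be highlighted inside the content.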

I hope this is not too much information. However I figured that taking the 
time to give the background, the intent and my specific questions could help 
in better understanding my actual questions. 

Many thanks,

Maat

-- 
Maatari Daniel Okouya
Sent with Airmail

On 25 Sep 2014 at 00:27:43, Bhoomin Pandya ([email protected]) wrote:

Hi Maatari,  

I am trying to answer from what I gather from your question and  
discussion with Rupert. I agree completely with Rupert. Please,  
correct me if you feel I am wrong in any way.  

Stanbol has many amazing features and offers end-to-end enhancement,  
making it a complete enhancement engine with a RESTful API.  

Enhancer-generated entities could be used in various ways. These  
entities would also be very useful for keyword- and SEO-level  
improvements. Enhancer/Contenthub gives you RDF/XML and N-Triples  
output with enhancements for text data directly, which is a very  
unique feature. N-Triples help you construct the data even if subject,  
predicate and object are stored on different servers. This is a great  
feature for linking besides SPARQL endpoints.  
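
For illustration, a call to the RESTful enhancer could be prepared roughly as 
below. This is a sketch under assumptions: it presumes a local Stanbol launcher 
exposing the enhancer at http://localhost:8080/enhancer, build_enhancer_request 
is a made-up helper, and the request is only built, not sent, so nothing here 
depends on a running server.

```python
import urllib.request


def build_enhancer_request(text,
                           endpoint="http://localhost:8080/enhancer",  # assumed local default
                           accept="application/rdf+xml"):
    """Prepare (but do not send) a POST of plain text to the enhancer."""
    return urllib.request.Request(
        endpoint,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": accept,  # RDF serialization wanted back
        },
        method="POST",
    )


request = build_enhancer_request(
    "Lewis Hamilton not thinking about title after winning Singapore GP")
print(request.get_method(), request.full_url)

# Sending it needs a running Stanbol instance, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     rdf = response.read().decode("utf-8")
```

Which serializations are accepted depends on the installation, so the Accept 
value is best checked against your own instance.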

On Stanbol, OWL files are managed using OntoNet, to later be  
consumed by reasoning services, refactorers and rule engines. OWL is  
knowledge representation and uses taxonomy, hierarchy, etc. for  
classification/categorization (as Rupert mentioned). E.g. a site map  
could be constructed using ontologies.  

SKOS is knowledge organization (front-end, middleware and back-end  
level) at the concept level, which includes collections of concepts and  
the relationships between them, and it covers all controlled  
vocabularies. It is in fact semantic integration.  

Your problem seems to be linked to some deployment process, either for  
a site or for site SEO. To me the basics used for content enhancement  
are also useful for SEO. You can use the schema.org vocabulary along  
with the Microdata, RDFa, or JSON-LD formats to add information to  
your HTML content (as mentioned on their site). I feel that  
schema.org is more for marking up HTML content than for producing  
triples from the content; please check this.  
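
As a small sketch of the JSON-LD option, the snippet below builds a schema.org 
description of a page and wraps it in a script tag that could be placed in the 
HTML. The page URL, the title and the helper names are invented for the 
example, and the choice of CreativeWork with an "about" list is an assumption 
based on this thread, not something schema.org prescribes for every case.

```python
import json


def build_jsonld(page_url, title, about_uris):
    """Describe a page as a schema.org CreativeWork about the given entities."""
    return {
        "@context": "https://schema.org",
        "@type": "CreativeWork",
        "@id": page_url,
        "name": title,
        "about": [{"@type": "Thing", "@id": uri} for uri in about_uris],
    }


def to_script_tag(data):
    """Wrap the JSON-LD in the script element used to embed it in HTML."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


description = build_jsonld(
    "http://example.org/publications/reggae-history",  # hypothetical page
    "A short history of reggae",
    ["http://dbpedia.org/resource/Bob_Marley"],
)
print(to_script_tag(description))
```

The entity URIs in "about" could be the ones suggested by the enhancer once a 
person has accepted them.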

I hope this solves your query. Please, let me know your views.  

Many Thanks  
Bhoomin Pandya  

On Thu, Sep 25, 2014 at 4:37 AM, Maatari Daniel Okouya  
<[email protected]> wrote:  
> It all makes sense now, thanks.  
>  
> One thing that I do not really comprehend, although I may have an idea of how 
> to hack it, is how you go from the standard description based on the 
> Stanbol vocabulary to producing, let's say, a triple with the schema.org 
> vocabulary: ex:resource schema:about dbPedia:Bob_Marley.  
>  
> I would just appreciate understanding the vision behind it. That is, how did 
> the Stanbol team envision the best practice to produce that?  
>  
> Is it something that the application that uses Stanbol should do? Upon 
> receiving some suggestions and being OK with a tag, the resource should be 
> marked as such. But should the new triples go back into Stanbol, or into 
> another store? I’m not sure.  
>  
> I’m not sure I properly understand how the tagging of the enhancer is used. 
> Is it intended for developers to then build an appropriate description based 
> on it? Because if one wants to optimise a description for search engine 
> optimisation, then the Stanbol descriptions are invisible to Google, for 
> instance.  
>  
>  
> Could anyone help to clarify the main idea here?  
>  
> Many thanks,  
>  
> M  
>  
>  
> --  
> Maatari Daniel Okouya  
> Sent with Airmail  
>  
> On 23 Sep 2014 at 08:09:27, Rupert Westenthaler 
> ([email protected]) wrote:  
>  
> Hi Maatari  
>  
> Not sure if I fully understand your questions ...  
>  
> ad (1), (2):  
>  
> * Entity Linking does use "surface forms" to detect mentions of those.  
> "Surface Forms" are the strings used to refer to an Entity within a  
> text. So typically the labels of an Entity are used as "Surface Forms"  
> for linking.  
> * Named Entity Recognition is most often done with Machine Learning.  
> However also some rule based systems are in use. In case of Machine  
> Learning you need to provide a training set. So if you want to detect  
> Entities of a specific type you will need to provide a training set.  
> Annotate ~1000 occurrences and you will start to get a usable  
> model.  
> * For Categorization it is the same. The classifiers used for this  
> task also require training data. You will need to manually classify  
> documents for your categories. In this case think about ~40 documents  
> per category.  
>  
> ad (3): Sorry I do not understand your question. Just let me answer to  
>  
>> Can the enhancement indeed categorise according to non-SKOS instances that 
>> are in an external dataset?  
>  
> The TopicAnnotationEngine [1] in Stanbol does not require SKOS. You  
> can also define concepts by names (see page 7). SKOS is supported (see  
> page 8) but not required. The critical thing is not to define the  
> concepts but to provide the training data ^^  
>  
> best  
> Rupert  
>  
> [1] http://stanbol.apache.org/presentations/Topic-Classification.pdf  
>  
>  
> On Mon, Sep 22, 2014 at 4:37 PM, Maatari Daniel Okouya  
> <[email protected]> wrote:  
>> I understand better.  
>>  
>> I think the key sentence here was: “Important is that Entity Linking 
>> requires an actual mention of the  
>> Entity in the text while categories do not depend on such mentions. "  
>>  
>>  
>> - So basically, whether the category is based on a SKOS dataset or not 
>> does not matter at all!  
>>  
>> - In both cases they link to a dataset; it does not matter if it is SKOS 
>> based or not. The difference is how the entity to which we link comes up.  
>>  
>>  
>>  
>> Few questions here if you don’t mind. I’m not trying to reemployment things 
>> here, but simply to better understand things so i can use the tool properly. 
>>  
>>  
>>  
>> 1) How would the information of a specific category set be fetched? The 
>> process of linking in categorisation must be different, in that you do not 
>> have the type to guide you. You may well end up with synonyms; without the 
>> type, errors would occur. I can see why using a controlled vocabulary would 
>> be easier. There, the disambiguation is within the label directly.  
>> Would you confirm my assumption here? That categorisation with a SKOS-based 
>> dataset (thesaurus) is easier?  
>>  
>> 2) Is the reason for the Named Entity Recognition to limit itself to these 
>> three specific types “Pertinence”? Also, would these types be customisable, 
>> meaning could you have a few more types?  
>>  
>>  
>>  
>> 3) What I want to achieve is describing some content resource according to 
>> schema.org. For CreativeWork, it has the property “schema:about” which must 
>> point to a “schema:Thing”. I presume by that, Google is expecting here 
>> something other than a controlled concept. I’m not saying that it is not 
>> possible. In the same way, with foaf:topic, which I would also use, I want 
>> to point to the real thing rather than a controlled vocabulary concept. I 
>> would rather use dc:subject for the skos:Concept. Does it make sense? Can 
>> the enhancement indeed categorise according to non-SKOS instances that are 
>> in an external dataset?  
>>  
>>  
>> Many thanks,  
>>  
>> Maatari  
>>  
>>  
>>  
>> --  
>> Maatari Daniel Okouya  
>> Sent with Airmail  
>>  
>> On 22 Sep 2014 at 06:49:14, Rupert Westenthaler 
>> ([email protected]) wrote:  
>>  
>> Hi Maatari,  
>>  
>> On Mon, Sep 22, 2014 at 8:22 AM, Maatari Daniel Okouya  
>> <[email protected]> wrote:  
>>> I’m a bit confused about a few concepts. Could someone clarify them a bit?  
>>>  
>>>  
>>> When it comes to assigning some topics to a content resource, what would be 
>>> the difference between entity linking and categorization?  
>>>  
>>  
>> First let's explain the terminology as used by Stanbol. For that I will  
>> use one of today's headlines:  
>>  
>> "Lewis Hamilton not thinking about title after winning Singapore GP"  
>>  
>> Named Entity Recognition: Detects mentions of Entity types within the  
>> text. Typically Persons, Organizations and Locations  
>> * Lewis Hamilton -> person  
>> * Singapore -> location  
>>  
>> Entity Linking: Detects mentions of known Entities within the processed Text 
>>  
>> * Lewis Hamilton -> http://en.wikipedia.org/wiki/Lewis_Hamilton  
>> * Singapore Grand Prix -> http://en.wikipedia.org/wiki/Singapore_Grand_Prix  
>>  
>> Categorization: Assigns the content to a fixed set of categories.  
>> Categories might be hierarchical. A typical example are the IPTC Media  
>> Topics [1] which I will use for this example.  
>> * sport -> http://cv.iptc.org/newscodes/mediatopic/15000000  
>> * Formula One -> http://cv.iptc.org/newscodes/mediatopic/20000994  
>>  
>> Important is that Entity Linking requires an actual mention of the  
>> Entity in the text while categories do not depend on such mentions.  
>>  
>>> What I see as of now, within some well-established tools, is the 
>>> classification part. Usually it makes use of a controlled vocabulary to 
>>> classify the content. Output = resource dc:subject controlledVocabularyTerm  
>>>  
>>> However, what I also see in the description of content resources online 
>>> within some authority websites is linking the document to external non-SKOS 
>>> resources via, for instance, foaf:topic.  
>>>  
>>> In that second case, do we have both an entity linking and a classification? 
>>> Or is it that both are the same, and it is just that the knowledge base 
>>> changes, from external source to controlled vocabulary? That would mean 
>>> that in the world of linked data, content classification / categorization 
>>> includes entity linking. In that case I would say that the same was 
>>> happening when linking to a controlled vocabulary term.  
>>>  
>>  
>> IMO the properties used to represent analysis results do not  
>> necessarily indicate if the results express linked entities or  
>> categorizations. Based on their definitions, both dc:subject and  
>> foaf:topic should be used for categories.  
>>  
>>>  
>>> I'm a little confused here. If someone could clarify these notions I would 
>>> appreciate it.  
>>  
>> hope this helps  
>> best  
>> Rupert  
>>  
>> [1] http://cv.iptc.org/newscodes/mediatopic  
>>  
>> --  
>> | Rupert Westenthaler [email protected]  
>> | Bodenlehenstraße 11 ++43-699-11108907  
>> | A-5500 Bischofshofen  
>> | REDLINK.CO 
>> ..........................................................................  
>> | http://redlink.co/  
>  
>  
>  
