... continuing ...

On Thu, Sep 25, 2014 at 2:07 PM, Maatari Daniel Okouya <okouy...@yahoo.fr> wrote:
> I’m OK with everything you said and appreciate you taking the time. I think you also properly captured my concern.
>
> I think at this point good examples of the different usage scenarios would be very useful.
>
> I think so far there are two of them here:
>
> 1.a - Using the result of the enhancer for semantic SEO.

+1, this is a typical use case for Stanbol.

> 1.b - Using the result of the enhancer for semantic search from within your own custom semantic search solution, i.e. your own triple store and the website that goes with it.

+1, also a typical use case. You send some text to Stanbol and use the results to fill fields in your Solr index.
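For illustration, a minimal sketch of that round trip in Python, assuming a default local launcher at http://localhost:8080 and the default enhancement chain (the endpoint, port and chain name may differ in your setup):

# Minimal sketch: send plain text to a local Stanbol enhancer and ask for the
# enhancement graph as Turtle. Assumes the default launcher on localhost:8080
# and the default chain; adjust the URL for your installation.
import requests

STANBOL = "http://localhost:8080"   # assumption: default launcher

text = "Lewis Hamilton not thinking about title after winning Singapore GP"
resp = requests.post(
    STANBOL + "/enhancer",
    data=text.encode("utf-8"),
    headers={
        "Content-Type": "text/plain; charset=UTF-8",
        "Accept": "text/turtle",    # any RDF serialization Stanbol offers
    },
)
resp.raise_for_status()
print(resp.text)                    # fise:TextAnnotation / fise:EntityAnnotation ... triples

The response is a plain RDF graph, so picking out entity URIs, labels and confidence values for your Solr fields is ordinary RDF processing.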
> 2 - Using the result of the enhancer for semantic search from within the Stanbol framework.

Stanbol does not provide a solid semantic search framework. The Contenthub was intended to be such a component, but its development stopped some time ago. This is also the reason why it is no longer included in the trunk. However, you can push the triples received from Stanbol to any triple store and perform SPARQL queries over those results.

> I think if you could just explain how to use the result of the enhancer in these several cases, it would make things clear.
>
> What I guess so far is that in:
>
> 1.a - You will need to construct your own triples out of the triples returned by the enhancer. In other words, building your own schema.org based description of the resource by going through all the triples returned by the enhancer and carefully selecting those you think are relevant for your description.

If the vocabulary already uses schema.org, the information dereferenced for linked entities will already use schema.org. But you are right: you will need to convert the Stanbol enhancements to a simpler structure (e.g. ex:resource schema:about {linked-entity}).

> 1.b - This is where it is tricky. From what I understood of what you said, using, let's say, schema.org + FOAF + maybe your own domain ontology (i.e. building the description with those, as in 1.a) and then using it for semantic search might not be the best choice. This leads and connects to my last point.

Ontology alignment is a tricky topic and outside the scope of Apache Stanbol. However, the Dereferencing Engines do support some features that can be used for this purpose (see [1] for more information).

> 2 - The question here is: would you say that it is better, and enough, to use the descriptions returned by the enhancer as is, without transformation, to actually perform semantic search on? If not, what information would you keep, and in which form (which triples with which vocabulary/vocabularies)?

Depends on how you want to do semantic search. If you use SPARQL for semantic search it would be possible. If you use Stanbol to build a smarter Solr index you will definitely need to transform the results. But even with SPARQL, a simplification of the structure optimized for your problem domain will be preferable, e.g. converting

  {document-uri} <-- fise:extracted-from -- {entity-annotation} -- fise:entity-reference --> {entity}

to

  {document-uri} -- schema:about --> {entity}

> That is pretty much the point I wanted to clarify. Brief concrete examples of these scenarios would really help.

Hope my comments are concrete enough.
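To make that last conversion concrete, a minimal sketch, assuming the enhancement graph was saved as Turtle (e.g. "enhancements.ttl" from a call like the one above; the file name is a placeholder, property names follow the fise enhancement structure):

# Sketch only: reduce the enhancement structure to simple schema:about
# triples, along the lines described above.
from rdflib import Graph

enh = Graph().parse("enhancements.ttl", format="turtle")

q = """
PREFIX fise:   <http://fise.iks-project.eu/ontology/>
PREFIX schema: <http://schema.org/>

CONSTRUCT { ?doc schema:about ?entity }
WHERE {
  ?ann a fise:EntityAnnotation ;
       fise:extracted-from   ?doc ;
       fise:entity-reference ?entity .
  # a FILTER on fise:confidence would be a typical refinement here
}
"""
simple = enh.query(q).graph
print(simple.serialize(format="turtle"))

The same CONSTRUCT can of course be run directly against a triple store holding the enhancement results.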
> Actually these are the two use cases that I have to deal with in my organization (a non-profit).
>
> (i) On one side we want to increase our ability to disseminate publications (work material) to the outside world (semantic SEO). We need some assistance in describing our content resources. This is where Stanbol would shine.
>
> (i)(1) Suggesting topics to tag the content. Obviously these topic entities must either be entities taken directly from authority websites such as DBpedia, NYTimes etc., or our own; in the latter case we will have to link them with owl:sameAs to external entities of our choice. Here I’m a bit uncertain as to when to use SKOS concepts or real entities, meaning knowledge graph vs. taxonomy. We could have an internal knowledge graph that we link, or the same with a taxonomy. A combination could be used as well, by separating the vocabularies and/or properties that we would use for SKOS concepts and real entities, and using foaf:focus to relate the entity and the concept.

You could model your internal categories by using SKOS and link them (e.g. with owl:sameAs or schema:sameAs or ...) with entities from external datasets. So you would have a clear hierarchy (as defined by your org) that is linked with the knowledge graph of the world.

> (i)(2) I’m not sure about the second case anymore, which is not topic assignment but simply entity linking based on their presence in the content. I’m not sure if this is still relevant, based on what you have said so far (it depends on the answer to my second point above). It sounds like entity linking is strongly related to the way the Stanbol ecosystem envisions semantic search. Maybe it could serve the purpose of tagging the text of your content in an HTML file (again, some triple transformation for the right description would be necessary IMHO). However, I doubt that it would be useful in the context of a PDF, for instance. We want to describe the resource as a whole here; it is publication material that we expose. The only way I see it being useful for a PDF is in the context of semantic search, which leads me to my last point.
>
> (ii)(1) On the other side we want to provide people with the ability to find information within our website, such that the search would be more tailored, domain specific, etc. than using Google. This is where semantic search comes into play. Now, with what I understood from reading you, I’m not sure exactly how to use the Stanbol results. I’m sure the topics can strongly help, but how about entity linking? If I understood well, no transformation of the description is needed: you keep the information as is and build your search on it, because you have a semantic description of what is inside the content and where inside of it?

Exactly, Entity Linking shines for semantic search, especially with custom vocabularies. It is also much more relevant for shorter texts (compared with 100+ page PDF files). However, when you split longer documents into shorter parts, Entity Linking becomes relevant again.

> I hope this is not too much information. However, I figured that taking the time to give the background, the intent and my specific questions could help in better understanding my actual questions.

best
Rupert

[1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entityhubdereference#advanced-dereference-configurations

> Many thanks,
>
> Maat
>
> --
> Maatari Daniel Okouya
> Sent with Airmail
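As a small illustration of the SKOS modelling suggested above (a sketch only: the example.org URIs are placeholders, and skos:exactMatch could be used instead of owl:sameAs if you want a weaker link):

# Sketch: an internal SKOS concept scheme whose concepts are linked to
# entities of an external dataset. example.org URIs are placeholders.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS, OWL

ORG = Namespace("http://example.org/yourorg/topics/")   # assumption
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
g.add((ORG["scheme"], RDF.type, SKOS.ConceptScheme))

g.add((ORG["Motorsport"], RDF.type, SKOS.Concept))
g.add((ORG["Motorsport"], SKOS.prefLabel, Literal("Motorsport", lang="en")))
g.add((ORG["Motorsport"], SKOS.inScheme, ORG["scheme"]))

g.add((ORG["FormulaOne"], RDF.type, SKOS.Concept))
g.add((ORG["FormulaOne"], SKOS.prefLabel, Literal("Formula One", lang="en")))
g.add((ORG["FormulaOne"], SKOS.broader, ORG["Motorsport"]))
g.add((ORG["FormulaOne"], SKOS.inScheme, ORG["scheme"]))
# anchor the internal concept in the public knowledge graph
g.add((ORG["FormulaOne"], OWL.sameAs, DBR["Formula_One"]))

print(g.serialize(format="turtle"))

The hierarchy stays under your organisation's control, while the sameAs links tie it to the external entities Stanbol can suggest.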
> On 25 Sep 2014 at 00:27:43, Bhoomin Pandya (bhoominpan...@gmail.com) wrote:
>
> Hi Maatari,
>
> I am trying to answer from what I gather from your question and the discussion with Rupert. I agree completely with Rupert. Please correct me if you feel I am wrong in any way.
>
> Stanbol has many amazing features and offers end-to-end enhancement, making it a complete enhancement engine with a RESTful API.
>
> Enhancer-generated entities can be used in various ways. These entities would also be very useful for keyword- and SEO-level improvements. The Enhancer/Contenthub gives you output as RDF/XML and N-Triples with enhancements for text data directly, which is a very unique feature. N-Triples help you construct the data even if subject, predicate and object are stored on different servers. This is a great feature for linking, besides SPARQL endpoints.
>
> On Stanbol, OWL files are managed using OntoNet for later consumption by reasoning services, refactorers and rule engines. OWL is knowledge representation and uses taxonomy, hierarchy etc. for classification/categorization (as Rupert mentioned); e.g. a site map could be constructed using ontologies.
>
> SKOS is knowledge organization (front-end, middleware and back-end level) at the concept level, which includes collections of concepts and the relationships between them, and it covers all controlled vocabularies. It is in fact semantic integration.
>
> Your problem seems to be linked to some deployment process, either for a site or for site SEO. To me, the basics used for content enhancement are useful for SEO too. You can use the schema.org vocabulary along with the microdata, RDFa, or JSON-LD formats to add information to your HTML content (as mentioned on their site). I feel schema.org is more for the HTML content rather than for producing triples from the content, but please check.
>
> I hope this solves your query. Please let me know your views.
>
> Many thanks
> Bhoomin Pandya
>
> On Thu, Sep 25, 2014 at 4:37 AM, Maatari Daniel Okouya <okouy...@yahoo.fr> wrote:
>> It all makes sense now, thanks.
>>
>> One thing that I do not really comprehend, although I may have an idea of how to hack it, is how you go from the standard description based on the Stanbol vocabulary to producing, let's say, a triple with the schema.org vocabulary: ex:resource schema:about dbPedia:Bob_Marley.
>>
>> I would just appreciate understanding the vision behind it. That is, how did the Stanbol team envision the best practice to produce that?
>>
>> Is it something that the application that uses Stanbol should do? Upon receiving some suggestion and being OK with the tag, the resource should be marked as such. But should the new triples go back into Stanbol, or into another store? I’m not sure.
>>
>> I’m not sure I properly understand how the tagging of the enhancer is used. Is it meant for developers to then build an appropriate description based on it? Because if one wants to optimise one's descriptions for search engine optimisation, the Stanbol descriptions are invisible to Google, for instance.
>>
>> Could anyone help clarify the main idea here?
>>
>> Many thanks,
>>
>> M
>>
>> --
>> Maatari Daniel Okouya
>> Sent with Airmail
>>
>> On 23 Sep 2014 at 08:09:27, Rupert Westenthaler (rupert.westentha...@gmail.com) wrote:
>>
>> Hi Maatari
>>
>> Not sure if I fully understand your questions ...
>>
>> ad (1), (2):
>>
>> * Entity Linking does use "surface forms" to detect mentions of those. "Surface forms" are the strings used to refer to an Entity within a text. So typically the labels of an Entity are used as "surface forms" for linking.
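As a small illustration of that point (not Stanbol code, just a sketch): every label you give an entity in a custom vocabulary is a potential surface form. Which label property the linking engine uses is configurable; rdfs:label is shown here as a common choice, and the example.org namespace is a placeholder.

# Illustration only: an entity in a custom vocabulary whose labels act as
# "surface forms" for Entity Linking.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, FOAF

EX = Namespace("http://example.org/yourorg/people/")   # assumption

g = Graph()
g.add((EX["LewisHamilton"], RDF.type, FOAF.Person))
# each label is a potential surface form the engine can match in a text
g.add((EX["LewisHamilton"], RDFS.label, Literal("Lewis Hamilton", lang="en")))
g.add((EX["LewisHamilton"], RDFS.label, Literal("Lewis Carl Davidson Hamilton", lang="en")))
g.add((EX["LewisHamilton"], RDFS.label, Literal("Hamilton", lang="en")))

print(g.serialize(format="turtle"))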
>> * Named Entity Recognition is most often done with Machine Learning; however, some rule-based systems are also in use. In the case of Machine Learning you need to provide a training set. So if you want to detect Entities of a specific type you will need to provide a training set. Annotate ~1000 occurrences and you will start to get a usable model.
>>
>> * For Categorization it is the same. The classifiers used for this task also require training data. You will need to manually classify documents for your categories. In this case think about ~40 documents per category.
>>
>> ad (3): Sorry, I do not understand your question. Just let me answer to
>>
>>> Can the enhancement indeed categorise according to non-SKOS instances that are in an external dataset?
>>
>> The TopicAnnotationEngine [1] in Stanbol does not require SKOS. You can also define concepts by names (see page 7). SKOS is supported (see page 8) but not required. The critical thing is not to define the concepts but to provide the training data ^^
>>
>> best
>> Rupert
>>
>> [1] http://stanbol.apache.org/presentations/Topic-Classification.pdf
>>
>> On Mon, Sep 22, 2014 at 4:37 PM, Maatari Daniel Okouya <okouy...@yahoo.fr> wrote:
>>> I understand better.
>>>
>>> I think the key sentence here was: "Important is that Entity Linking requires an actual mention of the Entity in the text while categories do not depend on such mentions."
>>>
>>> - So basically, whether the category is based on a SKOS dataset or not does not matter at all!
>>>
>>> - In both cases they link to a dataset; it does not matter if it is SKOS based or not. The difference is how the entity to which we link comes up.
>>>
>>> A few questions here, if you don’t mind. I’m not trying to re-implement things here, but simply to better understand things so I can use the tool properly.
>>>
>>> 1) How would the information of a specific category set be fetched? The process of linking in categorisation must be different, in that you do not have the type to guide you. You may well end up with synonyms; without the type, errors would occur. I can see why using a controlled vocabulary would be easier: there, the disambiguation is within the label directly. Would you confirm my assumption here, that categorisation with a SKOS-based dataset (thesaurus) is easier?
>>>
>>> 2) Is the reason for Named Entity Recognition limiting itself to these three specific types their "pertinence"? Also, would these types be customisable, meaning could you have a few more types?
>>>
>>> 3) What I want to achieve is describing some content resources according to schema.org. CreativeWork has the property "schema:about", which must point to a "schema:Thing". I presume that by that, Google is expecting something other than a controlled concept here. I’m not saying that it is not possible. In the same way, with foaf:topic, which I would also use, I want to point to the real thing rather than a controlled-vocabulary concept. I would rather use dc:subject for the skos:Concept. Does it make sense? Can the enhancement indeed categorise according to non-SKOS instances that are in an external dataset?
>>>
>>> Many thanks,
>>>
>>> Maatari
>>>
>>> --
>>> Maatari Daniel Okouya
>>> Sent with Airmail
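Regarding question 3 above, a minimal sketch of how one content resource could carry both statements (placeholder URIs; this is one possible modelling, not something Stanbol prescribes):

# Sketch of the pattern from question 3: dc:subject for the internal
# skos:Concept, schema:about for the "real world" entity, and foaf:focus to
# connect the two. All example.org URIs are placeholders.
from rdflib import Graph, Namespace

SCHEMA = Namespace("http://schema.org/")
FOAF   = Namespace("http://xmlns.com/foaf/0.1/")
DCT    = Namespace("http://purl.org/dc/terms/")
EX     = Namespace("http://example.org/yourorg/")          # assumption
DBR    = Namespace("http://dbpedia.org/resource/")

g = Graph()
doc     = EX["publications/some-report"]                   # placeholder document URI
concept = EX["topics/FormulaOne"]                          # your skos:Concept

g.add((doc, DCT["subject"], concept))                      # controlled vocabulary term
g.add((doc, SCHEMA["about"], DBR["Formula_One"]))          # the real-world thing
g.add((concept, FOAF["focus"], DBR["Formula_One"]))        # concept <-> entity bridge

print(g.serialize(format="turtle"))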
>>> On 22 Sep 2014 at 06:49:14, Rupert Westenthaler (rupert.westentha...@gmail.com) wrote:
>>>
>>> Hi Maatari,
>>>
>>> On Mon, Sep 22, 2014 at 8:22 AM, Maatari Daniel Okouya <okouy...@yahoo.fr> wrote:
>>>> I’m a bit confused about a few concepts. Could someone clarify them a bit?
>>>>
>>>> When it comes to assigning some topics to a content resource, what would be the difference between entity linking and categorization?
>>>
>>> First let's explain the terminology as used by Stanbol. For that I will use one of today's headlines:
>>>
>>> "Lewis Hamilton not thinking about title after winning Singapore GP"
>>>
>>> Named Entity Recognition: detects mentions of Entity types within the text, typically Persons, Organizations and Locations.
>>> * Lewis Hamilton -> person
>>> * Singapore -> location
>>>
>>> Entity Linking: detects mentions of known Entities within the processed text.
>>> * Lewis Hamilton -> http://en.wikipedia.org/wiki/Lewis_Hamilton
>>> * Singapore Grand Prix -> http://en.wikipedia.org/wiki/Singapore_Grand_Prix
>>>
>>> Categorization: assigns the content to a fixed set of categories. Categories might be hierarchical. A typical example are the IPTC Media Topics [1], which I will use for this example.
>>> * sport -> http://cv.iptc.org/newscodes/mediatopic/15000000
>>> * Formula One -> http://cv.iptc.org/newscodes/mediatopic/20000994
>>>
>>> Important is that Entity Linking requires an actual mention of the Entity in the text, while categories do not depend on such mentions.
>>>
>>>> What I see as of now within some well-established tools is the classification part. Usually it makes use of a controlled vocabulary to classify the content. Output = resource dc:subject controlledVocabularyTerm.
>>>>
>>>> However, what I also see in the description of content resources online within some authority websites is linking the document to external non-SKOS resources via, for instance, foaf:topic.
>>>>
>>>> In that second case, do we have both an entity linking and a classification? Or are both the same, and it is just the knowledge base that changes, from an external source to a controlled vocabulary? That would mean that in the world of linked data, content classification/categorization includes entity linking. In that case I would say that the same was happening when linking to a controlled vocabulary term.
>>>
>>> IMO the properties used to represent analysis results do not necessarily indicate whether the results express linked entities or categorizations. Based on their definitions, both dc:subject and foaf:topic should be used for categories.
>>>
>>>> I'm a little confused here. If someone could clarify these notions I would appreciate it.
>>>
>>> hope this helps
>>> best
>>> Rupert
>>>
>>> [1] http://cv.iptc.org/newscodes/mediatopic

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO
..........................................................................
| http://redlink.co/
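To tie the terminology above back to what the enhancer actually returns, a small sketch of reading an enhancement graph and grouping results by annotation type. Property names follow the fise enhancement structure; which annotations appear depends on the engines in your chain, so treat this as an approximation, and "enhancements.ttl" is a placeholder file name.

# Sketch: group enhancement results by annotation type
# (fise:TextAnnotation ~ NER, fise:EntityAnnotation ~ Entity Linking,
#  fise:TopicAnnotation ~ categorization).
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

FISE = Namespace("http://fise.iks-project.eu/ontology/")
DCT = Namespace("http://purl.org/dc/terms/")

g = Graph().parse("enhancements.ttl", format="turtle")

# Named Entity Recognition: mentions of entity *types* in the text
for ann in g.subjects(RDF.type, FISE["TextAnnotation"]):
    print("mention:", g.value(ann, FISE["selected-text"]),
          "| type:", g.value(ann, DCT["type"]))

# Entity Linking: mentions of *known* entities
for ann in g.subjects(RDF.type, FISE["EntityAnnotation"]):
    print("entity:", g.value(ann, FISE["entity-reference"]),
          "| label:", g.value(ann, FISE["entity-label"]))

# Categorization: topics assigned to the content as a whole
for ann in g.subjects(RDF.type, FISE["TopicAnnotation"]):
    print("topic:", g.value(ann, FISE["entity-reference"]),
          "| label:", g.value(ann, FISE["entity-label"]))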