I found that the statistical endpoint actually gave me even more noise.
Your comments about types confused me further. My real question is this:
what do I put into the 'types' parameter?
An example result I don't want to see:
{
  "@URI": "http://dbpedia.org/resource/Hypertext_Transfer_Protocol",
  "@offset": "153",
  "@percentageOfSecondRank": "-1.0",
  "@similarityScore": "0.05355825275182724",
  "@support": "248",
  "@surfaceForm": "http",
  "@types": "Freebase:/internet/protocol,Freebase:/internet,Freebase:/computer/internet_protocol,Freebase:/computer,Freebase:/internet/api"
}
surfaceForm: http
URI: http://dbpedia.org/resource/Hypertext_Transfer_Protocol
Would I blacklist "Freebase:/internet" and "Freebase:/computer"? Those
don't appear in the ontology pointed to by the documentation.
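One way to get this effect without relying on the 'types' parameter at all is to filter the results after they come back. Below is a minimal sketch in Python, assuming each result carries a comma-separated "@types" string like the one shown above. The helper name filter_by_type_prefix and the second sample record are hypothetical and only there for illustration; this is not part of Spotlight.

```python
import json


def filter_by_type_prefix(resources, blacklist):
    """Hypothetical client-side filter (not a Spotlight API).

    Drops every annotation that has at least one type starting with
    one of the blacklisted prefixes.
    """
    kept = []
    for res in resources:
        # "@types" is a comma-separated string; it may be empty or missing.
        types = [t for t in res.get("@types", "").split(",") if t]
        if any(t.startswith(prefix) for t in types for prefix in blacklist):
            continue
        kept.append(res)
    return kept


# The first record is abridged from the result above; the second one is
# made-up sample data, included only to show a record that survives.
sample = json.loads("""[
  {"@URI": "http://dbpedia.org/resource/Hypertext_Transfer_Protocol",
   "@surfaceForm": "http",
   "@types": "Freebase:/internet/protocol,Freebase:/internet,Freebase:/computer"},
  {"@URI": "http://example.org/resource/Some_Organisation",
   "@surfaceForm": "Congress",
   "@types": "DBpedia:Organisation"}
]""")

kept = filter_by_type_prefix(sample, ["Freebase:/internet", "Freebase:/computer"])
print([r["@surfaceForm"] for r in kept])
```

Matching on prefixes means that blacklisting "Freebase:/internet" also covers subtypes such as "Freebase:/internet/protocol", so the blacklist can stay short. The same loop could be inverted into a whitelist by keeping only records that match.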
Betsey Benagh
Boston Fusion Corp.
1 Van de Graaff Drive, Ste 107
Burlington, MA 01803-5176
[email protected]
617-583-5730 x106 (office)
781-367-6720 (mobile)
On Fri, Aug 15, 2014 at 10:15 AM, David Przybilla <[email protected]>
wrote:
> Hi Betsey,
>
> One thing to take into account is that currently the "confidence" value
> passed as a parameter affects both the spotter and the disambiguation. So
> whenever you set the threshold too low, you will spot a lot of irrelevant
> things, which then get passed to the disambiguator which will also have a
> low filter threshold.
>
> There are a few other reasons and open issues (check GitHub) why you might
> not be getting interesting entities at high confidence values. For example,
> using your sample text, you might not get the entity "Barack Obama" from
> the surface form "Obama" because of the discount mechanism used when
> generating the models: since "Obama" is a sub-sequence of other surface
> forms such as "Barack Obama", it may have been discounted enough to end up
> with a very low probability.
>
> I encourage you to give Spotlight 0.6 (the statistical version) a try.
> Depending on your use case you might get less noise, but you might need
> more memory and processing power.
> The models and jar are available here [1].
>
> I don't know much about querying DBpedia or how DBpedia is structured, but
> given a dbpedia_id you can match it to a freebase_id and ask Freebase
> whether the entity has the type '/people/person' or '/time/event'.
> For the types you mention you should have no problems, since Freebase
> has very good coverage of people, events, and locations.
>
> [1] http://spotlight.sztaki.hu/downloads/version-0.1/
>
> On Fri, Aug 15, 2014 at 2:29 PM, Betsey Benagh <
> [email protected]> wrote:
>
>> Thanks to everyone for the help yesterday with the statistical endpoint.
>>
>> I'm trying to understand how to tune the tool to get optimal results.
>>
>> When I used the example text in the demo interface -
>>
>> President Obama called Wednesday on Congress to extend a tax break
>> for students included in last year's economic stimulus package, arguing
>> that the policy provides more generous assistance.
>>
>> A confidence of 0.5 only picked up 'Congress'. Reducing the confidence
>> to 0.3 picked up a lot more stuff - including linking 'Wednesday' to a
>> sports team, which seems bizarre to me.
>>
>> On my own data, which comes from Twitter, I see weird things like
>> mentions of 'police' linking to the musical group The Police, and the word
>> 'celebrate' (in the context of celebrating an anniversary) linking to the
>> Madonna song. If I turn the confidence up, I lose those references, but I
>> also lose 'good' references as well.
>>
>> I feel like whitelisting or blacklisting is the way to go, but I'm having
>> trouble correlating the types I see in my results with the DBpedia ontology
>> (http://mappings.dbpedia.org/server/ontology/classes/). That ontology
>> particularly confuses me, as it seems very uneven. For example, under
>> 'Organization', there are classes that make sense to me, like 'Company' and
>> 'Sports League', and then oddly specific things like 'Comedy Group' and
>> 'Samba School' at the same level. In my results, there is a mix of types
>> from DBpedia, Schema, and Freebase, and it's not clear to me how I would
>> specify (for example) that I'm interested in people, places, and events,
>> but not musical groups, internet concepts (it always picks up 'http' from
>> embedded links and gives me 'Hypertext Transfer Protocol'), etc.
>>
>> Thanks!
>>
>> Betsey Benagh
>>
>> Boston Fusion Corp.
>> 1 Van de Graaff Drive, Ste 107
>> Burlington, MA 01803-5176
>> [email protected]
>> 617-583-5730 x106 (office)
>> 781-367-6720 (mobile)
>>
>> ------------------------------------------------------------------------------
>>
>> _______________________________________________
>> Dbp-spotlight-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
>>
>>
>