Hi Abhishek, Thiago,
please also note that
http://spotlight.dbpedia.org/rest/annotate
does not run the current statistical version of Spotlight but the old
Lucene version. You can check the current statistical version via the demo
[1] or the endpoint URL at the bottom of that page.
We should change that but I don't have access to that server, unfortunately.
Best,
Jo
[1] http://dbpedia-spotlight.github.io/demo/
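
To compare the two deployments on the same input, it can help to build the request URLs programmatically. Below is a minimal sketch, not an official client: the helper name spotlight_url is made up for illustration, LUCENE_BASE is the old server mentioned above, and the base URL of the statistical endpoint is deliberately not hard-coded (take it from the bottom of the demo page [1]). The "text" and "confidence" query parameters are the ones I am assuming here.

```python
from urllib.parse import urlencode

# Old Lucene-based deployment mentioned above.
LUCENE_BASE = "http://spotlight.dbpedia.org/rest"


def spotlight_url(base, service, text, confidence=0.5):
    """Build a GET URL for a Spotlight REST service (hypothetical helper).

    `service` is either "annotate" or "candidates".
    """
    if service not in ("annotate", "candidates"):
        raise ValueError(f"unknown service: {service}")
    # urlencode percent-encodes the text and turns spaces into '+'.
    query = urlencode({"text": text, "confidence": confidence})
    return f"{base.rstrip('/')}/{service}?{query}"


text = "Berlin in the 1920s was the third largest municipality in the world."
annotate_url = spotlight_url(LUCENE_BASE, "annotate", text)
candidates_url = spotlight_url(LUCENE_BASE, "candidates", text)
```

Passing the statistical endpoint's base URL instead of LUCENE_BASE gives the matching request for the current version, so the two results can be compared side by side.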
On Tue, Mar 17, 2015 at 1:10 PM, Abhishek Gupta <[email protected]> wrote:
> Hi Thiago,
>
> Sorry for the delay!
> I have set up the Spotlight server, and it is running fine, though only
> with minimal settings. After the setup I played with the Spotlight server
> and came across the following discrepancies:
>
> Example taken:
> http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
> 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
> the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
> Reich (1933–45). Berlin in the 1920s was the third largest municipality in
> the world. In 1990 German reunification took place in whole Germany in
> which the city regained its status as the capital of Germany.
>
> 1) If we run this, "13th century" is annotated as
> http://dbpedia.org/page/19th_century. This might be happening because
> the context is largely about the 19th century and, moreover, "13th century"
> and "19th century" differ by only one character.
> I am not sure whether this is good or bad.
> In my opinion, if we have an entity in our store (
> http://dbpedia.org/page/13th_century) that perfectly matches the
> surface form in the raw text ("13th century"), we should annotate the SF
> with that entity.
> The same might be the case with "Germany", which is linked to "History
> of Germany <http://dbpedia.org/page/History_of_Germany>" and not to "Germany
> <http://dbpedia.org/page/Germany>".
>
> 2) We spot "place" and associate it with "Portland Place
> <http://dbpedia.org/resource/Portland_Place>", maybe because the SF is
> stemmed. Even "Location (geography)
> <http://dbpedia.org/page/Location_(geography)>" would not be the correct
> entity type here. The underlying problem is that we cannot detect the sense
> of the word "place" itself, so we may have to use word senses, for example
> from WordNet.
>
> 3) We detect ". Berlin" as a surface form, but I could not find out
> where this SF comes from. I suspect it does not come from
> Wikipedia.
>
> 4) We spotted "capital of Germany", but I did not get any candidates when
> I ran "candidates" instead of "annotate".
>
> 5) We are able to spot "1920s" as a surface form but not "1920".
>
> A few more questions:
> 1) Are we trying to annotate every word, every noun, or only entities (e.g.
> proper nouns) in the raw text? In the above link I found "documented" (a
> word that is neither a noun nor an entity) annotated as
> "http://dbpedia.org/resource/Document".
>
> 2) Are we using surface forms to deal only with syntactic references (e.g.
> surface form "municipality" referring to "Municipality
> <http://dbpedia.org/page/Municipality>" or "Metropolitan_municipality
> <http://dbpedia.org/page/Metropolitan_municipality>" or "
> Municipalities_of_Mexico
> <http://dbpedia.org/page/Municipalities_of_Mexico>"), or with both syntactic
> and semantic references (e.g. aliases like "Third Reich" referring to "Nazi
> Germany <http://dbpedia.org/page/Nazi_Germany>")?
>
> I am working on generating extra possible surface forms from
> a canonical surface form or the entity itself to deal with the problem of
> unseen SF associations.
> I have also started working on my proposal and will submit it soon.
>
> Thanks,
> Abhishek
>
> On Thu, Mar 12, 2015 at 8:20 PM, Thiago Galery <[email protected]> wrote:
>
>> Hi Abhishek, thanks for the contribution. Your suggestions are pretty
>> much aligned with what we were thinking in any event, and the initial plan
>> seems good.
>> On the assumption that there's some code that generates extra possible
>> surface forms from a canonical surface form, as in your 'Michael Jordan' ->
>> 'M. Jordan', 'Jordan' example, it would be worth looking into the
>> literature on Machine Translation for ways to establish a score for the
>> surface form. That is, if you spot 'M Jordan' in the text, what is the
>> probability of it being a translation of the canonical name 'Michael
>> Jordan'? If there's a simple way to implement this, we could try to get
>> the raw data with counts, generate some extra SFs in a principled manner and
>> use that to calculate probabilities. Still, for the moment, I'd focus on
>> setting the Spotlight server up and playing with the warm-up tasks.
>> Thanks for the good work,
>> Thiago
>>
>>
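
To make the surface form idea above concrete, here is a rough sketch of the two steps: generating variants from a canonical name, and turning raw counts into a probability per surface form. It is only a relative-frequency estimate, not a Machine Translation model, and everything in it is an assumption for illustration: the function names, the three variant rules, and above all the counts, which are invented and not taken from any real data.

```python
from collections import Counter


def surface_form_variants(name):
    """Generate simple variants of a canonical name (illustrative rules only)."""
    tokens = name.split()
    variants = {name}
    if len(tokens) >= 2:
        first, last = tokens[0], tokens[-1]
        variants.add(f"{first[0]}. {last}")  # 'Michael Jordan' -> 'M. Jordan'
        variants.add(f"{first[0]} {last}")   # 'Michael Jordan' -> 'M Jordan'
        variants.add(last)                   # 'Michael Jordan' -> 'Jordan'
    return variants


def sf_given_entity(counts):
    """Maximum-likelihood estimate of P(surface form | entity) from counts."""
    total = sum(counts.values())
    return {sf: c / total for sf, c in counts.items()}


variants = surface_form_variants("Michael Jordan")

# Invented counts, for illustration only (not real Wikipedia statistics).
counts = Counter({"Michael Jordan": 80, "Jordan": 15, "M. Jordan": 5})
probs = sf_given_entity(counts)
```

A generated variant that never occurs in the counts (here 'M Jordan') gets no probability at all under this estimate, so some form of smoothing or a translation-style score would still be needed for unseen surface forms.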
>
>
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming The Go Parallel Website,
> sponsored
> by Intel and developed in partnership with Slashdot Media, is your hub for
> all
> things parallel software development, from weekly thought leadership blogs
> to
> news, videos, case studies, tutorials and more. Take a look and join the
> conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> Dbpedia-gsoc mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>
>