[Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-03-17 Thread Thiago Galery
-- Forwarded message --
From: Thiago Galery tgal...@gmail.com
Date: Tue, Mar 17, 2015 at 11:29 AM
Subject: Re: [Dbpedia-gsoc] Contribute to DbPedia
To: Abhishek Gupta a.gu...@gmail.com


Hi Abhishek, thanks for the work, here are some answers:

On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi Thiago,

 Sorry for the delay!
 I have set up the spotlight server and it is running perfectly fine but
 with minimal settings. After this set up I played with the spotlight server,
 during which I came across some discrepancies, as follows:

 Example taken:
 http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
 the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
 Reich (1933–45). Berlin in the 1920s was the third largest municipality in
 the world. In 1990 German reunification took place in whole Germany in
 which the city regained its status as the capital of Germany.

 1) If we run this, we annotate 13th Century to
 http://dbpedia.org/page/19th_century. This might be happening because
 the context is very much about the 19th century, and moreover between 13th
 Century and 19th Century there is minimal surface difference (one character).
 But I am not sure whether this is good or bad.


This might be due to either 13th Century being wrongly linked to 19th
century, or maybe the word century being linked to many different
centuries which then causes a disambiguation error due to the context. I
think your example is a counter-example to the way we generate the data
structures used for disambiguation.


 In my opinion, if we have an entity in our store (
 http://dbpedia.org/page/13th_century) which perfectly matches a
 surface form in the raw text (13th Century), we should have annotated the SF
 to that entity.
 And the same might be the case with Germany, which is associated with History
 of Germany http://dbpedia.org/page/History_of_Germany, not Germany
 http://dbpedia.org/page/Germany.


In this case other factors might have crept in; it could be that Germany
has a larger number of inlinks, or some other metric allows it to
overtake the most natural candidate.



 2) We are spotting place and associating it with Portland Place
 http://dbpedia.org/resource/Portland_Place, maybe due to SF stemming.
 And even Location (geography)
 http://dbpedia.org/page/Location_(geography) is not the correct entity
 type for this. This is because we are not able to detect the sense of the
 word place itself. So for that we may have to use word senses, e.g. from
 WordNet.


The SF spotting pipeline works a bit like this: you get a candidate SF,
like 'Portland Place', and see if there's a candidate for it, but you also
consider n-gram subparts, so it could have retrieved the candidates
associated with place instead.
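As a toy illustration of that n-gram sub-span lookup (not Spotlight's actual code; the surface-form store contents here are made up):

```python
# Toy surface-form store: lowercased SF -> candidate entities.
SF_STORE = {
    "portland place": ["Portland_Place"],
    "place": ["Portland_Place", "Location_(geography)"],
}

def ngrams(tokens, max_n):
    """All contiguous n-grams up to length max_n, longest first."""
    spans = []
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            spans.append(" ".join(tokens[i:i + n]))
    return spans

def spot(text):
    """Return (surface form, candidate entities) pairs found in text."""
    tokens = text.lower().split()
    return [(sf, SF_STORE[sf]) for sf in ngrams(tokens, 2) if sf in SF_STORE]
```

Here a text containing just "place" still retrieves the candidates stored for that unigram, which is how a Portland Place candidate can surface without the full bigram appearing.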



 3) We are detecting . Berlin as a surface form. But I couldn't work
 out where this SF comes from, and I suspect it doesn't come from
 Wikipedia.


Although . Berlin is highlighted, the entity is matched on Berlin; the
extra space and punctuation come from the way we tokenize sentences. We
chose a tokenizer based on a break iterator for its speed and language
independence, but it hasn't been tested very well. This is the area that
explains this mistake, and help here is much appreciated.
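One way to see why the entity still resolves correctly (a sketch under my own assumptions, not the project's tokenizer code) is that the spotted span can be normalized before matching against the surface-form store:

```python
def trim_surface(span):
    """Strip stray whitespace and punctuation dragged in by tokenization,
    so a highlighted span like '. Berlin' still matches the stored
    surface form 'Berlin'."""
    return span.strip(" \t\n.,;:!?()[]\"'")

# '. Berlin' and 'Berlin' now map to the same store key
print(trim_surface(". Berlin"))  # Berlin
```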



 4) We spotted capital of Germany but I didn't get any candidates if we
 run for candidates instead of annotate.


This might be due to the default confidence score. If you pass the extra
confidence param and set it to 0, you will probably see everything, e.g.
/candidates/?confidence=0&text=...
In fact, I suggest you look at all the candidates in the text you used, to
confirm (or not) what I've been saying here.
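For instance (a hypothetical helper; the endpoint and parameter names are taken from the public REST examples above), the request URL can be built like this:

```python
from urllib.parse import urlencode

BASE = "http://spotlight.dbpedia.org/rest"

def candidates_url(text, confidence=0.0):
    """Build a /candidates request with an explicit confidence threshold;
    confidence=0 should surface every candidate."""
    return BASE + "/candidates?" + urlencode({"confidence": confidence,
                                              "text": text})

url = candidates_url("Berlin was the capital of Prussia")
```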



 5) We are able to spot 1920s as a surface form but not 1920.


This is due to the generation/stemming of SFs we have discussed, but
I'm not sure that is a bad example. 1920, if used as a year, might not mean
the same as 1920s.



 Few more questions:
 1) Are we trying to annotate every word, noun or entity (e.g. proper noun)
 in raw text? Because in the above link I found documented (a word, not a
 noun or entity) annotated to http://dbpedia.org/resource/Document.


There are two main spotters: the default one uses a finite state
automaton generated from the surface form store to match incoming words as
valid sequences of states (so in this sense everything goes through the
pipeline); another uses an OpenNLP spotter that gets SFs from an NE
extractor. Both might generate single-noun n-grams. In this case, it could
be that there is a link in Wikipedia from documented to Document, which might
introduce documented as a valid state in the FSA.
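A minimal sketch of that idea, with a token trie standing in for the real FSA and a toy surface-form store (Spotlight's actual store is far larger):

```python
# Toy version of the FSA spotter: each stored surface form is a path
# through a trie of tokens, and spotting walks the trie while scanning
# the text. An entry like "documented" here mirrors how a Wikipedia
# link documented -> Document would add that word as a valid state.
SURFACE_FORMS = {"berlin", "kingdom of prussia", "documented"}

def build_trie(sfs):
    root = {}
    for sf in sfs:
        node = root
        for tok in sf.split():
            node = node.setdefault(tok, {})
        node["$"] = sf  # accepting state marks a complete surface form
    return root

def fsa_spot(text, trie):
    tokens = text.lower().replace(",", " ").split()
    found = []
    for i in range(len(tokens)):
        node = trie
        for j in range(i, len(tokens)):
            if tokens[j] not in node:
                break
            node = node[tokens[j]]
            if "$" in node:
                found.append(node["$"])
    return found
```

Run against the example sentence, this spots "documented" alongside the multi-token "kingdom of prussia", which is exactly the behaviour described above.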


 2) Are we using surface forms to deal with only syntactic references (e.g.
 surface form municipality referring to Municipality
 http://dbpedia.org/page/Municipality or Metropolitan_municipality
 http

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-08 Thread Thiago Galery

 3) 5.16 DBpedia Spotlight - Better Context Vectors

 4) 5.17 DBpedia Spotlight - Better Surface form Matching

 5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores

 But in all these I found a couple of ideas interlinked; in other words,
 one solution might lead to another. In 5.1, 5.16 and 5.17 our primary
 problems are Entity Linking (EL) and Word Sense Disambiguation (WSD) from
 raw text to DBpedia entities, so as to understand raw text and disambiguate
 senses or entities. So if we can address these two tasks efficiently, then
 we can solve the problems associated with these three ideas.

 Following are some methods from the research papers referenced in
 these ideas.

 1) FrameNet: Identify frames (indicating a particular type of situation
 along with its participants, i.e. task, doer and props), then identify
 Lexical Units and their associated Frame Elements using models trained
 primarily on crowd-sourced data. Primarily used for Automatic Semantic Role
 Labeling.

 2) Babelfy: Using a wide semantic network encoding structural and
 lexical information of both encyclopedic and lexicographic types (e.g.
 Wikipedia and WordNet respectively), we can also accomplish our tasks (EL
 and WSD). Here a graph-based method, along with some heuristics, is used to
 extract the most relevant meaning from the text.

 3) Word2vec / Glove - Methods for designing word vectors based on the
 context. These are primarily employed for WSD.

 Moreover, if those problems are solved, we can address keyword search
 (5.9) and Confidence Scoring (5.19) effectively, as both require
 associating entities with the raw text, which provides the concerned entity
 and its attributes for search and for the confidence score.

 So I would like to work on 5.16 or 5.17, which encompass those two
 tasks (EL and WSD), and I would like to ask which method would be best for
 them? In my view, the Babelfy method would be appropriate for both
 tasks.

 Thanks,
 Abhishek Gupta
 On Feb 23, 2015 5:46 PM, Thiago Galery tgal...@gmail.com wrote:

 Hi Abhishek, if you are interested in contributing to any DBpedia
 project or participating in GSoC this year, it might be a good idea to take
 a look at this page: http://wiki.dbpedia.org/gsoc2015/ideas . This
 might help you to specify how/where you can contribute. Hope this helps,
 Thiago

 On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta a.gu...@gmail.com
 wrote:

 Hi all,

 I am Abhishek Gupta, a student of Electrical Engineering at IIT
 Delhi. Recently I have worked on projects related to Machine Learning
 and Natural Language Processing (i.e. Information Extraction), in which I
 extracted Named Entities from raw text to populate a knowledge base with new
 entities. Hence I am inclined to work in this area. Besides this, I am also
 familiar with programming languages, primarily C, C++ and Java.

 So I presume that I can contribute a lot towards extracting structured
 data from Wikipedia, which is one of the primary steps towards DBpedia's
 goal.

 So can anyone please help me out with where to start, so as to
 contribute towards this?

 Regards
 Abhishek Gupta


 --
 Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
 from Actuate! Instantly Supercharge Your Business Reports and
 Dashboards
 with Interactivity, Sharing, Native Excel Exports, App Integration 
 more
 Get technology previously reserved for billion-dollar corporations,
 FREE

 http://pubads.g.doubleclick.net/gampad/clk?id=190641631iu=/4140/ostg.clktrk
 ___
 Dbpedia-gsoc mailing list
 Dbpedia-gsoc@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc




 --
 Dive into the World of Parallel Programming The Go Parallel Website,
 sponsored
 by Intel and developed in partnership with Slashdot Media, is your hub
 for all
 things parallel software development, from weekly thought leadership
 blogs to
 news, videos, case studies, tutorials and more. Take a look and join the
 conversation now. http://goparallel.sourceforge.net/





Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-11 Thread Thiago Galery
 the candidate entities to the function mentioned in step 2 of
 Approach 1.
 4) Pass SoT from the same function (part (a) and (b))
 5) Score candidates using Levenshtein Distance

 Actually in Approach 2 we are doing a bit of disambiguation in Step 1
 itself, which will reduce our count of SFs.

 Please review these ideas and provide your feedback.
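Step 5 above (scoring candidates by Levenshtein distance) could be sketched like this; the surface form and candidate labels are made up for illustration:

```python
def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def rank(surface_form, candidates):
    """Order candidate labels by string closeness to the surface form."""
    return sorted(candidates, key=lambda c: levenshtein(surface_form, c))
```

For example, `rank("13th century", ["19th_century", "13th_century"])` puts the one-edit candidate first.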


I'm not sure whether I understand this entirely, but I'm very interested in
other ways to conceptualise context. Spotlight just uses a simple
distributional method, but you can definitely use the link structure within
Wikipedia to find candidates that are more related to each other. In your
example above, the pair Motorcycle - Harley Davidson would be much more
related than Motorcycle - Hard Drive, for example. However, this would
require coding from scratch, so bear in mind that it might be too much
work.
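One simple version of that link-structure idea (my own sketch, with toy inlink sets rather than real Wikipedia data) scores a candidate pair by the Jaccard overlap of their inlink sets:

```python
# Toy inlink sets: entity -> pages that link to it (illustrative only).
INLINKS = {
    "Motorcycle":      {"Harley-Davidson", "Engine", "Road"},
    "Harley-Davidson": {"Motorcycle", "Engine", "Road"},
    "Hard_Drive":      {"Computer", "Disk"},
}

def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    return len(a & b) / len(a | b) if a | b else 0.0

def relatedness(e1, e2):
    """Score a candidate pair by the overlap of their inlink sets."""
    return jaccard(INLINKS[e1], INLINKS[e2])
```

With these toy sets, Motorcycle and Harley-Davidson share inlinks and score well above the Motorcycle and Hard_Drive pair.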


 Moreover I am trying to set up the server on my PC itself, which is
 taking some time due to a 10 GB file. I will come up with results as soon
 as I get some. Till then I might follow up with some other warm-up
 task related to project ideas 5.15 and 5.16.

 Regards,
 Abhishek


Let us know if you need any help.

All the best,

Thiago Galery


 On Sun, Mar 8, 2015 at 11:47 PM, Thiago Galery tgal...@gmail.com wrote:

 Hi Abhishek, here are some thoughts about some of your questions:

 I would like to ask a few questions:
 1) Are we designing these vectors to use in the disambiguation step of
 Entity Linking (matching a raw-text entity to a KB entity), or is there any
 other task in mind where these vectors can be employed?



 The main focus would be disambiguation, but one could reuse the
 contextual score of the entity to determine how relevant that entity is for
 the text.



 2) At present which model is used for disambiguation in
 dbpedia-spotlight?



 Correct me if I am wrong, but I think that disambiguation is done by
 cosine similarity (on term frequency) between the context surrounding the
 extracted surface form and the context associated with each candidate
 entity associated with that surface form.
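As I understand that step, it can be sketched like this (toy contexts, not the real stores; the second Berlin entity is made up for the example):

```python
from collections import Counter
from math import sqrt

# Toy stored contexts: entity -> term-frequency counts gathered from
# the words surrounding links to that entity.
CONTEXTS = {
    "Berlin":       Counter(["capital", "germany", "prussia", "city"]),
    "Berlin,_N.H.": Counter(["town", "new", "hampshire", "river"]),
}

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(context_tokens, candidates):
    """Pick the candidate whose stored context best matches the text."""
    ctx = Counter(context_tokens)
    return max(candidates, key=lambda e: cosine(ctx, CONTEXTS[e]))
```

Given the context tokens "the capital of germany", the German capital wins over the made-up New Hampshire town.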


 3) Are we trying to focus on modelling context vectors primarily for
 infrequent words, as there might not be enough information for them, making
 them difficult to model?


 The problem is not related to frequent words per se, but more about how
 the context for each entity is determined. The map-reduce job that
 generates the stats used by Spotlight extracts the surrounding words
 (according to a window and other constraints) of each link to an entity and
 counts them, which means that heavily linked entities have a larger context
 than less frequently linked ones. This creates a heavy bias for
 disambiguating certain entities, hence a case where smoothing might be a
 good call.
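One simple way to soften that bias (an assumption on my part, not what Spotlight currently does) is add-alpha smoothing over the context counts:

```python
from collections import Counter

def smoothed_prob(term, context, vocab_size, alpha=1.0):
    """P(term | entity) with add-alpha (Laplace) smoothing, so entities
    with small contexts don't assign hard zeros to unseen terms."""
    total = sum(context.values())
    return (context[term] + alpha) / (total + alpha * vocab_size)

# Toy contexts: a heavily linked entity vs. a rarely linked one.
heavy = Counter({"capital": 50, "germany": 40, "city": 30})
light = Counter({"town": 2})

# An unseen term still gets a small non-zero probability:
p_unseen = smoothed_prob("river", light, vocab_size=1000)
```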





 Regarding Project 5.16 (DBpedia Spotlight - Better Surface form Matching
 ):

 *How to deal with linguistic variation: lowercase/uppercase surface
 forms, determiners, accents, unicode, in a way such that the right
 generalizations can be made and some form of probabilistic structure can
 be determined in a principled way?*
 For dealing with linguistic variations we can calculate the lexical
 translation probability from all probable name mentions to entities in the
 KB, as shown in the Entity Name Model in [2].

 *Improve the memory footprint of the stores that hold the surface forms
 and their associated entities.*
 In what respect are we planning to improve the footprint: in terms of
 space, association, or something else?

 For this project I have a couple of questions in mind:
 1) Are we planning to improve the same model that we are using in
 dbpedia-spotlight for entity linking?


 Yes


 2) If not we can change the whole model itself to something else like:
 a) Generative Model [2]
 b) Discriminative Model [3]
 c) Graph Based [4] - Babelfy
 d) Probabilistic Graph Based


 Incorporating something like (c) or (d) might be a good call, but might
 be way bigger than one summer.



 3) Why are we planning to store surface forms with associated entities
 instead of finding associated entities during disambiguation itself?


 Not sure what you mean by that.


 Besides this I would also like to know regarding warm-up task I have to
 do.


 If you check the pull request page in Spotlight, @dav009 has a PR which he
 claims to be a mere *Idea* but which forces surface forms to be stemmed
 before storing. Pulling from that branch, recompiling, running Spotlight and
 looking at some of the results would be a good start. You can also nag us on
 that issue about ideas you might have once you understand the code.




 Thanks,
 Abhishek Gupta

 [1]
 https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit?usp=sharing
 [2] https://aclweb.org/anthology/P/P11/P11-1095.pdf
 [3]
 http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
 [4]
 http://wwwusers.di.uniroma1.it/~moro/MoroRaganatoNavigli_TACL2014.pdf
 [5] http://www.aclweb.org

Re: [Dbpedia-gsoc] GSOC 2015 - Introduction

2015-03-23 Thread Thiago Galery
Hi Vasanth, I suggest taking a look at the previous messages in the
mailing list archives and checking out the discussion there, so you have a
better idea of what to do. Bear in mind that the submission date is really
close, so you'd need to look into this asap.
All the best,
Thiago

On Mon, Mar 23, 2015 at 5:07 PM, Vasanth Kalingeri 
vasanth.kaling...@gmail.com wrote:

 Hi,
 My name is Vasanth Kalingeri. I am a 3rd-year undergrad in
 computer science, pursuing my engineering at SJCE Mysore. I have completed
 a course on machine learning on Coursera, which further led me toward an
 interest in NLP. I have also been freelancing for 2 years.
 My interest in NLP grew primarily when I wanted a knowledge base
 built from a given corpus of text, so that it could answer questions on the
 corpus. This led me to DBpedia and further to topic 5.1.
 I am extremely interested in building such a system to extract
 facts from a corpus. Will get working on the warm-up tasks soon.
 Regards,
 Vasanth






Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-03-23 Thread Thiago Galery
Hi Abhishek, I suggest you submit a proposal straight to GSoC and we will
comment there. If you have done so already, could you send us the link?
All the best,
Thiago

On Mon, Mar 23, 2015 at 8:54 AM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi all,

 Here are some comments for your response:


 Hi Abhishek, thanks for the work, here are some answers:

 On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta a.gu...@gmail.com
 wrote:

 Hi Thiago,

 Sorry for the delay!
 I have set up the spotlight server and it is running perfectly fine but
 with minimal settings. After this set up I played with the spotlight server,
 during which I came across some discrepancies, as follows:

 Example taken:
 http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
 the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
 Reich (1933–45). Berlin in the 1920s was the third largest municipality in
 the world. In 1990 German reunification took place in whole Germany in
 which the city regained its status as the capital of Germany.

 1) If we run this, we annotate 13th Century to
 http://dbpedia.org/page/19th_century. This might be happening because
 the context is very much about the 19th century, and moreover between 13th
 Century and 19th Century there is minimal surface difference (one character).
 But I am not sure whether this is good or bad.


 This might be due to either 13th Century being wrongly linked to 19th
 century, or maybe the word century being linked to many different
 centuries which then causes a disambiguation error due to the context. I
 think your example is a counter-example to the way we generate the data
 structures used for disambiguation.


 In my opinion, if we have an entity in our store (
 http://dbpedia.org/page/13th_century) which perfectly matches a
 surface form in the raw text (13th Century), we should have annotated the
 SF to that entity.
 And the same might be the case with Germany, which is associated with
 History of Germany http://dbpedia.org/page/History_of_Germany, not Germany
 http://dbpedia.org/page/Germany.


 In this case other factors might have crept in; it could be that Germany
 has a larger number of inlinks, or some other metric allows it to
 overtake the most natural candidate.



 2) We are spotting place and associating it with Portland Place
 http://dbpedia.org/resource/Portland_Place, maybe due to SF
 stemming. And even Location (geography)
 http://dbpedia.org/page/Location_(geography) is not the correct
 entity type for this. This is because we are not able to detect the sense
 of the word place itself. So for that we may have to use word senses,
 e.g. from WordNet.


 The SF spotting pipeline works a bit like this: you get a candidate SF,
 like 'Portland Place', and see if there's a candidate for it, but you also
 consider n-gram subparts, so it could have retrieved the candidates
 associated with place instead.


 I understand what you said, but here I wanted to point out that
 place is a common noun, and we are trying to associate it with a Named
 Entity, which is a proper noun.





 3) We are detecting . Berlin as a surface form. But I couldn't work
 out where this SF comes from, and I suspect it doesn't come from
 Wikipedia.


 Although . Berlin is highlighted, the entity is matched on Berlin;
 the extra space and punctuation come from the way we tokenize sentences.
 We chose a tokenizer based on a break iterator for its speed and language
 independence, but it hasn't been tested very well. This is the area that
 explains this mistake, and help here is much appreciated.


 Thanks for clarification.





 4) We spotted capital of Germany but I didn't get any candidates if
 we run for candidates instead of annotate.


 This might be due to the default confidence score. If you pass the extra
 confidence param and set it to 0, you will probably see everything, e.g.
 /candidates/?confidence=0&text=...
 In fact, I suggest you look at all the candidates in the text you used,
 to confirm (or not) what I've been saying here.


 I tried to do that but I still didn't get any entity candidates for
 capital of Germany.




 5) We are able to spot 1920s as a surface form but not 1920.


 This is due to the generation/stemming of SFs we have discussed,
 but I'm not sure that is a bad example. 1920, if used as a year, might not
 mean the same as 1920s.


 This was my mistake.





 Few more questions:
 1) Are we trying to annotate every word, noun or entity (e.g. proper
 noun) in raw text? Because in the above link I found documented (a word,
 not a noun or entity) annotated to
 http://dbpedia.org/resource/Document.


 There are two main spotters: the default one uses a finite state
 automaton generated from the surface form store to match incoming words as
 valid sequences of states (so in this sense everything goes through the
 pipeline); another that uses 

Re: [Dbpedia-gsoc] Re-Introduction

2016-03-01 Thread Thiago Galery
Hi Felix, welcome back. We might add a few warm-up tasks soon. Philipp's
project might be merged into the dev branch soon and will provide a good
basis for future additions. If you are interested in Spotlight, are there
any ideas on what aspects of it you'd like to concentrate on?

On Tue, Mar 1, 2016 at 12:36 PM, Marco Fossati 
wrote:

> Hey Felix, welcome back!
>
> Marco
>
> On 3/1/16 02:13, Felix Sonntag wrote:
> > Hi everyone,
> >
> > I’m Felix, I already introduced myself last year, but I guess I’ll
> shortly reintroduce myself. I’m a Master student in Informatics at TUM in
> Munich. I’m pretty excited about the DBpedia project: I’m using Spotlight
> for a project about analyzing artist data at the moment, and I’ve used
> DBpedia data for an app last year. I already tried to participate in GSOC
> with you last year, unfortunately it didn’t work out (apparently it was
> really close :P).
> >
> > I’ve just finished my first Master semester, with a focus on ML,
> Data Analytics and NLP in my studies.
> >
> > There are some project ideas I’m keen on working on, but I’ll directly
> post on the project sites.
> >
> > One general question: for the Spotlight project there exists only a
> rough idea by Philipp, and there are also no warm up tasks. Can we expect
> more from that? :)
> >
> > Best,
> > Felix
> >
> --
> > Site24x7 APM Insight: Get Deep Visibility into Application Performance
> > APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> > Monitor end-to-end web transactions and take corrective actions now
> > Troubleshoot faster and improve end-user experience. Signup Now!
> > http://pubads.g.doubleclick.net/gampad/clk?id=272487151=/4140
> > ___
> > Dbpedia-gsoc mailing list
> > Dbpedia-gsoc@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
> >
>
>