Hi all,
I would like to share my views on your questions regarding project ideas
5.15 (DBpedia Spotlight - Better Context Vectors) and 5.16 (DBpedia Spotlight
- Better Surface form Matching), and to raise a few questions of my own.
Regarding Project 5.15 (DBpedia Spotlight - Better Context Vectors):
*Does smoothing/pruning offer a significant improvement on Spotlight's
performance?*
Smoothing is generally recommended whenever a method relies on counting
objects (words, bigrams, etc.) in a probabilistic model of context vectors.
There are many techniques we could investigate, from simple ones like
add-one (Laplace) smoothing to more advanced ones like Good-Turing,
Kneser-Ney and back-off models, which were covered in the Stanford NLP
course on Coursera. In most cases smoothing does improve test results,
but the improvement may not be very significant.
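To make the simplest option concrete, here is a minimal sketch of add-one
(Laplace) smoothing over context word counts. The function name, the toy
tokens and the vocabulary are my own assumptions for illustration; this is
not how Spotlight currently stores or smooths its counts.

```python
from collections import Counter

def laplace_prob(counts, vocab_size, token, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * |V|)."""
    total = sum(counts.values())
    return (counts[token] + alpha) / (total + alpha * vocab_size)

# Toy context for one entity (made-up data, for illustration only).
context_tokens = ["berlin", "capital", "germany", "berlin"]
vocab = {"berlin", "capital", "germany", "paris"}

counts = Counter(context_tokens)
p_seen = laplace_prob(counts, len(vocab), "berlin")    # seen twice
p_unseen = laplace_prob(counts, len(vocab), "paris")   # never seen
print(p_seen, p_unseen)
```

The unseen word "paris" gets a small non-zero probability instead of zero,
and the smoothed probabilities over the vocabulary still sum to one.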
As far as pruning of word vectors is concerned, it may or may not improve
our results; it depends on the task we are considering. More dimensions
usually produce better results, but as shown on page 14 of the word2vec
NIPS slides [1], the CBOW model with 300 dimensions performs better than
the Skip-gram model with 1000 dimensions. One thing we must keep in mind
here is that fewer dimensions save both time and space.
*What distributional methods can be used to represent context (e.g.
word2vec / Glove)? Do they offer a significant performance improvement?*
For the representation of context we have the following options:
1) Matrix Factorization Models - Latent Semantic Indexing, Latent Dirichlet
Allocation
2) Clustering Based Models - Brown Clustering (Brown et al. 1992), Exchange
Clustering (Martin et al. 1998, Clark 2003)
3) Distributed Representations - word2vec (Continuous Skip-Gram Model,
Continuous Bag-of-Words Model)
4) Log-Bilinear Model - GloVe
Of all these I would prefer word2vec, because there is a Recursive Neural
Network based algorithm (Socher et al. 2014) for representing phrases, i.e.
the context itself, in the word vector space.
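As a rough illustration of how word vectors could be used in
disambiguation, the sketch below averages the vectors of the context words
and ranks candidate entities by cosine similarity. Averaging is a much
simpler stand-in for the recursive composition mentioned above, and the
2-dimensional vectors and entity names are made up; real vectors would come
from a trained word2vec model.

```python
import math

def average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy word vectors (not from a trained model).
toy_vectors = {"river": [1.0, 0.0], "water": [0.8, 0.2], "money": [0.0, 1.0]}

# Context around the mention "bank": the words "river" and "water".
context = average([toy_vectors["river"], toy_vectors["water"]])

# Hypothetical candidate entities with made-up vectors.
candidates = {"Bank_(geography)": [1.0, 0.1], "Bank_(finance)": [0.1, 1.0]}
best = max(candidates, key=lambda e: cosine(context, candidates[e]))
print(best)
```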
*Is there any other metric to intermediate the measured similarity between
entity candidates and the context around the mention?*
The following are some other options that we can employ in matrix
factorization models:
1) Term-document - http://en.wikipedia.org/wiki/Latent_semantic_indexing
2) Term-term - HAL (Lund and Burgess 1996), the entropy-based COALS method
(Rohde et al. 2006), the PPMI-based method (Bullinaria and Levy), Hellinger
PCA (Lebret and Collobert 2014)
Almost all of these methods are discussed in the GloVe paper, which compares
them on different datasets and tasks.
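As one concrete instance of the term-term options, here is a minimal sketch
of PPMI weighting, i.e. max(0, log(P(w,c) / (P(w) P(c)))), computed from raw
co-occurrence counts. The counts are made up, and the dictionary layout is
only an assumption for illustration.

```python
import math

def ppmi(cooc):
    """cooc maps (word, context_word) -> co-occurrence count.

    Returns a dict with the positive PMI score for every observed pair.
    """
    total = sum(cooc.values())
    word_totals, ctx_totals = {}, {}
    for (w, c), n in cooc.items():
        word_totals[w] = word_totals.get(w, 0) + n
        ctx_totals[c] = ctx_totals.get(c, 0) + n
    scores = {}
    for (w, c), n in cooc.items():
        # P(w,c) / (P(w) * P(c)) simplifies to n * total / (n_w * n_c).
        pmi = math.log((n * total) / (word_totals[w] * ctx_totals[c]))
        scores[(w, c)] = max(0.0, pmi)
    return scores

# Toy counts (made-up data).
cooc = {
    ("berlin", "germany"): 4,
    ("berlin", "the"): 4,
    ("paris", "france"): 4,
    ("paris", "the"): 4,
}
scores = ppmi(cooc)
print(scores)
```

In this toy example the uninformative context word "the" gets a score of
zero for both cities, while "germany" and "france" keep a positive weight.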
I would also like to ask a few questions:
1) Are we designing these vectors for the disambiguation step of Entity
Linking (matching a raw-text entity mention to a KB entity), or is there
any other task in mind where these vectors could be employed?
2) Which model is currently used for disambiguation in dbpedia-spotlight?
3) Are we primarily trying to model context vectors for infrequent words,
which are difficult to model because there might not be enough information
about them?
Regarding Project 5.16 (DBpedia Spotlight - Better Surface form Matching):
*How to deal with linguistic variation: lowercase/uppercase surface forms,
determiners, accents, unicode, in a way such that the right generalizations
can be made and some form of probabilistic structured can be determined in
a principled way?*
To deal with linguistic variation we can calculate lexical translation
probabilities from all probable name mentions to entities in the KB, as
shown in the Entity Name Model in [2].
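Independently of the probabilistic model in [2], a simple first step could
be to map surface-form variants onto a shared lookup key. The sketch below
is only an assumption about what such a normalization could look like
(case folding, accent stripping, dropping a leading English determiner); it
is not the name model from [2] and not Spotlight's current behaviour.

```python
import unicodedata

# Assumption: a tiny English-only determiner list, for illustration.
DETERMINERS = {"the", "a", "an"}

def normalize_surface_form(sf):
    """Return a case-, accent- and determiner-insensitive lookup key."""
    # Decompose characters so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", sf)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = stripped.casefold().split()
    # Drop a leading determiner, but never reduce the key to nothing.
    if len(tokens) > 1 and tokens[0] in DETERMINERS:
        tokens = tokens[1:]
    return " ".join(tokens)

for sf in ["The Beatles", "beatles", "São Paulo", "SAO PAULO"]:
    print(sf, "->", normalize_surface_form(sf))
```

Such a key is lossy by design, so the original surface form should still be
kept alongside it if probabilities are to be estimated per variant.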
*Improve the memory footprint of the stores that hold the surface forms and
their associated entities.*
In what respect are we planning to improve the footprint: in terms of
space, association, or something else?
For this project I have a couple of questions in mind:
1) Are we planning to improve the same model that we currently use in
dbpedia-spotlight for entity linking?
2) If not, we could change the whole model itself to something else, like:
a) Generative Model [2]
b) Discriminative Model [3]
c) Graph Based [4] - Babelfy
d) Probabilistic Graph Based
3) Why are we planning to store surface forms with their associated
entities instead of finding the associated entities during disambiguation
itself?
Besides this, I would also like to know about the warm-up task I have to do.
Thanks,
Abhishek Gupta
[1] https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit?usp=sharing
[2] https://aclweb.org/anthology/P/P11/P11-1095.pdf
[3] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
[4] http://wwwusers.di.uniroma1.it/~moro/MoroRaganatoNavigli_TACL2014.pdf
[5] http://www.aclweb.org/anthology/D11-1072
On Tue, Mar 3, 2015 at 2:01 AM, David Przybilla <[email protected]>
wrote:
> Hi Abhisek,
>
> There is a lot of experimentation which can be done with both 5.16 and
> 5.17.
>
> In my opinion the current problem is that the Surface Form(SF) matching is
> a bit poor.
> Mixing the Babelfy Superstring matching with other ideas to make SF
> spotting better could be a great start.
> You can also bring ideas from papers such as [1] in order to address more
> linguistic variations.
>
> It's hard to debate which one is better, however you can mix ideas i.e:
> use superstring matching to greedy match more Surface forms with more
> linguistic variations, while using word2vec in the disambiguation stage.
>
> Feel free to poke me if you would like to discuss in more detail :)
>
>
> [1] https://aclweb.org/anthology/P/P11/P11-1095.pdf
>
>
>
>
>
> On Mon, Mar 2, 2015 at 7:21 PM, Abhishek Gupta <[email protected]> wrote:
>
>> Hi all,
>>
>> Recently I checked out the DBpedia ideas list for GSoC 2015, and I
>> must admit that every idea is more interesting than the previous one.
>> While looking for ideas that interest me, I found the following most
>> fascinating; I wish I could work on all of them, but unfortunately I
>> cannot:
>>
>> 1) 5.1 Fact Extraction from Wikipedia Text
>>
>> 2) 5.9 Keyword Search on DBpedia Data
>>
>> 3) 5.16 DBpedia Spotlight - Better Context Vectors
>>
>> 4) 5.17 DBpedia Spotlight - Better Surface form Matching
>>
>> 5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores
>>
>> But among these I found several ideas interlinked; in other words, one
>> solution might lead to another. For example, in 5.1, 5.16 and 5.17 our
>> primary problems are Entity Linking (EL) and Word Sense Disambiguation
>> (WSD) from raw text to DBpedia entities, so as to understand raw text and
>> disambiguate senses or entities. So if we can address these two tasks
>> efficiently, then we can solve the problems associated with these three
>> ideas.
>>
>> The following are some methods from the research papers referenced by
>> these ideas.
>>
>> 1) FrameNet: Identify frames (each indicating a particular type of
>> situation along with its participants, i.e. task, doer and props), and
>> then identify Lexical Units and their associated Frame Elements by using
>> models trained primarily on crowd-sourced data. Primarily used for
>> Automatic Semantic Role Labeling.
>>
>> 2) Babelfy: Using a wide semantic network that encodes structural and
>> lexical information from both encyclopedic and lexicographic resources
>> (Wikipedia and WordNet, respectively), we can also accomplish our tasks
>> (EL and WSD). Here a graph-based method, along with some heuristics, is
>> used to extract the most relevant meaning from the text.
>>
>> 3) Word2vec / Glove - Methods for designing word vectors based on the
>> context. These are primarily employed for WSD.
>>
>> Moreover, if those problems are solved, then we can address keyword
>> search (5.9) and Confidence Scoring (5.19) effectively, as both require
>> associating entities with the raw text, which provides the relevant
>> entity and its attributes to search with, as well as the confidence score.
>>
>> So I would like to work on 5.16 or 5.17, which encompass those two
>> tasks (EL and WSD), and I would like to ask which method would be the
>> best for them. In my opinion the Babelfy method would be appropriate
>> for both of these tasks.
>>
>> Thanks,
>> Abhishek Gupta
>> On Feb 23, 2015 5:46 PM, "Thiago Galery" <[email protected]> wrote:
>>
>>> Hi Abishek, if you are interested in contributing to any DBpedia project
>>> or participating in Gsoc this year it might be a good idea to take a look
>>> at this page http://wiki.dbpedia.org/gsoc2015/ideas . This might help
>>> you to specify how/where you can contribute. Hope this helps,
>>> Thiago
>>>
>>> On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am Abhishek Gupta, a student of Electrical Engineering at IIT
>>>> Delhi. Recently I have worked on projects related to Machine Learning
>>>> and Natural Language Processing (i.e. Information Extraction), in which
>>>> I extracted Named Entities from raw text to populate a knowledge base
>>>> with new entities. Hence I am inclined to work in this area. Besides
>>>> this, I am also familiar with programming languages, primarily C, C++
>>>> and Java.
>>>>
>>>> So I presume that I can contribute a lot towards extracting structured
>>>> data from Wikipedia, which is one of the primary steps towards DBpedia's
>>>> primary goal.
>>>>
>>>> So can anyone please help me figure out where to start so as to
>>>> contribute towards this?
>>>>
>>>> Regards
>>>> Abhishek Gupta
>>>>
>>>>
>>>> _______________________________________________
>>>> Dbpedia-gsoc mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>>
>>>>
>>>
>>
>>
>>
>