[Dbpedia-gsoc] Fwd: Contribute to DbPedia
-- Forwarded message -- From: Thiago Galery tgal...@gmail.com Date: Tue, Mar 17, 2015 at 11:29 AM Subject: Re: [Dbpedia-gsoc] Contribute to DbPedia To: Abhishek Gupta a.gu...@gmail.com

Hi Abishek, thanks for the work, here are some answers:

On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta a.gu...@gmail.com wrote:

Hi Thiago, Sorry for the delay! I have set up the Spotlight server and it is running perfectly fine, albeit with minimal settings. After this setup I played with the Spotlight server, during which I came across some discrepancies, as follows.

Example taken: http://spotlight.dbpedia.org/rest/annotate?text=First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. In 1990 German reunification took place in whole Germany in which the city regained its status as the capital of Germany.

1) If we run this, we annotate 13th Century with http://dbpedia.org/page/19th_century. This might be happening because the context is very much about the 19th century, and moreover between 13th Century and 19th Century there is minimal syntactic difference (one letter). But I am not sure whether this is good or bad.

This might be due to either 13th Century being wrongly linked to 19th century, or maybe the word century being linked to many different centuries, which then causes a disambiguation error due to the context. I think your example is a counter-example to the way we generate the data structures used for disambiguation.

In my opinion, if we have an entity in our store (http://dbpedia.org/page/13th_century) which perfectly matches a surface form in the raw text (13th Century), we should have annotated the SF with that entity. And the same might be the case with Germany, which is associated with History of Germany http://dbpedia.org/page/History_of_Germany, not Germany http://dbpedia.org/page/Germany.
In this case other factors might have crept in; it could be that Germany has a bigger number of inlinks, or some other metric that allows it to overtake the most natural candidate.

2) We are spotting place and associating it with Portland Place http://dbpedia.org/resource/Portland_Place, maybe due to stemming the SF. And even Location (geography) http://dbpedia.org/page/Location_(geography) is not the correct entity type for this. This is because we are not able to detect the sense of the word place itself. So for that we may have to use word senses, e.g. from WordNet.

The SF spotting pipeline works a bit like this: you get a candidate SF, like 'Portland Place', and see if there's a candidate for that, but you also consider n-gram subparts, so it could have retrieved the candidates associated with place instead.

3) We are detecting . Berlin as a surface form. But I couldn't figure out where this SF comes from, and I suspect this SF doesn't come from Wikipedia.

Although . Berlin is highlighted, the entity is matched on Berlin; the extra space and punctuation come from the way we tokenize sentences. We have chosen a language-independent tokenizer using a break iterator, for speed and language independence, but it hasn't been tested very well. This is the area which explains this mistake, and help with it is much appreciated.

4) We spotted capital of Germany, but I didn't get any candidates if we run for candidates instead of annotate. This might be due to a default confidence score.

If you pass the extra confidence param and set it to 0, you will probably see everything, e.g. /candidates/?confidence=0&text= In fact, I suggest you look at all the candidates in the text you used, to confirm (or not) what I've been saying here.

5) We are able to spot 1920s as a surface form but not 1920.

This is due to the generation/stemming of SFs we have discussed, but I'm not sure that is a bad example. 1920, if used as a year, might not mean the same as 1920s.
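The confidence filtering described above can be illustrated with a toy sketch. This is not Spotlight's actual scoring code; the candidate names and scores below are invented purely to show why a surface form can return no candidates at the default threshold yet reappear with confidence=0:

```python
# Toy illustration: a confidence threshold filters out weak candidates.
# Scores and entries are made up, not real Spotlight output.
candidates = {
    "Berlin": 0.92,
    "Capital_city": 0.11,  # hypothetical low-scoring candidate
}

def filter_candidates(scored, confidence):
    """Keep only candidates whose score clears the confidence threshold."""
    return {uri: s for uri, s in scored.items() if s >= confidence}

print(filter_candidates(candidates, 0.5))  # only Berlin survives
print(filter_candidates(candidates, 0.0))  # everything is returned
```

With a non-zero default confidence, low-scoring candidates such as the one for "capital of Germany" would simply be dropped before the response is built.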
Few more questions: 1) Are we trying to annotate every word, noun, or entity (e.g. proper noun) in the raw text? Because in the above link I found documented (a word, not a noun or entity) annotated with http://dbpedia.org/resource/Document.

There are two main spotters: the default one uses a finite state automaton, generated from the surface form store, to match incoming words as a valid sequence of states (so in this sense everything goes through the pipeline); another uses an OpenNLP spotter that gets SFs from a NE extractor. Both might generate single-noun n-grams. In this case, it could be that there is a link in Wikipedia, documented - Document, which might introduce documented as a valid state in the FSA.

2) Are we using surface forms to deal with only syntactic references (e.g. surface form municipality referring to Municipality http://dbpedia.org/page/Municipality or Metropolitan_municipality http
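The FSA-based spotting described above can be sketched with a toy trie over token sequences, where trie nodes play the role of automaton states and we emit the longest accepting match. This is a simplification under invented surface forms; the real Spotlight store is built from Wikipedia anchors and is far more elaborate:

```python
# Toy FSA/trie spotter: surface forms become paths in a trie; spotting
# walks the input tokens and keeps the longest match starting at each
# position. Surface forms here are invented for illustration.
surface_forms = ["Portland Place", "Berlin", "capital of Germany", "place"]

def build_trie(forms):
    root = {}
    for form in forms:
        node = root
        for tok in form.split():
            node = node.setdefault(tok, {})
        node["$end"] = form  # accepting state stores the matched form
    return root

def spot(tokens, trie):
    spots = []
    for i in range(len(tokens)):
        node, match = trie, None
        for tok in tokens[i:]:
            if tok not in node:
                break
            node = node[tok]
            if "$end" in node:
                match = node["$end"]  # remember longest match so far
        if match:
            spots.append(match)
    return spots

tokens = "Berlin is the capital of Germany".split()
print(spot(tokens, build_trie(surface_forms)))  # ['Berlin', 'capital of Germany']
```

Note how a single-noun surface form like "place" would also match as an n-gram subpart of a longer span starting elsewhere, which is exactly how documented or place can slip into the pipeline.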
Re: [Dbpedia-gsoc] Contribute to DbPedia
3) 5.16 DBpedia Spotlight - Better Context Vectors 4) 5.17 DBpedia Spotlight - Better Surface form Matching 5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores

But in all of these I found a couple of ideas interlinked; in other words, one solution might lead to another. As in 5.1, 5.16 and 5.17, our primary problems are Entity Linking (EL) and Word Sense Disambiguation (WSD) from raw text to DBpedia entities, so as to understand raw text and disambiguate senses or entities. So if we can address these two tasks efficiently, then we can solve the problems associated with these three ideas. Following are some methods found in the research papers mentioned in the references of these ideas:

1) FrameNet: Identify frames (indicating a particular type of situation along with its participants, i.e. task, doer and props), then identify lexical units and their associated frame elements, using models trained primarily on crowd-sourced data. Primarily used for automatic semantic role labeling.

2) Babelfy: Using a wide semantic network encoding both encyclopedic and lexicographic structural and lexical information (like Wikipedia and WordNet respectively), we can also accomplish our tasks (EL and WSD). Here a graph-based method, along with some heuristics, is used to extract the most relevant meaning from the text.

3) Word2vec / GloVe: Methods for deriving word vectors from context. These are primarily employed for WSD.

Moreover, if those problems are solved, then we can address keyword search (5.9) and confidence scoring (5.19) effectively, as both require associating entities with the raw text, which will provide the entity concerned and its attributes to search with, along with the confidence score. So I would like to work on 5.16 or 5.17, which encompass those two tasks (EL and WSD), and I would like to ask which method would be best for these two tasks? In my view, it is the Babelfy method that would be appropriate for both.
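The word2vec/GloVe idea in point 3 can be sketched as follows: represent each candidate sense and the mention's surrounding words as vectors, then pick the sense closest to the averaged context. The two-dimensional vectors below are hand-made stand-ins for real trained embeddings, and the sense names are illustrative only:

```python
# Toy sketch of embedding-based WSD: average the context word vectors
# and choose the sense with the highest cosine similarity. Vectors are
# tiny invented stand-ins for real word2vec/GloVe embeddings.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# dimensions: (finance-ness, geography-ness), purely illustrative
sense_vectors = {
    "Bank_(finance)": (0.9, 0.1),
    "Bank_(river)": (0.1, 0.9),
}
word_vectors = {"money": (1.0, 0.0), "deposit": (0.8, 0.2), "river": (0.0, 1.0)}

def context_vector(words):
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def disambiguate(context):
    ctx = context_vector(context)
    return max(sense_vectors, key=lambda s: cosine(sense_vectors[s], ctx))

print(disambiguate(["money", "deposit"]))  # Bank_(finance)
print(disambiguate(["river"]))             # Bank_(river)
```

The same machinery carries over to entity linking: replace sense vectors with per-entity context vectors and the mention context stays the same.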
Thanks, Abhishek Gupta

On Feb 23, 2015 5:46 PM, Thiago Galery tgal...@gmail.com wrote: Hi Abishek, if you are interested in contributing to any DBpedia project or participating in GSoC this year, it might be a good idea to take a look at this page: http://wiki.dbpedia.org/gsoc2015/ideas . This might help you to specify how/where you can contribute. Hope this helps, Thiago

On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta a.gu...@gmail.com wrote: Hi all, I am Abhishek Gupta, a student of Electrical Engineering at IIT Delhi. Recently I have worked on projects related to Machine Learning and Natural Language Processing (i.e. Information Extraction), in which I extracted named entities from raw text to populate a knowledge base with new entities. Hence I am inclined to work in this area. Besides this, I am also familiar with programming languages, primarily C, C++ and Java. So I presume that I can contribute a lot towards extracting structured data from Wikipedia, which is one of the primary steps towards DBpedia's primary goal. Can anyone please help me out with where to start so as to contribute towards this? Regards, Abhishek Gupta

___ Dbpedia-gsoc mailing list Dbpedia-gsoc@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
Re: [Dbpedia-gsoc] Contribute to DbPedia
the candidate entities to the function mentioned in step 2 of Approach 1. 4) Pass the SoT from the same function (parts (a) and (b)). 5) Score candidates using Levenshtein distance. Actually, in Approach 2 we are doing a bit of disambiguation in step 1 itself, which will reduce our count of SFs. Please review these ideas and provide your feedback.

I'm not sure whether I understand this entirely, but I'm very interested in other ways to conceptualise context. Spotlight just uses a simple distributional method, but you can definitely use the link structure within Wikipedia to find candidates that are more related to each other. In your example above, the pair Motorcycle - Harley Davidson would be much more related than Motorcycle - Hard Drive, for example. However, this would require coding from scratch, so bear in mind that it might be too much work.

Moreover, I am trying to set up the server on my PC itself, which is taking some time due to a 10Gb file. I will come back with results as soon as I get some. Till then I might follow up with some other warm-up task related to project ideas 5.15 and 5.16. Regards, Abhishek

Let us know if you need any help. All the best, Thiago Galery

On Sun, Mar 8, 2015 at 11:47 PM, Thiago Galery tgal...@gmail.com wrote: Hi Abhishek, here are some thoughts about some of your questions:

I would like to ask a few questions: 1) Are we designing these vectors for use in the disambiguation step of Entity Linking (matching a raw-text entity to a KB entity), or is there any other task we have in mind where these vectors can be employed?

The main focus would be disambiguation, but one could reuse the contextual score of the entity to determine how relevant that entity is to the text.

2) At present, which model is used for disambiguation in dbpedia-spotlight?
Correct me if I am wrong, but I think that disambiguation is done by cosine similarity (on term frequency) between the context surrounding the extracted surface form and the context associated with each candidate entity for that surface form.

3) Are we trying to focus on modelling context vectors primarily for infrequent words, as there might not be enough information for them, making them difficult to model?

The problem is not related to frequent words per se, but more about how the context for each entity is determined. The map-reduce job that generates the stats used by Spotlight extracts the surrounding words (according to a window and other constraints) of each link to an entity and counts them, which means that heavily linked entities have a larger context than not-so-frequently linked ones. This creates a heavy bias towards disambiguating certain entities, hence a case where smoothing might be a good call.

Regarding project 5.16 (DBpedia Spotlight - Better Surface form Matching): *How to deal with linguistic variation: lowercase/uppercase surface forms, determiners, accents, unicode, in a way such that the right generalizations can be made and some form of probabilistic structure can be determined in a principled way?*

For dealing with linguistic variations, we can calculate the lexical translation probability from all probable name mentions to entities in the KB, as shown in the Entity Name Model in [2].

*Improve the memory footprint of the stores that hold the surface forms and their associated entities.*

In what respect are we planning to improve the footprint: in terms of space, association, or something else?

For this project I have a couple of questions in mind: 1) Are we planning to improve the same model that we are using in dbpedia-spotlight for entity linking?
Yes.

2) If not, we can change the whole model itself to something else, like: a) Generative Model [2] b) Discriminative Model [3] c) Graph Based [4] - Babelfy d) Probabilistic Graph Based

Incorporating something like (c) or (d) might be a good call, but might be way bigger than one summer.

3) Why are we planning to store surface forms with associated entities, instead of finding the associated entities during disambiguation itself?

Not sure what you mean by that.

Besides this, I would also like to know which warm-up task I should do.

If you check the pull request page in spotlight, @dav009 has a PR which he claims to be a mere *idea* but which forces surface forms to be stemmed before storing. Pulling from that branch, recompiling, running Spotlight and looking at some of the results would be a good start. You can also nag us on that issue about ideas you might have after you understand the code.

Thanks, Abhishek Gupta

[1] https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit?usp=sharing
[2] https://aclweb.org/anthology/P/P11/P11-1095.pdf
[3] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
[4] http://wwwusers.di.uniroma1.it/~moro/MoroRaganatoNavigli_TACL2014.pdf
[5] http://www.aclweb.org
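The cosine-similarity-on-term-frequency disambiguation described in this exchange can be sketched as follows. The entity contexts below are tiny invented word bags; in Spotlight they would come from the map-reduce job over Wikipedia link contexts mentioned above:

```python
# Sketch of TF-based disambiguation: compare the words around a spotted
# surface form against each candidate entity's stored context using
# cosine similarity. Contexts here are toy data, not real Spotlight stats.
import math
from collections import Counter

def cosine(c1, c2):
    shared = set(c1) & set(c2)
    dot = sum(c1[t] * c2[t] for t in shared)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

entity_contexts = {
    "Berlin": Counter("capital germany city prussia reich".split()),
    "Berlin,_New_Hampshire": Counter("new hampshire town coos county".split()),
}

def disambiguate(surface_context, candidates):
    ctx = Counter(surface_context)
    return max(candidates, key=lambda e: cosine(ctx, entity_contexts[e]))

text_context = "the capital of germany during the weimar republic".split()
print(disambiguate(text_context, list(entity_contexts)))  # Berlin
```

This also makes the bias discussed earlier concrete: an entity with many inlinks accumulates a much larger context Counter, which tends to share more terms with any input and can overtake the more natural candidate.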
Re: [Dbpedia-gsoc] GSOC 2015 - Introduction
Hi Vasanth, I suggest taking a look at the previous messages in the mailing list archives and checking out the discussion there, so you have a better idea of what to do. Bear in mind that the submission date is really close, so you'd need to look into this asap. All the best, Thiago

On Mon, Mar 23, 2015 at 5:07 PM, Vasanth Kalingeri vasanth.kaling...@gmail.com wrote: Hi, My name is Vasanth Kalingeri. I am a 3rd year undergrad in computer science, pursuing my engineering at SJCE Mysore. I have completed a course on machine learning on Coursera, which led me to an interest in NLP. I have also been freelancing for 2 years. My interest in NLP grew primarily when I wanted a knowledge base built from a given corpus of text, so that it could answer questions on the corpus. This led me to DBpedia and further into topic 5.1. I am extremely interested in building such a system to extract facts from a corpus. Will get working on the warmup tasks soon. Regards, Vasanth

___ Dbpedia-gsoc mailing list Dbpedia-gsoc@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia
Hi Abishek, I suggest you submit a proposal straight to GSoC and we will comment there. If you have done so already, could you send us the link? All the best, Thiago

On Mon, Mar 23, 2015 at 8:54 AM, Abhishek Gupta a.gu...@gmail.com wrote: Hi all, Here are some comments on your response:

Hi Abishek, thanks for the work, here are some answers: On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta a.gu...@gmail.com wrote: Hi Thiago, Sorry for the delay! I have set up the Spotlight server and it is running perfectly fine, albeit with minimal settings. After this setup I played with the Spotlight server, during which I came across some discrepancies, as follows.

Example taken: http://spotlight.dbpedia.org/rest/annotate?text=First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. In 1990 German reunification took place in whole Germany in which the city regained its status as the capital of Germany.

1) If we run this, we annotate 13th Century with http://dbpedia.org/page/19th_century. This might be happening because the context is very much about the 19th century, and moreover between 13th Century and 19th Century there is minimal syntactic difference (one letter). But I am not sure whether this is good or bad.

This might be due to either 13th Century being wrongly linked to 19th century, or maybe the word century being linked to many different centuries, which then causes a disambiguation error due to the context. I think your example is a counter-example to the way we generate the data structures used for disambiguation.

In my opinion, if we have an entity in our store (http://dbpedia.org/page/13th_century) which perfectly matches a surface form in the raw text (13th Century), we should have annotated the SF with that entity.
And the same might be the case with Germany, which is associated with History of Germany http://dbpedia.org/page/History_of_Germany, not Germany http://dbpedia.org/page/Germany.

In this case other factors might have crept in; it could be that Germany has a bigger number of inlinks, or some other metric that allows it to overtake the most natural candidate.

2) We are spotting place and associating it with Portland Place http://dbpedia.org/resource/Portland_Place, maybe due to stemming the SF. And even Location (geography) http://dbpedia.org/page/Location_(geography) is not the correct entity type for this. This is because we are not able to detect the sense of the word place itself. So for that we may have to use word senses, e.g. from WordNet.

The SF spotting pipeline works a bit like this: you get a candidate SF, like 'Portland Place', and see if there's a candidate for that, but you also consider n-gram subparts, so it could have retrieved the candidates associated with place instead.

I understand what you said, but here I wanted to point out that place is not even a noun, and we are trying to associate it with a named entity, which is a noun.

3) We are detecting . Berlin as a surface form. But I couldn't figure out where this SF comes from, and I suspect this SF doesn't come from Wikipedia.

Although . Berlin is highlighted, the entity is matched on Berlin; the extra space and punctuation come from the way we tokenize sentences. We have chosen a language-independent tokenizer using a break iterator, for speed and language independence, but it hasn't been tested very well. This is the area which explains this mistake, and help with it is much appreciated.

Thanks for the clarification.

4) We spotted capital of Germany, but I didn't get any candidates if we run for candidates instead of annotate. This might be due to a default confidence score.

If you pass the extra confidence param and set it to 0, you will probably see everything, e.g.
/candidates/?confidence=0&text= In fact, I suggest you look at all the candidates in the text you used, to confirm (or not) what I've been saying here.

I tried to do that, but I still didn't get any entity candidate for capital of Germany.

5) We are able to spot 1920s as a surface form but not 1920.

This is due to the generation/stemming of SFs we have discussed, but I'm not sure that is a bad example. 1920, if used as a year, might not mean the same as 1920s.

This was my mistake.

Few more questions: 1) Are we trying to annotate every word, noun, or entity (e.g. proper noun) in the raw text? Because in the above link I found documented (a word, not a noun or entity) annotated with http://dbpedia.org/resource/Document .

There are two main spotters: the default one uses a finite state automaton, generated from the surface form store, to match incoming words as a valid sequence of states (so in this sense everything goes through the pipeline); another that uses
Re: [Dbpedia-gsoc] Re-Introduction
Hi Felix, welcome back. We might add a few warm-up tasks soon. Phillip's project might be merged into the dev branch soon and will provide a good basis for future additions. If you are interested in Spotlight, are there any ideas on what aspects of it you'd like to concentrate on?

On Tue, Mar 1, 2016 at 12:36 PM, Marco Fossati wrote:
> Hey Felix, welcome back!
> Marco
> On 3/1/16 02:13, Felix Sonntag wrote:
> > Hi everyone,
> > I'm Felix, I already introduced myself last year, but I guess I'll shortly reintroduce myself. I'm a Master's student in Informatics at TUM in Munich. I'm pretty excited about the DBpedia project: I'm using Spotlight for a project about analyzing artist data at the moment, and I've used DBpedia data for an app last year. I already tried to participate in GSOC with you last year; unfortunately it didn't work out (apparently it was really close :P).
> > I've just finished my first Master's semester, with a focus on ML, Data Analytics and NLP in my studies.
> > There are some project ideas I'm keen on working on, but I'll post directly on the project sites.
> > One general question: for the Spotlight project there exists only a rough idea by Philipp, and there are also no warm-up tasks. Can we expect more from that? :)
> > Best,
> > Felix
___ Dbpedia-gsoc mailing list Dbpedia-gsoc@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc