On 13 April 2013 05:57, Shivani Poddar <[email protected]> wrote:
> On Fri, Apr 12, 2013 at 3:02 PM, Jimmy O'Regan <[email protected]> wrote:
>> If you're interested in information extraction of this kind, the good
>> news is that we have the data from the infoboxes, and that could be
>> used for semi-supervised creation of this kind of extraction template.
>> If your idea was based around something related to this, that could
>> make a great project.
>
> This does seem to cover a major part of my interest. Although my eventual
> goals (which are research based) would definitely look at the amalgam of the
> 3 concepts Pablo mentioned, but, as of now, for an immediate project, this
> seems very interesting to me. I would like to take it up for the coming
> summer.
> Also, by the creation for semi supervised template, would you mean a
> template for (say only) Hindi? Or would extending it for all languages be
> fine ?
>
The example I gave was deliberately basic, and can be achieved by iterating through a list of properties and checking if the abstract contains either the string (e.g., name) or a regex match (date). You could either replace the occurrence with the property name (as in my example), or surround it with XML-like tags, to be more suitable as input to something like MinorThird (http://teamcohen.github.io/MinorThird/).

That's simple enough that it could make a good coding challenge for this idea, with the caveat that the extraction framework's date handling should be used (which will involve gaining some small level of familiarity with that code).

That would be relatively language independent for languages with simple morphology (it might work for Hindi, but would probably not work for Sanskrit), but would require a language processing pipeline for more complex languages.

To be more DBpedia-specific, using the MediaWiki source text instead of the abstract text, e.g.:

'''David Robert Joseph Beckham''' ([[Londres]], [[Ingalaterra]], [[1975]]eko [[maiatzaren 2]]a) futbolari ingelesa da.

would make it more-or-less language independent[1], and would simplify the matching of text to occurrences (though it would possibly make it more difficult to prepare the text as input to something like MinorThird).

In any case, the thing to bear in mind is that the same value may appear with a number of attributes - Dublin is the largest city in Ireland, as well as its capital; many kings were sons of their predecessors, etc. - or even independent of them (i.e., the value may appear in a sentence in a way that has nothing to do with any of the possible attributes).

[1] The variation in the display text ([[dog|dogs]] or [[dog]]s) may need to be handled to generalise the templates better, particularly for complex languages, but I haven't given it too much thought.

-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
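
For illustration, the basic approach described in the message (iterate through the properties, match each value as a string or a date regex, then replace it with the property name or wrap it in XML-like tags) could be sketched roughly as below. Everything named here is an assumption made for the sketch: the function make_template, the [property] placeholder format, the property names, the sample abstract, and the DATE_PATTERN regex, which only stands in for the extraction framework's date handling and is not that code.

```python
import re

# Stand-in for the extraction framework's date handling (an assumption):
# matches English dates such as "2 May 1975".
DATE_PATTERN = (
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December) \d{4}\b"
)


def make_template(abstract, properties, date_property="birth_date", tagged=False):
    """Mark occurrences of property values in an abstract.

    With tagged=False each occurrence is replaced by "[property]";
    with tagged=True it is wrapped in XML-like tags instead.
    """
    # Map value -> property. If several properties share a value (the
    # "Dublin is both capital and largest city" case), only the last one
    # is kept here; this sketch does not resolve that ambiguity.
    by_value = {value: prop for prop, value in properties.items() if value}

    # Try longer literals first so a value contained in another one does
    # not win the match; a single pass avoids re-matching inserted markers.
    literals = sorted(by_value, key=len, reverse=True)
    pattern = re.compile("|".join([DATE_PATTERN] + [re.escape(v) for v in literals]))

    def mark(match):
        text = match.group(0)
        # Anything that is not a known literal value matched the date regex.
        prop = by_value.get(text, date_property)
        return f"<{prop}>{text}</{prop}>" if tagged else f"[{prop}]"

    return pattern.sub(mark, abstract)


# Hypothetical infobox values and abstract, for illustration only.
properties = {"name": "David Robert Joseph Beckham", "birth_place": "London"}
abstract = "David Robert Joseph Beckham (born 2 May 1975 in London) is an English footballer."

print(make_template(abstract, properties))
print(make_template(abstract, properties, tagged=True))
```

The tagged form is the one intended as input to a tool such as MinorThird; whether these exact tags suit it is not something the sketch checks. A value that appears in the text for reasons unrelated to any property would still be marked, which is the caveat raised in the message.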
