On 13 April 2013 05:57, Shivani Poddar <[email protected]> wrote:
> On Fri, Apr 12, 2013 at 3:02 PM, Jimmy O'Regan <[email protected]> wrote:
>> If you're interested in information extraction of this kind, the good
>> news is that we have the data from the infoboxes, and that could be
>> used for semi-supervised creation of this kind of extraction template.
>> If your idea was based around something related to this, that could
>> make a great project.
>
> This does seem to cover a major part of my interest. My eventual
> goals (which are research based) would definitely look at the amalgam of the
> 3 concepts Pablo mentioned, but, as of now, for an immediate project, this
> seems very interesting to me. I would like to take it up for the coming
> summer.
> Also, by the creation of a semi-supervised template, would you mean a
> template for (say, only) Hindi? Or would extending it to all languages be
> fine?
>

The example I gave was deliberately basic, and can be achieved by
iterating through a list of properties and checking if the abstract
contains either the string (e.g., name) or a regex match (date). You
could either replace the occurrence with the property name (as in my
example), or surround it with XML-like tags, to be more suitable as
input to something like MinorThird
(http://teamcohen.github.io/MinorThird/).
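
As a rough sketch of that basic approach (the property names, the sample
values, the [property] placeholder format and the date regex are all made
up for illustration; a real version should use the extraction framework's
date handling rather than a hand-written regex), something like this
Python would do:

```python
import re

def tag_abstract(abstract, properties, regex_properties=None, xml_tags=False):
    """Mark occurrences of infobox values in an abstract.

    properties: property name -> literal string value (hypothetical data).
    regex_properties: property name -> regex pattern (e.g. for dates).
    xml_tags: if True, wrap the occurrence in XML-like tags; otherwise
              replace it with a [property] placeholder.
    """
    def render(prop, text):
        return f"<{prop}>{text}</{prop}>" if xml_tags else f"[{prop}]"

    result = abstract
    # Longest values first, so a short value does not split a longer one.
    for prop, value in sorted(properties.items(), key=lambda kv: -len(kv[1])):
        result = result.replace(value, render(prop, value))
    for prop, pattern in (regex_properties or {}).items():
        result = re.sub(pattern, lambda m, p=prop: render(p, m.group(0)), result)
    return result


abstract = "David Beckham (born 2 May 1975 in London) is an English footballer."
properties = {"name": "David Beckham", "birthPlace": "London"}
# Stand-in date pattern, an assumption for this example only.
regex_properties = {"birthDate": r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b"}

print(tag_abstract(abstract, properties, regex_properties))
print(tag_abstract(abstract, properties, regex_properties, xml_tags=True))
```

The first call gives the template-style output, the second gives the
tagged form that would be closer to what a tool like MinorThird expects.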

That's simple enough that, with the caveat that the extraction
framework's date handling should be used (which will involve gaining
some small level of familiarity with that code), it could make a good
coding challenge for this idea.

That would be relatively language independent, for languages with
simple morphology (it might work for Hindi, but would probably not
work for Sanskrit), but would require a language processing pipeline
for more complex languages.

Using the MediaWiki source text instead of the abstract text, e.g.:
'''David Robert Joseph Beckham''' ([[Londres]], [[Ingalaterra]],
[[1975]]eko [[maiatzaren 2]]a) futbolari ingelesa da.
would be more DBpedia-specific and more-or-less language independent[1],
and would simplify the matching of text to occurrences (though it would
possibly make it more difficult to prepare the text as input to
something like MinorThird).
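
A minimal sketch of matching on the source text, assuming the infobox
values are available as page titles (the property names and the
properties dict below are hypothetical, not taken from a real dump):

```python
import re

# Captures the target of [[target]] or [[target|label]].
LINK_TARGET = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

def match_links(wikitext, properties):
    """Return (property, target) pairs for links whose target is an infobox value.

    properties: property name -> page title (hypothetical data).
    This assumes one property per value, which will not always hold.
    """
    by_target = {target: prop for prop, target in properties.items()}
    matches = []
    for m in LINK_TARGET.finditer(wikitext):
        target = m.group(1).strip()
        if target in by_target:
            matches.append((by_target[target], target))
    return matches


source = ("'''David Robert Joseph Beckham''' ([[Londres]], [[Ingalaterra]], "
          "[[1975]]eko [[maiatzaren 2]]a) futbolari ingelesa da.")
properties = {"birthPlace": "Londres", "birthYear": "1975"}
print(match_links(source, properties))
```

The point is that the comparison is against link targets, so the
inflected text around the link ("eko", "a") never has to be analysed.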

In any case, the thing to bear in mind is that the same value may
appear with a number of attributes - Dublin is the largest city in
Ireland, as well as its capital; many kings were sons of their
predecessors, etc. - or even independently of them (i.e., the value may
appear in a sentence in a way that has nothing to do with any of the
possible attributes).
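
To illustrate, here is a small sketch that keeps every candidate
attribute for a matched value instead of picking one (the property names
and the sample sentence are invented for the example). Note that it
cannot tell that the second occurrence is unrelated to either attribute;
that decision would have to be left to the learner or to a later filter:

```python
from collections import defaultdict

def candidate_properties(properties):
    """Group property names by value, so shared values stay ambiguous."""
    by_value = defaultdict(list)
    for prop, value in properties.items():
        by_value[value].append(prop)
    return dict(by_value)

def label_occurrences(text, properties):
    """Return (offset, value, candidate properties) for each occurrence."""
    hits = []
    for value, props in candidate_properties(properties).items():
        start = text.find(value)
        while start != -1:
            hits.append((start, value, sorted(props)))
            start = text.find(value, start + len(value))
    return sorted(hits)


# Hypothetical infobox data for Ireland.
properties = {"capital": "Dublin", "largestCity": "Dublin"}
text = "Dublin is the capital of Ireland. Dublin Core is a metadata standard."
for hit in label_occurrences(text, properties):
    print(hit)
```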

[1] The variation in the display text ([[dog|dogs]] or [[dog]]s) may
need to be handled to generalise the templates better, particularly
for morphologically complex languages, but I haven't given it too much
thought.
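
One possible way to handle that variation, as a sketch only: treat any
letters glued to the closing brackets as part of the display text, so
both forms reduce to the same (target, display) pair. The "letters only"
rule here is a simplifying assumption; as far as I know the actual
trailing-text rule in MediaWiki varies by language.

```python
import re

# [[target]] or [[target|label]], plus any letters directly after "]]".
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]([^\W\d_]*)")

def links(wikitext):
    """Return (target, display text) pairs for each wiki link."""
    pairs = []
    for target, label, trail in LINK.findall(wikitext):
        # With no explicit label, the target itself is displayed.
        display = (label or target) + trail
        pairs.append((target.strip(), display))
    return pairs


print(links("[[dog|dogs]] and [[dog]]s"))
print(links("[[1975]]eko [[maiatzaren 2]]a"))
```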


-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
