Re: [Dbpedia-discussion] [Dbp-spotlight-developers] GSoC 2013 DBpedia + Spotlight joint proposal (please contribute within the next days)

Jimmy O'Regan Wed, 27 Mar 2013 09:49:54 -0700

On 27 March 2013 15:51, Jona Christopher Sahnwaldt <[email protected]> wrote:
> Hi Jimmy,
>
> thanks for your tips! I added/extended two ideas yesterday. I ended up
> at six to eight paragraphs with 400 to 500 words. Do you think that's
> too long? The 2012 ideas I looked at were shorter.


What I intended to say didn't come out quite as I meant :) -- more
information is, of course, better, but it shouldn't be a requirement.

At Apertium, the model we've settled on, over the last few years of
trial and error, is this:
http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code - a
brief description, a rationale (why this is necessary), and a link to
a page that describes the problem in more depth. Last year, we didn't
have a page for each project, but one of the other guys had some spare
time this year.

As an example (this is more or less the project the student Sebastian
co-mentored with me was supposed to be working on):

Idea: Wrapper induction for Wiktionary

Description: Given an example of the data to be extracted, and the
source text to extract from, generate a template for use with the
Wiktionary module that is capable of extracting that data from the
source text.

Rationale: The various language editions of Wiktionary contain several
templates and layout conventions, often multiple templates per
language, which makes writing extraction templates impractical.

The corresponding page could then have:

* The Python library scrapely features wrapper (template) induction
for HTML; this could be adapted to MediaWiki syntax.
* Grazer[1] uses existing knowledge to determine how to extract. There
are many existing resources (morphological dictionaries, pronunciation
dictionaries, WordNets, etc.) that provide some of the types of data
that could be used for this purpose. Conversely, information extracted
from Wiktionary could be used to (semi-)automatically generate RDF
converters for such resources.
* It may be desirable to expand nested templates. Many templates,
e.g., the Turkish inflection templates on en.wiktionary, are specified
in terms of other templates. These are often difficult to extract from
in themselves, while their parent simply generates a table. (Sweble is
reputed to be able to handle nested templates.)
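
To make the "example in, template out" idea concrete, here is a minimal
sketch in Python. It is not scrapely's API and not the Wiktionary
module's template format; the function names, the context width, and
the sample wikitext are all assumptions made up for illustration. It
learns a regular expression from one (source text, example value) pair
by keeping the literal text on either side of the example as delimiters.

```python
import re


def induce_wrapper(source, example, context=12):
    """Learn a regex 'wrapper' from one (source, example) pair.

    Hypothetical helper: keeps up to `context` characters on each side
    of the example as literal delimiters around a capture group.
    """
    start = source.find(example)
    if start == -1:
        raise ValueError("example not found in source text")
    end = start + len(example)
    left = source[max(0, start - context):start]
    right = source[end:end + context]
    # Match lazily between the escaped left and right delimiters.
    return re.compile(re.escape(left) + r"(.+?)" + re.escape(right))


def extract(wrapper, text):
    """Apply an induced wrapper to new text, returning all captures."""
    return [m.group(1) for m in wrapper.finditer(text)]


# Made-up wikitext, for illustration only.
wrapper = induce_wrapper("{{IPA|/kæt/|lang=en}}", "/kæt/")
print(extract(wrapper, "* {{IPA|/dɒɡ/|lang=en}}"))  # ['/dɒɡ/']
```

A real implementation would need to generalise over several examples
and cope with nested templates; a single literal context like this
breaks as soon as the surrounding markup varies.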

[1] Zhao, Shubin, and Jonathan Betz. "Corroborate and learn facts from
the web." Proceedings of the 13th ACM SIGKDD international conference
on Knowledge discovery and data mining. ACM, 2007.


-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
