Hello,
I am also interested in working on the project about fact extraction from
Wikipedia text, and I would like to ask for some clarifications about the
machine learning part of it. The core of the project is to train a
classifier using a training set built following the approaches described in
the linked papers. As I understand it, the following tasks are needed,
given a sentence:

 1a. Identify all the LUs using NLP techniques;
 1b. Identify all the entities in the sentence which may represent FEs,
again using NLP techniques (ASRL perhaps?);
 2. Use the FrameNet definition for the identified LUs to find the required
FEs;
 3. Ask the user whether a certain entity fits a certain FE (for all
entities and FEs);
 4. Understand which is the correct LU based on the meanings given in step
(3).
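
To make steps (2) and (3) concrete, here is a minimal sketch of how I picture
the questions being generated. Everything in it is an assumption made up for
illustration: the frame names, the FE lists, the LU lexicon and the function
name candidate_questions are toy placeholders, not real FrameNet data or an
existing API.

```python
from itertools import product

# Hypothetical toy frame definitions: frame name -> list of FE names.
# Illustrative placeholders only, not taken from FrameNet.
FRAME_FES = {
    "Frame_A": ["Agent", "Theme"],
    "Frame_B": ["Agent", "Place", "Time"],
}

# Hypothetical LU lexicon: lemma -> frames that the LU may evoke.
LU_FRAMES = {"play": ["Frame_A", "Frame_B"]}


def candidate_questions(lus, entities):
    """Yield one (lu, frame, fe, entity) tuple per question to ask in step (3)."""
    for lu in lus:
        for frame in LU_FRAMES.get(lu, []):
            # Step (2): look up the FEs of the frame.
            # Step (3): pair every FE with every candidate entity.
            for fe, entity in product(FRAME_FES[frame], entities):
                yield (lu, frame, fe, entity)


questions = list(candidate_questions(["play"], ["Alice", "Rome"]))
for question in questions:
    print(question)
```

Each tuple corresponds to one yes/no question, so the number of questions
grows with (number of FEs) x (number of entities) for every candidate frame.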

In the linked papers little is said about steps (1a) and (1b) (but
clarification has already been asked for). Step (2) is straightforward and
step (4) has already been implemented, so the classifier is needed for step
(3). Thus, it has to answer questions such as "can this entity be this
FE?" or "is this entity this FE in this context?" (the latter being a lot
harder in my opinion). It is not clear to me, though, which features should
be used to train this classifier.

Frequently, in text classification, there is a one-to-one mapping between
words and features; in this case FEs have to be used instead of words
(FrameNet currently recognizes slightly more than 10k FEs). There is also a
need for features identifying the possible entities, but clearly we cannot
use the whole DBpedia knowledge base (roughly 4.6 million entities) for
this. I see that FEs belonging to a frame are usually of different types,
so I think using *classes* instead of *instances* could be a promising
alternative (DBpedia has 685 classes). Probably other features are needed
though.
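
As a rough illustration of the class-based idea, the sketch below encodes a
single "can this entity be this FE?" question as a binary feature vector: one
block of one-hot features for the FE and one block for the classes of the
entity. The vocabularies are tiny stand-ins that I made up (the real ones
would have roughly 10k FEs and 685 classes), and the function name encode is
hypothetical.

```python
# Toy vocabularies standing in for the FrameNet FEs and the DBpedia classes.
# These short lists are assumptions for illustration only.
FE_VOCAB = ["Agent", "Theme", "Place", "Time"]
CLASS_VOCAB = ["Person", "Place", "Organisation"]

# FE features occupy the first positions, class features follow them.
FE_INDEX = {fe: i for i, fe in enumerate(FE_VOCAB)}
CLASS_INDEX = {c: len(FE_VOCAB) + i for i, c in enumerate(CLASS_VOCAB)}


def encode(fe, entity_classes):
    """Return a 0/1 vector for the pair (FE, classes of the candidate entity)."""
    vec = [0] * (len(FE_VOCAB) + len(CLASS_VOCAB))
    vec[FE_INDEX[fe]] = 1
    for cls in entity_classes:
        if cls in CLASS_INDEX:  # classes outside the vocabulary are ignored
            vec[CLASS_INDEX[cls]] = 1
    return vec


# The FE "Place" paired with an entity whose class is "Place".
x = encode("Place", ["Place"])
print(x)
```

With the real vocabularies the vector would have about 10k + 685 dimensions
instead of millions, which is the main reason I find classes attractive.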

Sorry for the long wall of text; I tried to express my thoughts as
briefly as I could. What do you think?

Emilio.
_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
