Hi Cristian,

The approach sounds nice. I don't want to play devil's advocate, but I'm not sure about the recall when using the dbpedia categories feature. For example, your sentence could also be "Microsoft posted its 2013 earnings. The Redmond company made a huge profit". So including more contextual information from dbpedia might increase the recall, but it would of course reduce the precision.

Cheers,
Rafa

On 04/02/14 09:50, Cristian Petroaca wrote:
Back with a more detailed description of the steps for making this kind of
coreference work.

I will be referring to the following text in the steps below in order to
make things clearer: "Microsoft posted its 2013 earnings. The software
company made a huge profit."

1. Select every noun phrase in the text which:
     a. starts with a determiner (by POS tag) that implies a reference to an
entity local to the text, such as "the", "this" or "these", but not
"another", "every", etc., which imply a reference to an entity outside of
the text.
     b. has at least one other noun, aside from the main required noun, which
further describes it. For example, I will not count "The company" as a
legitimate candidate, since this could create a lot of false positives
given the double meaning of some words, as in "in the company of good
people". "The software company" is a good candidate since we also have
"software".
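A minimal sketch of the step 1 filter, assuming the noun phrase has already
been tokenized and POS tagged with Penn Treebank style tags ("DT" for
determiners, "NN*" for nouns). The function name and the determiner set are
hypothetical; the set only contains the three determiners named above.

```python
# Determiners that point to an entity local to the text (from step 1a).
LOCAL_DETERMINERS = {"the", "this", "these"}


def is_candidate(phrase):
    """Return True if a noun phrase passes the step 1 filter.

    phrase: list of (token, pos_tag) pairs, Penn Treebank style tags
    (an assumption; any tag set with distinct noun/determiner tags works).
    """
    if not phrase:
        return False

    # 1a. The phrase must start with a "local" determiner.
    first_token, first_tag = phrase[0]
    if first_tag != "DT" or first_token.lower() not in LOCAL_DETERMINERS:
        return False

    # 1b. The phrase needs the main noun plus at least one more noun.
    nouns = [token for token, tag in phrase if tag.startswith("NN")]
    return len(nouns) >= 2
```

With this filter, "The software company" is kept, while "The company" and
"Another software company" are rejected.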

2. Match the nouns in the noun phrase to the contents of the dbpedia
categories of each named entity found prior to the location of the noun
phrase in the text.
The dbpedia categories are in the following format (for Microsoft, for
example): "Software companies of the United States".
So we try to match "software company" with that.
First, as you can see, the main noun in the dbpedia category is in the
plural form, and the same holds for all the categories I saw. I don't know
if there's an easier way to do this, but I thought of applying a lemmatizer
to the category and the noun phrase so that they have a common
denominator. This also works if the noun phrase itself is in the plural
form.

Second, I'll need to use for comparison only the words in the category
which are themselves nouns, and not prepositions or determiners such as "of
the". This means that I need to POS tag the category contents as well.
I was thinking of running the POS tagger and the lemmatizer on the dbpedia
categories when building the dbpedia-backed entity hub and storing the
results for later use. I don't know how feasible this is at the moment.

After this I can compare each noun in the noun phrase with the equivalent
nouns in the categories and based on the number of matches I can create a
confidence level.
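The matching in step 2 could be sketched as follows. This is only an
illustration under loud assumptions: naive_lemma is a crude plural stripper
standing in for a real lemmatizer, and the STOPWORDS set stands in for
proper POS tagging of the category (so non-noun words like "United" are
not filtered out here). All names are hypothetical.

```python
# Stand-in for POS filtering: drop the function words seen in categories.
STOPWORDS = {"of", "the", "in", "and", "by", "for"}


def naive_lemma(word):
    """Very rough singularization; a real lemmatizer should replace this."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # companies -> company
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # states -> state
    return word


def category_terms(category):
    """Lemmatized content words of one dbpedia category label."""
    return {naive_lemma(w) for w in category.split()
            if w.lower() not in STOPWORDS}


def category_confidence(phrase_nouns, categories):
    """Best fraction of phrase nouns found in any single category."""
    nouns = {naive_lemma(n) for n in phrase_nouns}
    if not nouns:
        return 0.0
    best = 0.0
    for category in categories:
        overlap = nouns & category_terms(category)
        best = max(best, len(overlap) / len(nouns))
    return best
```

For example, the nouns of "The software company" against "Software
companies of the United States" give a full match, while the nouns
"Redmond" and "company" match only on "company".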

3. Match the main noun of the noun phrase with the rdf:type from dbpedia of
the named entity. If it matches, increase the confidence level.

4. If there are multiple named entities which can match a certain noun
phrase, then link the noun phrase with the closest named entity prior to it
in the text.
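Steps 3 and 4 could be combined roughly like this. It is a sketch, not a
proposal for the final scoring: the Mention class, the TYPE_BONUS value and
the threshold are all made-up assumptions, rdf:type values are assumed to
be reduced to lower-case local names, and category_score is assumed to
have been computed beforehand by the step 2 matching.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    """A named entity found in the text (hypothetical structure)."""
    name: str
    offset: int               # character offset of the mention in the text
    rdf_types: frozenset      # e.g. frozenset({"company"}), lower-cased
    category_score: float     # result of the step 2 category matching


TYPE_BONUS = 0.5  # arbitrary boost for an rdf:type match (step 3)


def resolve(head_noun, phrase_offset, mentions, threshold=0.5):
    """Return the mention a noun phrase should be linked to, or None."""
    matching = []
    for mention in mentions:
        # Only named entities located before the noun phrase qualify.
        if mention.offset >= phrase_offset:
            continue
        score = mention.category_score
        # Step 3: boost the confidence if the main noun is an rdf:type.
        if head_noun.lower() in mention.rdf_types:
            score += TYPE_BONUS
        if score >= threshold:
            matching.append(mention)
    # Step 4: among all matching entities, pick the closest prior one.
    return max(matching, key=lambda m: m.offset, default=None)
```

Note that this picks the closest entity among all those above the
threshold, as step 4 describes, rather than the one with the highest
score.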

What do you think?

Cristian

2014-01-31 Cristian Petroaca <cristian.petro...@gmail.com>:

Hi Rafa,

I don't yet have a concrete heuristic but I'm working on it. I'll provide
it here so that you guys can give me feedback on it.

What are "locality" features?

I looked at BART and other coref tools such as ArkRef and CherryPicker and
they don't provide this kind of coreference.

Cristian


2014-01-30 Rafa Haro <rh...@apache.org>:

Hi Cristian,
Without having more details about your concrete heuristic, in my honest
opinion, such an approach could produce a lot of false positives. I don't
know if you are planning to use some "locality" features to detect such
coreferences, but you need to take into account that it is quite usual for
coreferent mentions to occur even in different paragraphs. Although I'm
not an expert in Natural Language Understanding, I would say it is quite
difficult to get decent precision/recall rates for coreference resolution
using fixed rules. Maybe you can give other tools a try, like BART
(http://www.bart-coref.org/).

Cheers,
Rafa Haro

El 30/01/14 10:33, Cristian Petroaca escribió:

Hi,
One of the necessary steps for implementing the Event extraction Engine
feature (https://issues.apache.org/jira/browse/STANBOL-1121) is to have
coreference resolution in the given text. This is provided now via the
stanford-nlp project, but as far as I saw this module performs mostly
pronominal (He, She) or nominal (Barack Obama and Mr. Obama) coreference
resolution.

In order to get more coreferences from the text I thought of creating some
logic that would detect this kind of coreference:
"Apple reaches new profit heights. The software company just announced its
2013 earnings."
Here "The software company" obviously refers to "Apple".
So I'd like to detect noun phrases that corefer with a Named Entity, where
the main noun of the phrase matches the rdf:type of the Named Entity (in
this case "company") and the phrase also has attributes which can be found
in the dbpedia categories of the Named Entity (in this case "software").

The detection of coreferences such as "The software company" in the text
would also be done either by using the new Pos Tag Based Phrase extraction
Engine (noun phrases) or by using a dependency tree of the sentence and
picking up only subjects or objects.

At this point I'd like to know if this kind of logic would be useful as a
separate Enhancement Engine (in case the precision and recall are good
enough) in Stanbol?

Thanks,
Cristian


