Do you know how to filter tags like this one: {{date|November 13, 2004}} ?

The current implementation just turns it into {{date}}, but such tags must either
be replaced by the date or just be removed.

I have the same issue for byline, etc.
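One possible way to handle such tags is sketched below in Python. It is only an illustration, not the existing parser code: the regular expression covers simple, non-nested templates only, and the function name filter_templates and the KEEP_ARGUMENT policy (keep the argument of {{date|...}}, drop every other template, including byline) are assumptions.

```python
import re

# Matches a simple, non-nested template such as {{date|November 13, 2004}}.
# Group 1 is the template name, group 2 the optional argument string.
TEMPLATE = re.compile(r"\{\{([^{}|]+)(?:\|([^{}]*))?\}\}")

# Assumed policy: templates whose argument is kept as plain text.
# Everything else (publish, byline, ...) is removed entirely.
KEEP_ARGUMENT = {"date"}


def filter_templates(text):
    """Replace {{date|X}} by X and remove all other simple templates."""
    def replace(match):
        name = match.group(1).strip().lower()
        args = match.group(2)
        if name in KEEP_ARGUMENT and args:
            return args.strip()
        return ""
    return TEMPLATE.sub(replace, text)


if __name__ == "__main__":
    print(filter_templates("{{date|November 13, 2004}} Some text."))
```

Nested templates would need a real parser or repeated passes; the sketch leaves them untouched.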

The wikinews dump also contains pages which are not news articles.
I now filter them based on the presence of the {{publish}} tag, and then
cut the article where the text ends, based on the presence of headings
or tags.
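The filtering described above could look roughly like the following Python sketch. The exact cut-off rule is an assumption: here the text is cut at the first wiki heading (for example "== Sources ==") or at the {{publish}} tag, whichever comes first. The function name extract_article is hypothetical.

```python
import re

# The tag that marks a page as a published news article.
PUBLISH = re.compile(r"\{\{\s*publish\s*\}\}", re.IGNORECASE)

# A wiki heading line such as "== Sources ==".
HEADING = re.compile(r"^==+[^=\n].*?==+\s*$", re.MULTILINE)


def extract_article(page):
    """Return the article text, or None if the page is not a news article."""
    publish = PUBLISH.search(page)
    if publish is None:
        return None  # no {{publish}} tag: not a news article
    end = publish.start()
    heading = HEADING.search(page)
    if heading is not None:
        # Assumed rule: the article text ends at the first heading.
        end = min(end, heading.start())
    return page[:end].strip()


if __name__ == "__main__":
    page = "Some news text.\n\n== Sources ==\n* a source\n\n{{publish}}\n"
    print(extract_article(page))
```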

I changed the link handling a bit, because many links seem to be inter-wiki links
which the current implementation filters out.

Jörn

On 7/4/11 7:57 PM, Jörn Kottmann wrote:
On 7/4/11 7:41 PM, Olivier Grisel wrote:
2011/7/4 Jörn Kottmann<[email protected]>:
On 7/4/11 7:20 PM, Olivier Grisel wrote:
Keeping the correct link position from the original markup while
cleaning it can be tricky though. Be careful when tweaking the parser.
Maybe the Span helper classes from OpenNLP could help make this code
more robust.
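To illustrate the point about link positions, here is a minimal Python sketch that strips [[target|anchor]] markup and records, for every link, its character offsets in the cleaned text. It is not the pignlproc parser and does not use the OpenNLP Span class; the function name strip_links and the (start, end, target) tuples are assumptions that only mimic the idea of a span.

```python
import re

# Matches [[target]] or [[target|anchor text]] (no nesting).
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def strip_links(markup):
    """Return (plain_text, spans) where each span is (start, end, target)
    and the offsets refer to plain_text, not to the original markup."""
    parts = []
    spans = []
    length = 0  # number of characters emitted so far
    last = 0    # position in the markup after the previous link
    for match in LINK.finditer(markup):
        before = markup[last:match.start()]
        parts.append(before)
        length += len(before)
        target = match.group(1).strip()
        # Use the anchor text if given, otherwise the target itself.
        anchor = match.group(2) if match.group(2) else match.group(1)
        parts.append(anchor)
        spans.append((length, length + len(anchor), target))
        length += len(anchor)
        last = match.end()
    parts.append(markup[last:])
    return "".join(parts), spans


if __name__ == "__main__":
    text, spans = strip_links("[[Barack Obama|Obama]] visited [[Paris]].")
    print(text)
    print(spans)
```

Computing the offsets while the cleaned text is being built avoids having to re-align the links with the text afterwards.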
I wonder how important the links are here, because we do not want to throw
away sentences which do not have links covering their entities.

But I believe the links might be very interesting for entity identification. If, let's say,
a person name is labeled and also covered by a link, the link
can be used to identify the person mention.
Yes this is exactly what pignlproc is doing. Building a NameFinder
training corpus automatically from the link position info from the
wikipedia articles and the entity typing info from the DBpedia dumps
(this article is a person, this one is an organization, ...).
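As a rough illustration of this idea, the Python sketch below turns link spans plus an entity type lookup into a line in the OpenNLP name finder training format (<START:type> ... <END>). The function name to_name_sample, the (start, end, target) span tuples and the entity_types dictionary are assumptions; in pignlproc the typing information comes from the DBpedia dumps, and the real code would tokenize properly instead of relying on whitespace.

```python
def to_name_sample(text, spans, entity_types):
    """Build one name finder training line.

    text         -- cleaned sentence text
    spans        -- list of (start, end, target) link spans over text
    entity_types -- maps a link target to a type such as "person";
                    links with an unknown target are left unannotated
    """
    out = []
    last = 0
    for start, end, target in sorted(spans):
        entity_type = entity_types.get(target)
        if entity_type is None:
            continue  # no type known: keep the text, emit no annotation
        out.append(text[last:start])
        out.append(" <START:%s> %s <END> " % (entity_type, text[start:end]))
        last = end
    out.append(text[last:])
    # Normalize whitespace so that the tags are separate tokens.
    return " ".join("".join(out).split())


if __name__ == "__main__":
    spans = [(0, 5, "Barack Obama"), (14, 19, "Paris")]
    types = {"Barack Obama": "person", "Paris": "location"}
    print(to_name_sample("Obama visited Paris.", spans, types))
```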

And after we have a few manually labeled articles we can use the links to
generate special features which are passed to the name finder.

So in the end, do we just generate an annotation for every link?!
This is very important for building a pre-annotated corpus to bootstrap and
train a first version of the OpenNLP models automatically. This model can
then be used to annotate new, unannotated text, and human
refinement can then produce gold annotations rapidly by
mostly validating / fixing existing annotations rather than annotating
text from scratch.

Links can also be useful for building an NE disambiguation training corpus.


The automatic labeling can be supported by features generated from the link
annotations. I guess the name finder will perform much better this way, but
evaluation will show that.

Jörn

