2011/1/19 Jörn Kottmann <[email protected]>:
> A while back I started thinking about if wikinews could be
> used as a training source as part of a community annotation
> project over at OpenNLP. I guess your experience and your code
> would be really helpful to transform that data into a format
> we could use for such a project. Over time we would pull in the
> new articles to keep up with new topics.

+1

Using wikinews instead of wikipedia should require very little (or
even no) change to the existing sample scripts.

> In that annotation project we could introduce the concept of
> "atomic" annotations. These are annotations which are only considered
> correct in a part of the article. Some named entity annotations could
> maybe be created directly from the wiki markup with an approach similar
> to the one you used, and more could be produced by the community.
> I guess it is possible to give these partially available named entities
> to our name finder to automatically label the rest of the article with
> a higher precision than usual.

It's worth a try but needs careful manual validation and evaluation of
the quality.
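
To make the idea a bit more concrete, here is a minimal sketch (plain
Java, hypothetical class names, and assuming the entities show up as
simple [[Target]] or [[Target|anchor]] links in the wikitext) of how
such partial annotations could be pulled out of the markup before
anyone validates them manually:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Minimal sketch: pull [[...]] links out of raw wikitext as candidate
  // "atomic" name annotations. A real extractor would need a proper
  // parser (templates, nested links, namespaces, ...).
  public class WikiLinkSpans {

      // Matches [[Target]] and [[Target|anchor text]]
      private static final Pattern LINK =
              Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]+))?\\]\\]");

      public static void main(String[] args) {
          String wikitext =
                  "[[Barack Obama]] met [[Angela Merkel|Merkel]] in Berlin.";
          Matcher m = LINK.matcher(wikitext);
          while (m.find()) {
              String target = m.group(1).trim();
              String surface = m.group(2) != null ? m.group(2).trim() : target;
              // The link target could then be checked against per-category
              // gazetteers to decide if the span is a person, place, ...
              System.out.println(surface + " -> " + target
                      + " [" + m.start() + ", " + m.end() + ")");
          }
      }
  }

That only covers the spans the markup happens to annotate, which is
exactly why the manual validation / completion step matters.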

> After we have manually labeled a few hundred articles with entities we
> could even go a step further and try to create new features for the
> name finder which take the wiki markup into account (such a name finder
> could also help your project to process the whole wikipedia).

Yes, it would be great to add new gazetteer features (names and
alternative spellings for famous entities such as persons, places,
organizations and so on), maybe in a compressed form using Bloom
filters:

  http://en.wikipedia.org/wiki/Bloom_filter

AFAIK there are already existing implementations of Bloom filters in
Lucene, Hadoop and Cassandra.
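
Just to show what I mean, a self-contained sketch of such a compressed
gazetteer lookup (hypothetical class and parameter names, plain Java,
nothing OpenNLP- or Lucene-specific):

  import java.util.BitSet;

  // Minimal Bloom-filter-backed gazetteer: entity names go in, lookups
  // may return false positives but never false negatives.
  public class BloomGazetteer {

      private final BitSet bits;
      private final int size;
      private final int numHashes;

      public BloomGazetteer(int size, int numHashes) {
          this.bits = new BitSet(size);
          this.size = size;
          this.numHashes = numHashes;
      }

      // Two base hashes combined with the usual h1 + i * h2 trick.
      private int index(String entry, int i) {
          int h1 = entry.hashCode();
          int h2 = 0;
          for (int j = entry.length() - 1; j >= 0; j--) {
              h2 = 31 * h2 + entry.charAt(j) + 17;
          }
          int idx = (h1 + i * h2) % size;
          return idx < 0 ? idx + size : idx;
      }

      public void add(String entry) {
          for (int i = 0; i < numHashes; i++) {
              bits.set(index(entry, i));
          }
      }

      public boolean mightContain(String entry) {
          for (int i = 0; i < numHashes; i++) {
              if (!bits.get(index(entry, i))) {
                  return false;
              }
          }
          return true;
      }

      public static void main(String[] args) {
          BloomGazetteer persons = new BloomGazetteer(1 << 20, 4);
          persons.add("Barack Obama");
          persons.add("Angela Merkel");
          System.out.println(persons.mightContain("Barack Obama")); // true
          System.out.println(persons.mightContain("John Doe"));     // almost surely false
      }
  }

The nice property is that a list with millions of names fits in a few
megabytes, at the cost of a small, tunable false positive rate, which
should be acceptable for a soft gazetteer feature.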

As for building NameFinders that take the wikimarkup into account, I am
not sure how it could help. Better to get rid of it as soon as possible
IMHO :)

> If we start something like that it might only be useful for the
> tokenizer, sentence detector and name finder in the short term. Maybe
> over time it is even possible to add annotations for all the components
> we have in OpenNLP into this corpus.
>
> What do others think?

+1 overall

We also need user-friendly tooling to quickly review / validate / fix an
annotated corpus (rather than using vim or emacs).

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
