Re: Corpus refinement tools (WAS: Board report status items)

Jörn Kottmann Wed, 22 Jun 2011 02:09:36 -0700

On 6/22/11 10:53 AM, Olivier Grisel wrote:

2011/6/22 Jörn Kottmann<[email protected]>:

On 6/22/11 10:27 AM, Olivier Grisel wrote:

Currently there is only a basic command line interface. I plan to work
on a Swing version too and Hannes started to work on an HTML /
JavaScript frontend.

Did you have a look at the Cas Editor? It might already have many of
the features you need.

I did some time ago, but AFAIK it does not focus on line-by-line,
keyboard-based "binary" (true / false) validation and fast
keyboard-based scanning of a large corpus.

No, it does not, but it can be extended to do this. You would add a new
view, which could show a list of the names you want to validate.

My use case is to bootstrap an OpenNLP model from the output of a rough,
incomplete extraction of Wikipedia with pignlproc as explained in [1],
use such a (roughly) trained NameFinder model on new Wikipedia text, and
then use the refiner to validate the output of the NameFinder model so
as to be able to iterate later (potentially with a more active
learning strategy as in Dualist).


I will have another look at Cas Editor before embarking on the Swing
version though.

In your case you would need to cut your large amount of input text into
meaningful pieces a user can work on. For Wikipedia, the article level
may make sense. The user would see the entire article, and you would
collect the input with the new validation view. It would be easy to add
good keyboard support to this view.

The Cas Editor is CAS based, but in your case it would be connected to
a server which can provide the next CAS to label and can collect the
labeled CASes.
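
As a rough illustration of that server hand-off, here is a minimal
in-memory sketch in Java. It does not use the Cas Editor or UIMA API:
the class and method names (LabelingQueue, Task, Labeled, next, submit)
are hypothetical, and a plain article record stands in for a CAS. A real
setup would serialize CASes and talk to a remote server instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Queue;

/** Hypothetical in-memory stand-in for the labeling server. */
public class LabelingQueue {

    /** One unit of work: an article plus the names proposed for validation. */
    public static final class Task {
        public final String articleId;
        public final String text;
        public final List<String> candidateNames;

        public Task(String articleId, String text, List<String> candidateNames) {
            this.articleId = articleId;
            this.text = text;
            this.candidateNames = candidateNames;
        }
    }

    /** The user's true / false decision for each candidate name of one task. */
    public static final class Labeled {
        public final String articleId;
        public final List<Boolean> accepted;

        public Labeled(String articleId, List<Boolean> accepted) {
            this.articleId = articleId;
            this.accepted = accepted;
        }
    }

    private final Queue<Task> pending = new ArrayDeque<>();
    private final List<Labeled> collected = new ArrayList<>();

    /** Queue an article-sized piece of the corpus for labeling. */
    public synchronized void add(Task task) {
        pending.add(task);
    }

    /** Hand out the next piece to label, or empty if nothing is left. */
    public synchronized Optional<Task> next() {
        return Optional.ofNullable(pending.poll());
    }

    /** Store the decisions the user made for one piece. */
    public synchronized void submit(Labeled labeled) {
        collected.add(labeled);
    }

    /** Read-only snapshot of everything labeled so far. */
    public synchronized List<Labeled> collected() {
        return Collections.unmodifiableList(new ArrayList<>(collected));
    }
}
```

The validation view would call next() to fetch an article, let the user
accept or reject each name from the keyboard, and then call submit()
with the decisions.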

In a corpus project I think it is nice to have very specialized tooling
to collect annotations very efficiently, but you also need tooling to
view all the annotations you have collected. For the latter I think the
Cas Editor could be a good tool.

The reason why I like a UIMA-enabled approach so much is that our
infrastructure can then be used by almost any annotation project, and
that will help us to get OpenNLP into more domains, maybe even into
domains where training data will always be confidential.

Jörn
