On 6/22/11 10:53 AM, Olivier Grisel wrote:
2011/6/22 Jörn Kottmann:
On 6/22/11 10:27 AM, Olivier Grisel wrote:
Currently there is only a basic command line interface. I plan to work
on a Swing version too, and Hannes has started to work on an HTML /
JavaScript frontend.
Did you have a look at the Cas Editor? It might already have many of
the features you need.
I did some time ago, but AFAIK it does not focus on line-by-line,
keyboard-based "binary" (true / false) validation and fast keyboard-based
scanning of a large corpus.
No, it does not, but it can be extended to work like this. You would add a new view
which, for example, shows a list of the names you want to validate.
My use case is to bootstrap an OpenNLP model from the output of a rough,
incomplete extraction of Wikipedia with pignlproc as explained in [1],
use that (roughly) trained NameFinder model on new Wikipedia text, and
then use the refiner to validate the output of the NameFinder model so
as to be able to iterate later (potentially with a more active
learning strategy, as in Dualist).

I will have another look at the Cas Editor before embarking on the Swing
version though.
In your case you would need to cut your large amount of input text into meaningful
pieces a user can work on. For Wikipedia, the article level probably makes sense.
The user would see the entire article, and you would collect the input
with the new validation view. It would be easy to add good keyboard support to this view.

The Cas Editor is CAS based, but in your case it would be connected to a server which
can provide the next CAS to label and can collect the labeled CASes.
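
A minimal sketch of that server-side loop, assuming a plain in-memory queue stands in for the server. The names ValidationQueue, Item, Verdict, next, submit and accepted are hypothetical and are not part of the Cas Editor or UIMA API; a real setup would hand out CASes instead of simple records.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the server: hands out work items and collects verdicts.
// No UIMA or Cas Editor classes are used here.
class ValidationQueue {
    /** One unit of work, e.g. a candidate name found in a Wikipedia article. */
    record Item(String articleId, String candidateName) {}

    /** A binary (true / false) decision attached to the item it was made for. */
    record Verdict(Item item, boolean accepted) {}

    private final Deque<Item> pending = new ArrayDeque<>();
    private final List<Verdict> collected = new ArrayList<>();

    ValidationQueue(List<Item> items) {
        pending.addAll(items);
    }

    /** Returns the next item to label, or empty when nothing is left. */
    synchronized Optional<Item> next() {
        return Optional.ofNullable(pending.pollFirst());
    }

    /** Stores the annotator's decision for an item. */
    synchronized void submit(Item item, boolean accepted) {
        collected.add(new Verdict(item, accepted));
    }

    /** Returns only the accepted items, e.g. as input for the next training round. */
    synchronized List<Item> accepted() {
        List<Item> result = new ArrayList<>();
        for (Verdict v : collected) {
            if (v.accepted()) {
                result.add(v.item());
            }
        }
        return result;
    }
}
```

A client view would call next(), show the item, map one key to submit(item, true) and another to submit(item, false), and repeat until next() returns empty.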

In a corpus project I think it is nice to have very specialized tooling to collect annotations very efficiently, but you also need tooling to view all the annotations you have collected. For the
latter I think the Cas Editor could be a good tool.

The reason why I like a UIMA-enabled approach so much is that our infrastructure can then be used by almost any annotation project, and that will help us to get OpenNLP into more domains,
maybe even into domains where training data will always be confidential.

Jörn
