2011/6/22 Jörn Kottmann <[email protected]>:

> I was actually thinking about something similar. Make a small server which
> can host XMI CAS files. CASes have the advantage that they take away a lot
> of the complexity when dealing with a text and annotations.
>
> Since we have a UIMA integration, OpenNLP can be trained directly with the
> CASes; in this case we would make a small server component which can do
> the training and then make the models available, via HTTP for example.
>
> It sounds like a corpus-refiner-based web UI could easily be attached
> to such a server, and also other tools like the Cas Editor.

I find the UIMA CAS API much more complicated to work with than
working directly with token-level concepts through the OpenNLP API
(i.e. with arrays of Span). I haven't had a look at the opennlp-uima
subproject though: you probably already have tooling and predefined
type systems that make interoperability with CAS instances less of a
pain.
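To illustrate what I mean by token-level concepts, here is a minimal,
self-contained sketch of the Span idea. Note this is a simplified
stand-in for opennlp.tools.util.Span written for clarity, not the real
class: a name annotation is just a [start, end) pair of token indices,
optionally tagged with a type.

```java
import java.util.Arrays;

// Simplified stand-in for opennlp.tools.util.Span: a token-level
// annotation is a half-open [start, end) range over a token array,
// optionally carrying an entity type.
public class SpanSketch {
    static final class Span {
        final int start, end;   // token indices, end exclusive
        final String type;
        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
        // Join the tokens the span covers back into a string.
        String coveredText(String[] tokens) {
            return String.join(" ", Arrays.copyOfRange(tokens, start, end));
        }
    }

    public static void main(String[] args) {
        String[] tokens = {"Pierre", "Vinken", "joined", "the", "board", "."};
        // A name finder would return spans like this one over the tokens:
        Span person = new Span(0, 2, "person");
        System.out.println(person.type + ": " + person.coveredText(tokens));
    }
}
```

Working against such plain index pairs is, to me, much simpler than
navigating a CAS type system for the same information.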

> To pre-annotate the articles, we might want to add different types of
> name annotations.
>
>> We would like to make a fast binary interface with keyboard shortcuts
>> to focus on one sentence at a time. If the user thinks that all the
>> entities in the sentence are correctly annotated by the model, he/she
>> presses "space", the sentence is marked validated, and the focus moves
>> to the next sentence. If the sentence is complete gibberish he/she can
>> discard the sample by pressing "d". The user can also fix individual
>> annotations using the mouse interface before validating the corrected
>> sample.
>>
> Did you discuss focusing on the sentence level? This solution would
> still require that one annotator go through the entire document. Maybe
> we have a user who wants to fix our wikinews model to detect his
> entity of choice. Then he might want to search for sentences which
> contain it and only label these.

Adding a keyword filter / search would be very interesting indeed.

> Working on the sentence level also has the advantage that a user can
> skip a sentence which contains an entity he is not sure how to label.

Yes.

> Did you think of using GWT? It might be a very good fit for OpenNLP
> because everyone here has a lot of experience with Java, but maybe not
> so much experience with JS?

In my experience the GWT abstraction layer adds more complexity than
anything else when dealing with low-level DOM concepts such as
introducing new "span" elements around a mouse selection.

I much prefer debugging in JS with libraries such as jQuery and the
Firebug debugger, even though I am not an experienced JS programmer
either.

Furthermore Hannes already had a working code base.

> Entity disambiguation would be very nice to have in OpenNLP and I also
> need to work on that soon.

I will (soon?) include a couple of new scripts in pignlproc to extract
occurrence contexts of any kind of entity occurring as a wikilink in
Wikipedia dumps and load those into a Solr index. I will let you know
when that happens.

>> Comments and pull-requests on the corpus-refiner prototype welcome. I
>> plan to go on working on this project from time to time. AFAIK Hannes
>> won't have time to work on the JS layer in the short term but it
>> should be at least possible to have a first version of the command
>> line based interface rather quickly.
>
> Yes, it would be nice to have such a tool, but for OpenNLP Annotations
> it must be more focused on crowdsourcing and on working well with a
> small / medium-sized group of people.

I agree. The CLI (& Swing) interface is still useful to validate the
workflow concepts though.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
