Hi all,

Here is a short report on the Berlin Buzzwords Semantic / NLP Hackathon that happened on Wednesday and yesterday at Neofonie and was related to this corpus annotation project.
Basically we worked in small 2-3 people groups on various related topics.

Hannes introduced an HTML / JS based tool named Walter to visualize and edit named entities and (optionally typed) relations between those entities. Demo is here: http://tmdemo.iais.fraunhofer.de/walter/ Currently Walter works with UIMA / XMI formatted files as input / output, using a Java servlet deployed on a Tomcat server for instance. The plan is to adapt it to a corpus annotation validation / refinement pattern: feed it with a partially annotated corpus coming from the output of an OpenNLP model pre-trained on the annotations extracted from Wikipedia using https://github.com/ogrisel/pignlproc to bootstrap multilingual models.

We would like to make a fast binary interface with keyboard shortcuts that focuses on one sentence at a time. If the user thinks that all the entities in the sentence are correctly annotated by the model, he/she presses "space": the sentence is marked as validated and the focus moves to the next sentence. If the sentence is complete gibberish, he/she can discard the sample by pressing "d". The user can also fix individual annotations using the mouse interface before validating the corrected sample. The up and down arrows allow the user to move the focus to the previous and next sentences (infinite AJAX / JSON scrolling over the corpus) without validating / discarding anything. When the focus is on a sample, the previous and next samples should be displayed before and after it with a lower opacity level in read-only mode, so as to provide the user with contextual information to make the right decision on the active sample.

At the end of the session, the user can export all the validated samples as a new corpus formatted using the OpenNLP format. Unprocessed or explicitly discarded samples are not part of this refined version of the annotated corpus.

To implement this we plan to rewrite the server side part of Walter in two parts (rough sketches of both parts follow below):

1- a set of JAX-RS resources to convert corpus items + their annotations between JSON objects on the client and OpenNLP NameSamples on the server. The first embryo for this part is here: https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner-web

2- a POJO lib that uses OpenNLP to handle corpus loading, iterative validation (with validation / discarding / update + previous and next navigation) and serialization of the validated samples to a new OpenNLP formatted file that can be fed to train a new generation of the model. The work on this part has started here: https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner

Have a look at the test folder to see what's currently implemented. I would like to keep this in a separate Maven artifact to be able to build a simple alternative CLI variant of the refiner interface that does not require starting a Jetty or Tomcat instance / browser.

For the client side, Hannes started checking that jQuery should make it easier to implement the AJAX callbacks based on mouse + keyboard interaction.

As for the licensing, Hannes told me that his employer should be willing to license the relevant parts of Walter (those not specific to Fraunhofer) under a liberal license (MIT, BSD or ASL) so that it should be possible to contribute it to the ASF in the long term.
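To make part 1 a bit more concrete, here is a rough sketch of what such a JAX-RS resource could look like. Note that this is not the actual code from corpus-refiner-web: the resource path, the JSON field names ("tokens", "entities", "start", "end", "type") and the in-memory sample list are just illustrative assumptions, and it presumes a JSON provider such as Jackson is registered with the JAX-RS runtime:

package corpusrefiner.web;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import opennlp.tools.namefind.NameSample;
import opennlp.tools.util.Span;

@Path("/samples")
public class SampleResource {

    // In-memory corpus shared across requests; a real implementation would
    // load it through the corpus-refiner POJO lib (part 2).
    private static final List<NameSample> SAMPLES = new ArrayList<NameSample>();

    @GET
    @Path("/{index}")
    @Produces(MediaType.APPLICATION_JSON)
    public Map<String, Object> getSample(@PathParam("index") int index) {
        NameSample sample = SAMPLES.get(index);
        List<Map<String, Object>> entities = new ArrayList<Map<String, Object>>();
        for (Span name : sample.getNames()) {
            Map<String, Object> entity = new HashMap<String, Object>();
            entity.put("start", name.getStart()); // token offset, inclusive
            entity.put("end", name.getEnd());     // token offset, exclusive
            entity.put("type", name.getType());   // e.g. "person", may be null
            entities.add(entity);
        }
        Map<String, Object> json = new HashMap<String, Object>();
        json.put("tokens", sample.getSentence());
        json.put("entities", entities);
        return json;
    }
}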
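And a similar sketch for part 2, showing how the POJO lib could load NameSamples from an OpenNLP formatted file and serialize the validated subset back. The class name is made up for illustration; the OpenNLP calls (NameSampleDataStream, PlainTextByLineStream, NameSample#toString) follow the 1.5 API, so double check them against the version you are using:

package corpusrefiner;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// Loads a corpus of NameSamples into memory for navigation / validation and
// writes the validated subset back in the OpenNLP training format.
public class CorpusRoundTrip {

    public static List<NameSample> load(String path) throws IOException {
        ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(
                        new InputStreamReader(new FileInputStream(path), "UTF-8")));
        List<NameSample> loaded = new ArrayList<NameSample>();
        NameSample sample;
        while ((sample = samples.read()) != null) {
            loaded.add(sample);
        }
        samples.close();
        return loaded;
    }

    public static void save(List<NameSample> validated, String path)
            throws IOException {
        Writer out = new OutputStreamWriter(new FileOutputStream(path), "UTF-8");
        for (NameSample sample : validated) {
            // NameSample.toString() emits the <START:type> ... <END> markup
            // that the OpenNLP name finder trainer expects, one sentence per line.
            out.write(sample.toString());
            out.write("\n");
        }
        out.close();
    }
}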
Another group tested DUALIST: the tool looks really nice for the text classification case, less so for the NE detection case (the sample view is not very well suited for structured output, and it requires building Hearst features by hand; apparently DUALIST does not do that automatically).

It should be possible to turn the Walter refiner into a real active learning annotation tool for structured output (NE and relation extraction) if we use the confidence level of the sequential perceptron of OpenNLP and treat the least confident predictions as priority samples when ordering the samples to be processed with the refiner after pressing "space" or "d" (see the sketch at the end of this mail). The server could incrementally use the refined samples to update its model and adjust the priority of the next batch of samples to refine from time to time, since the perceptron algorithm is online (it supports partial updates of the model without restarting from scratch).

Another group worked on named entity disambiguation using the Solr MoreLikeThisHandler and indexes of the occurrence contexts of those entities in Wikipedia articles. This work will probably be integrated in Stanbol directly and should be less interesting for the OpenNLP project.

Also another group worked on adapting pignlproc to their own tools and Hadoop infrastructure.

Comments and pull-requests on the corpus-refiner prototype are welcome. I plan to go on working on this project from time to time. AFAIK Hannes won't have time to work on the JS layer in the short term, but it should at least be possible to have a first version of the command line based interface rather quickly.
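To illustrate the prioritization idea mentioned above, here is a rough sketch based on NameFinderME#probs(), which exposes the per-token probabilities of the last decoded sequence. The ConfidenceRanker class and the choice of the minimum token probability as the sentence level confidence are my own assumptions; none of this is implemented in the prototype yet:

package corpusrefiner;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;

// Orders pending samples so that the sentences the current model is least
// confident about are shown to the annotator first.
public class ConfidenceRanker {

    private final NameFinderME finder;

    public ConfidenceRanker(NameFinderME finder) {
        this.finder = finder;
    }

    // Sentence level confidence: the lowest per-token probability of the
    // last decoded sequence (one of several reasonable aggregations).
    public double confidence(String[] tokens) {
        finder.find(tokens);
        double min = 1.0;
        for (double p : finder.probs()) {
            min = Math.min(min, p);
        }
        return min;
    }

    // Returns the pending samples sorted by ascending confidence; the head
    // of the list is the next sample the refiner UI should present.
    public List<NameSample> prioritize(List<NameSample> pending) {
        final Map<NameSample, Double> scores =
                new IdentityHashMap<NameSample, Double>();
        for (NameSample sample : pending) {
            scores.put(sample, confidence(sample.getSentence()));
        }
        List<NameSample> ranked = new ArrayList<NameSample>(pending);
        Collections.sort(ranked, new Comparator<NameSample>() {
            public int compare(NameSample a, NameSample b) {
                return scores.get(a).compareTo(scores.get(b));
            }
        });
        return ranked;
    }
}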
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel