Hi all,

Here is a short report on the Berlin Buzzwords Semantic / NLP Hackathon that happened on Wednesday and yesterday at Neofonie and was related to this corpus annotation project.
Basically we worked in small 2-3 people groups on various related topics.

Hannes introduced an HTML / JS based tool named Walter to visualize and edit named entities and (optionally typed) relations between those entities. Demo is here: http://tmdemo.iais.fraunhofer.de/walter/ Currently Walter works with UIMA / XMI formatted files as input / output, using a Java servlet deployed on a Tomcat server for instance. The plan is to adapt it to a corpus annotation validation / refinement pattern: feed it with a partially annotated corpus coming from the output of an OpenNLP model pre-trained on the annotations extracted from Wikipedia using https://github.com/ogrisel/pignlproc to bootstrap multilingual models.

We would like to make a fast binary interface with keyboard shortcuts that focuses on one sentence at a time. If the user thinks that all the entities in the sentence are correctly annotated by the model, he/she presses "space": the sentence is marked as validated and the focus moves to the next sentence. If the sentence is complete gibberish, he/she can discard the sample by pressing "d". The user can also fix individual annotations using the mouse interface before validating the corrected sample. The up and down arrows allow the user to move the focus to the previous and next sentences (infinite AJAX / JSON scrolling over the corpus) without validating / discarding anything. When the focus is on a sample, the previous and next samples should be displayed before and after it with a lower opacity level in read-only mode, so as to provide the user with contextual information to make the right decision on the active sample.

At the end of the session, the user can export all the validated samples as a new corpus formatted using the OpenNLP format. Unprocessed or explicitly discarded samples are not part of this refined version of the annotated corpus.

To implement this we plan to rewrite the server side part of Walter in two parts (rough sketches of both parts follow below):

1- a set of JAX-RS resources to convert corpus items + their annotations between JSON objects on the client and OpenNLP NameSamples on the server. The first embryo for this part is here: https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner-web

2- a POJO lib that uses OpenNLP to handle corpus loading, iterative validation (with validation / discarding / update + previous and next navigation) and serialization of the validated samples to a new OpenNLP formatted file that can be fed to train a new generation of the model. The work on this part has started here: https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner

Have a look at the test folder to see what's currently implemented. I would like to keep this in a separate Maven artifact to be able to build a simple alternative CLI variant of the refiner interface that does not require starting a Jetty or Tomcat instance / browser.

For the client side, Hannes started checking that jQuery should make it easier to implement the AJAX callbacks based on mouse + keyboard interaction.

As for the licensing, Hannes told me that his employer should be willing to license the relevant parts of Walter (those not specific to Fraunhofer) under a liberal license (MIT, BSD or ASL) so that it should be possible to contribute it to the ASF in the long term.
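To make part 1 a bit more concrete, here is a rough sketch of what such a JAX-RS resource could look like. Note that this is not the actual code from corpus-refiner-web: the resource path, the JSON field names ("tokens", "entities", "start", "end", "type") and the in-memory sample list are just illustrative assumptions, and it presumes a JSON provider such as Jackson is registered with the JAX-RS runtime:

package corpusrefiner.web;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import opennlp.tools.namefind.NameSample;
import opennlp.tools.util.Span;

@Path("/samples")
public class SampleResource {

    // In-memory corpus shared across requests; a real implementation would
    // load it through the corpus-refiner POJO lib (part 2).
    private static final List<NameSample> SAMPLES = new ArrayList<NameSample>();

    @GET
    @Path("/{index}")
    @Produces(MediaType.APPLICATION_JSON)
    public Map<String, Object> getSample(@PathParam("index") int index) {
        NameSample sample = SAMPLES.get(index);
        List<Map<String, Object>> entities = new ArrayList<Map<String, Object>>();
        for (Span name : sample.getNames()) {
            Map<String, Object> entity = new HashMap<String, Object>();
            entity.put("start", name.getStart()); // token offset, inclusive
            entity.put("end", name.getEnd());     // token offset, exclusive
            entity.put("type", name.getType());   // e.g. "person", may be null
            entities.add(entity);
        }
        Map<String, Object> json = new HashMap<String, Object>();
        json.put("tokens", sample.getSentence());
        json.put("entities", entities);
        return json;
    }
}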
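And a similar sketch for part 2, showing how the POJO lib could load NameSamples from an OpenNLP formatted file and serialize the validated subset back. The class name is made up for illustration; the OpenNLP calls (NameSampleDataStream, PlainTextByLineStream, NameSample#toString) follow the 1.5 API, so double check them against the version you are using:

package corpusrefiner;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// Loads a corpus of NameSamples into memory for navigation / validation and
// writes the validated subset back in the OpenNLP training format.
public class CorpusRoundTrip {

    public static List<NameSample> load(String path) throws IOException {
        ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(
                        new InputStreamReader(new FileInputStream(path), "UTF-8")));
        List<NameSample> loaded = new ArrayList<NameSample>();
        NameSample sample;
        while ((sample = samples.read()) != null) {
            loaded.add(sample);
        }
        samples.close();
        return loaded;
    }

    public static void save(List<NameSample> validated, String path)
            throws IOException {
        Writer out = new OutputStreamWriter(new FileOutputStream(path), "UTF-8");
        for (NameSample sample : validated) {
            // NameSample.toString() emits the <START:type> ... <END> markup
            // that the OpenNLP name finder trainer expects, one sentence per line.
            out.write(sample.toString());
            out.write("\n");
        }
        out.close();
    }
}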
Another group tested DUALIST: the tool looks really nice for the text classification case, less so for the NE detection case (the sample view is not very well suited for structured output, and it requires building Hearst features by hand; apparently DUALIST does not do that automatically).

It should be possible to turn the Walter refiner into a real active learning annotation tool for structured output (NE and relation extraction) if we use the confidence level of the sequential perceptron of OpenNLP and treat the least confident predictions as priority samples when ordering the samples to be processed with the refiner after pressing "space" or "d" (see the sketch at the end of this mail). The server could incrementally use the refined samples to update its model and adjust the priority of the next batch of samples to refine from time to time, since the perceptron algorithm is online (it supports partial updates of the model without restarting from scratch).

Another group worked on named entity disambiguation using the Solr MoreLikeThisHandler and indexes of the occurrence contexts of those entities in Wikipedia articles. This work will probably be integrated in Stanbol directly and should be less interesting for the OpenNLP project.

Also another group worked on adapting pignlproc to their own tools and Hadoop infrastructure.

Comments and pull-requests on the corpus-refiner prototype are welcome. I plan to go on working on this project from time to time. AFAIK Hannes won't have time to work on the JS layer in the short term, but it should at least be possible to have a first version of the command line based interface rather quickly.
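To illustrate the prioritization idea mentioned above, here is a rough sketch based on NameFinderME#probs(), which exposes the per-token probabilities of the last decoded sequence. The ConfidenceRanker class and the choice of the minimum token probability as the sentence level confidence are my own assumptions; none of this is implemented in the prototype yet:

package corpusrefiner;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;

// Orders pending samples so that the sentences the current model is least
// confident about are shown to the annotator first.
public class ConfidenceRanker {

    private final NameFinderME finder;

    public ConfidenceRanker(NameFinderME finder) {
        this.finder = finder;
    }

    // Sentence level confidence: the lowest per-token probability of the
    // last decoded sequence (one of several reasonable aggregations).
    public double confidence(String[] tokens) {
        finder.find(tokens);
        double min = 1.0;
        for (double p : finder.probs()) {
            min = Math.min(min, p);
        }
        return min;
    }

    // Returns the pending samples sorted by ascending confidence; the head
    // of the list is the next sample the refiner UI should present.
    public List<NameSample> prioritize(List<NameSample> pending) {
        final Map<NameSample, Double> scores =
                new IdentityHashMap<NameSample, Double>();
        for (NameSample sample : pending) {
            scores.put(sample, confidence(sample.getSentence()));
        }
        List<NameSample> ranked = new ArrayList<NameSample>(pending);
        Collections.sort(ranked, new Comparator<NameSample>() {
            public int compare(NameSample a, NameSample b) {
                return scores.get(a).compareTo(scores.get(b));
            }
        });
        return ranked;
    }
}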
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel