On Thu, Nov 17, 2011 at 11:48 AM, Jörn Kottmann <[email protected]> wrote:
> On 11/17/11 11:32 AM, Aliaksandr Autayeu wrote:
>
>>> We shouldn't replace JWNL with a newer version,
>>> because we currently don't have the ability to train
>>> or evaluate the coref component.
>>
>> +1. Having test coverage eases many things, refactoring and development
>> included :)
>>
>>> This is a big issue for us because that also blocks
>>> other changes and updates to the code itself,
>>> e.g. the cleanups Aliaksandr contributed.
>>>
>>> What we need here is a plan for how we can get the coref component
>>> into a state which makes it possible to develop it in a community.
>>>
>>> If we don't find a way to resolve this, I think we should move the coref
>>> stuff to the sandbox and leave it there until we have some training data.
>>
>> In my experience, doing things like this is almost equal to deleting the
>> piece of code altogether. On the other hand, if there is no developer
>> actively using and developing this piece, having corpora, tests, etc.,
>> others might not have enough incentive.
>
> That is already the situation: the developer who wrote it doesn't support
> it anymore. The only way to get it alive again would be to get the training
> and evaluation running. If we have that, it will be possible to continue to
> work on it, and people can start using it. The code itself is easy to
> understand and I have a good idea of how it works.
>
> In the current state it really blocks the development of a few things.
>
>>> Another option would be to label enough wikinews data, so we are able
>>> to train it.
>>
>> How much exactly is this "enough"? And what's the annotation UI? This
>> also might be a good option to improve the annotation tools. I might be
>> interested in pursuing this option (only if the corpus produced will be
>> under a free license), mainly to learn :) but I would need some help and
>> supervision.
>
> We are discussing doing a wikinews crowdsourcing project to label
> training data for all components in OpenNLP.
>
> I once wrote a proposal to communicate this idea:
> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>
> Currently we have a first version of the Corpus Server, plugins for the
> UIMA Cas Editor (an annotation tool) to access articles in the Corpus
> Server, and an OpenNLP plugin which can help with sentence detection,
> tokenization and NER (it could be extended with support for coref).
>
> These tools are all located in the sandbox.
>
> I am currently using them to run a private annotation project, and
> therefore have time to work on them.

I'll take a look at them. I also have my own annotation tools, because I
wasn't happy with what was available out there a few years ago, and because
of some specifics of the situation which can be exploited to speed up the
annotation, but I would be happy to avoid duplication.

Aliaksandr
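
P.S. To make the pre-annotation step a bit more concrete: a minimal sketch
of what such a plugin could do per article, using only the public OpenNLP
API, might look like the code below. The model file names and the example
text are placeholders I picked for illustration, not anything taken from the
actual Cas Editor plugin.

import java.io.FileInputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class PreAnnotationSketch {

  public static void main(String[] args) throws Exception {
    // Placeholder model files; any compatible models would do.
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(
        new SentenceModel(new FileInputStream("en-sent.bin")));
    TokenizerME tokenizer = new TokenizerME(
        new TokenizerModel(new FileInputStream("en-token.bin")));
    NameFinderME nameFinder = new NameFinderME(
        new TokenNameFinderModel(new FileInputStream("en-ner-person.bin")));

    // Placeholder article text; in the plugin this would come from the
    // Corpus Server via the Cas Editor.
    String article = "Pierre Vinken joined the board. He is 61 years old.";

    // Detect sentences, tokenize each one, and mark candidate names,
    // which an annotator would then correct by hand.
    for (String sentence : sentenceDetector.sentDetect(article)) {
      String[] tokens = tokenizer.tokenize(sentence);
      for (Span name : nameFinder.find(tokens)) {
        StringBuilder sb = new StringBuilder();
        for (int i = name.getStart(); i < name.getEnd(); i++) {
          sb.append(tokens[i]).append(' ');
        }
        System.out.println("candidate name: " + sb.toString().trim());
      }
    }

    // Forget adaptive data before processing the next document.
    nameFinder.clearAdaptiveData();
  }
}

Something along these lines would also be a natural starting point if coref
pre-annotation gets added later.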
