> We shouldn't replace JWNL with a newer version,
> because we currently don't have the ability to train
> or evaluate the coref component.
>
+1. Having test coverage eases many things, refactoring and development
included :)

> This is a big issue for us because that also blocks
> other changes and updates to the code itself,
> e.g. the cleanups Aliaksandr contributed.
>
> What we need here is a plan for how we can get the coref component
> into a state which makes it possible to develop it in a community.
>
> If we don't find a way to resolve this I think we should move the coref
> stuff to the sandbox and leave it there until we have some training data.
>
In my experience, moving code to the sandbox is almost equivalent to
deleting it altogether. On the other hand, if no developer is actively
using and developing this component and has the corpora, tests, etc.,
others might not have enough incentive to work on it.

> Not having the ability to train coref also blocks changes we might want
> to make to our maxent library.
>
> Maybe it is possible to buy a license for MUC 6 and 7 data, so we can share
> this data privately within the team. Does anyone know whether that would be
> possible with the LDC license?
>
> The CONLL2011 data (OntoNotes, costs $50) might also be suitable to train it:
> http://conll.bbn.com/index.php/data.html
>
> Another option would be to label enough wikinews data, so we are able to
> train it.
>
How much exactly is "enough"? And what is the annotation UI? This might
also be a good opportunity to improve the annotation tools. I might be
interested in pursuing this option (only if the resulting corpus is released
under a free license), mainly to learn :) but I would need some help and
supervision.

Aliaksandr
