On 6/22/11 10:45 AM, Olivier Grisel wrote:
I find the UIMA CAS API much more complicated to work with than
directly working with token-level concepts with the OpenNLP API (i.e.
with arrays of Span). I haven't had a look at the opennlp-uima
subproject though: you probably already have tooling and predefined
type systems that make interoperability with CAS instances less of a
pain.
If you look at annotation tools, they usually give the user some
flexibility in terms of what kinds of annotations they are allowed to
add. One thing I always see is that as soon as they allow more complex
annotations, the tools and the code that handles the annotations also
get more complex. Have a look at WordFreak or GATE.
The CAS might be difficult to use at first, but at least it works and
is very well tested. If we create a custom solution we might end up
with similar complexity anyway.
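For comparison, here is a rough, untested sketch of the two styles.
The OpenNLP half only needs Span and SimpleTokenizer; for the UIMA
half I use uimaFIT's JCasFactory purely to get a JCas without writing
an XML descriptor, so that dependency is just a convenience for the
example, not something either project requires.

import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class SpanVsCasSketch {

    public static void main(String[] args) throws Exception {
        String text = "Pierre Vinken joined the board.";

        // OpenNLP style: plain arrays of Span, nothing else needed.
        Span[] tokens = SimpleTokenizer.INSTANCE.tokenizePos(text);
        for (Span t : tokens) {
            System.out.println(text.substring(t.getStart(), t.getEnd()));
        }

        // UIMA style: the same offsets stored as indexed feature
        // structures inside a CAS, typed against a type system.
        JCas cas = JCasFactory.createJCas();
        cas.setDocumentText(text);
        for (Span t : tokens) {
            new Annotation(cas, t.getStart(), t.getEnd()).addToIndexes();
        }
        System.out.println(cas.getAnnotationIndex().size()
            + " feature structures indexed");
    }
}

The Span version is clearly less ceremony, but the CAS version is what
gives us the indexed, typed annotations the tooling needs.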
We would need to define a type system, but that is something we need
to do anyway, independent of how we implement it. Maybe we even need
to support different type systems for different corpora. I guess we
will start with Wikipedia-based data, but one day we might want to
annotate an email or blog corpus.
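Per corpus that could be as small as a factory for a
TypeSystemDescription, something along these lines (the type and
package names are placeholders, not a proposal):

import org.apache.uima.UIMAFramework;
import org.apache.uima.resource.metadata.TypeDescription;
import org.apache.uima.resource.metadata.TypeSystemDescription;

public class CorpusTypeSystems {

    // Builds a tiny corpus-specific type system; names are placeholders.
    static TypeSystemDescription forCorpus(String prefix) {
        TypeSystemDescription tsd =
            UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();

        // Sentence and named entity types, inheriting begin/end offsets
        // from the built-in uima.tcas.Annotation type.
        tsd.addType(prefix + ".Sentence", "A sentence", "uima.tcas.Annotation");
        TypeDescription ne = tsd.addType(prefix + ".NamedEntity",
            "A named entity", "uima.tcas.Annotation");
        ne.addFeature("kind", "person, location, organization, ...",
            "uima.cas.String");
        return tsd;
    }

    public static void main(String[] args) {
        // One type system per corpus, e.g. wikipedia now, email or blog later.
        TypeSystemDescription wikipedia = forCorpus("org.example.wikipedia");
        TypeSystemDescription email = forCorpus("org.example.email");
        System.out.println(wikipedia.getTypes().length + " + "
            + email.getTypes().length + " types");
    }
}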
It is an interesting question how the type system should look, since
we need to track where the annotations come from, might even want some
of them to be double-checked, and may need to annotate disagreement
between annotators.
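One way to express that is to put provenance and review state directly
into the features, roughly like this. The names annotatorId,
reviewStatus and Disagreement are only illustrations of the idea, not
a proposal for the final type system:

import org.apache.uima.UIMAFramework;
import org.apache.uima.resource.metadata.TypeDescription;
import org.apache.uima.resource.metadata.TypeSystemDescription;

public class ProvenanceSketch {

    public static void main(String[] args) {
        TypeSystemDescription tsd =
            UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();

        // Every annotation records who created it and whether it still
        // needs a second pair of eyes.
        TypeDescription ne = tsd.addType("org.example.NamedEntity",
            "Named entity with provenance", "uima.tcas.Annotation");
        ne.addFeature("annotatorId", "human or tool that produced it",
            "uima.cas.String");
        ne.addFeature("reviewStatus", "unchecked, confirmed or rejected",
            "uima.cas.String");

        // Disagreement between annotators is itself an annotation that
        // points at the two conflicting ones.
        TypeDescription dis = tsd.addType("org.example.Disagreement",
            "Conflicting annotations over the same span", "uima.tcas.Annotation");
        dis.addFeature("first", "first conflicting annotation",
            "org.example.NamedEntity");
        dis.addFeature("second", "second conflicting annotation",
            "org.example.NamedEntity");

        System.out.println(tsd.getTypes().length + " types defined");
    }
}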
Jörn