Any other opinions on how we should store/exchange our
text with annotations?
As proposed up to now:
1. UIMA CAS based approach
2. Custom solution as proposed by Olivier
I think we should reach consensus here quickly
so we can start extending the proposal.
And if there are no objections, I suggest we include
the Corpus Refiner in the proposal as a web-based tool
to update, verify, and annotate a corpus.
Jörn
On 6/22/11 11:38 AM, Olivier Grisel wrote:
2011/6/22 Jörn Kottmann<[email protected]>:
On 6/22/11 10:45 AM, Olivier Grisel wrote:
I find the UIMA CAS API much more complicated to work with than
directly working with token-level concepts with the OpenNLP API (i.e.
with arrays of Span). I haven't had a look at the opennlp-uima
subproject though: you probably already have tooling and predefined
type systems that make interoperability with CAS instances less of a
pain.
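For illustration, the Span-based approach mentioned above can be sketched with a minimal stand-in class. This is a hypothetical simplification, not the real opennlp.tools.util.Span: an annotation is just a typed [start, end) range over a token array, which is what makes the token-level API simple to work with.

```java
import java.util.Arrays;

// Minimal stand-in for a span-style annotation (hypothetical, simplified):
// a typed half-open [start, end) range over a token array.
public class SpanSketch {

    record Span(int start, int end, String type) {
        // Recover the covered text from the token array.
        String coveredText(String[] tokens) {
            return String.join(" ", Arrays.copyOfRange(tokens, start, end));
        }
    }

    public static void main(String[] args) {
        String[] tokens = {"Pierre", "Vinken", "joined", "the", "board", "."};
        // A name finder in this style returns its result as an array of spans:
        Span[] names = { new Span(0, 2, "person") };
        for (Span s : names) {
            System.out.println(s.type() + ": " + s.coveredText(tokens));
        }
    }
}
```

The trade-off discussed in this thread is that such flat span arrays carry no type system or provenance metadata of their own, which is what the CAS provides.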
If you look at annotation tools, they usually give the user some
flexibility in what kinds of annotations they are allowed to add.
One thing I always see is that as soon as they allow more complex
annotations, the tools and the code which handles those annotations
also get complex. Have a look at Wordfreak or Gate.
The CAS might be difficult to use at first, but at least it works and is
very well tested. If we create a custom solution we might end up with
a similar complexity anyway.
We would need to define a type system, but that is something we need
to do anyway independent of which way we implement it.
Maybe we even need to support different type systems for different corpora.
I guess we start with wikipedia based data, but one day we might want to
annotate an email or blog corpus.
It is an interesting question how the type system should look, since
we need to track where the annotations come from, might want some of
them to be double-checked, and may need to annotate disagreement
between annotators.
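The requirements listed here (provenance, double checking, annotator disagreement) could, purely as a sketch, translate into extra metadata on each annotation. All names below are hypothetical and are not part of any existing OpenNLP or UIMA type system:

```java
import java.util.List;

// Hypothetical sketch: each annotation records who produced it, and its
// review state is derived from the collected judgements. Illustrative
// names only, not an existing OpenNLP/UIMA type system.
public class AnnotationProvenance {

    enum Status { UNVERIFIED, VERIFIED, DISPUTED }

    // One annotator's label for the annotated range.
    record Judgement(String annotatorId, String label) {}

    record Annotation(int start, int end, List<Judgement> judgements) {
        Status status() {
            long distinctLabels =
                    judgements.stream().map(Judgement::label).distinct().count();
            if (distinctLabels > 1) return Status.DISPUTED;   // annotators disagree
            return judgements.size() >= 2 ? Status.VERIFIED   // double-checked
                                          : Status.UNVERIFIED;
        }
    }

    public static void main(String[] args) {
        Annotation a = new Annotation(0, 2, List.of(
                new Judgement("annotator-1", "person"),
                new Judgement("annotator-2", "organization")));
        System.out.println(a.status()); // disagreement -> DISPUTED
    }
}
```

In a CAS-based solution the same information would live as features on the annotation types in the type system descriptor rather than in plain Java records.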
Point taken.