On 6/22/11 8:13 PM, Hannes Korte wrote:
On 06/22/2011 07:53 PM, Olivier Grisel wrote:
2011/6/22 Jörn Kottmann <[email protected]>:
On 6/22/11 6:50 PM, Olivier Grisel wrote:
I am OK with switching to UIMA CAS. We might need additional metadata
outside of the CAS annotations, though. For instance, if an annotator
fixes a typo in the Sofa itself, we might need to be able to tell
that Sofa1 is subject to being replaced by Sofa2 according to
annotator A1.
I am not sure if we should fix such mistakes; the system will also
encounter them in the real data it needs to process. Fixing typos or
correcting things in the text is always difficult when there are
already existing annotations.
Do you feel fixing mistakes in the text is important?
We can leave that issue as a low priority discussion for later and
just ignore it for now.
We can also handle it by having an option to delete "garbage" texts
from the corpus.
Yes, discarding a whole CAS. But if the CAS is document level instead
of sentence level, that might be an issue.
Let's say we have a CAS type Sentence, which will not be changed, and
another type AnnotatedSentence. Each time a sentence is annotated by a
user, a new AnnotatedSentence annotation is created over the same
span, containing information about the user and the state of the
sentence (e.g. correct, unsure, or discarded). This way we can store
all that without the need for changes to the Sofa. Alternatively, each
Sentence could have a List of something like AnnotationMetadata.
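
To make the proposal above concrete, here is a minimal sketch in plain
Python rather than the UIMA type system. The class and field names
(Sentence, AnnotatedSentence, SentenceState, user, state) and the
helpers annotate() and reviews_for() are assumptions for illustration
only; a real setup would declare the types in a UIMA type system
descriptor.

```python
from dataclasses import dataclass
from enum import Enum


class SentenceState(Enum):
    """Review states named in the proposal (hypothetical encoding)."""
    CORRECT = "correct"
    UNSURE = "unsure"
    DISCARDED = "discarded"


@dataclass(frozen=True)
class Sentence:
    """Immutable span over the Sofa text; never changed after creation."""
    begin: int  # character offset, inclusive
    end: int    # character offset, exclusive


@dataclass(frozen=True)
class AnnotatedSentence:
    """One user's judgement, stored over the same span as the Sentence."""
    begin: int
    end: int
    user: str
    state: SentenceState


def annotate(sentence, user, state):
    """Create a new review record; the Sentence and the Sofa stay untouched."""
    return AnnotatedSentence(sentence.begin, sentence.end, user, state)


def reviews_for(sentence, annotations):
    """Collect every AnnotatedSentence that covers exactly this span."""
    span = (sentence.begin, sentence.end)
    return [a for a in annotations if (a.begin, a.end) == span]


sofa = "UIMA is nice. Ths sentence has a typo."
first = Sentence(0, 13)
second = Sentence(14, 38)

annotations = [
    annotate(first, "hannes", SentenceState.CORRECT),
    annotate(second, "hannes", SentenceState.UNSURE),
    annotate(second, "olivier", SentenceState.DISCARDED),
]
```

Because each review is a separate record, two users can disagree about
the same sentence without overwriting each other, and the Sofa text is
never modified.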
The only reason to change a Sofa is when the user wants to change the
text itself, right? How would the AnnotatedSentence annotation do that?
Would it just store the changed text as a string feature?
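
One way the "changed text as a string feature" idea could work is
sketched below in plain Python. This is only an assumption about how
it might be done, not something agreed in this thread: each correction
is a hypothetical (begin, end, replacement) triple kept next to the
annotation, and the corrected text is derived on demand while the
original Sofa stays as it is.

```python
def apply_corrections(sofa_text, corrections):
    """Return a corrected copy of sofa_text.

    corrections is a list of (begin, end, replacement) triples with
    offsets into the ORIGINAL text. The spans must not overlap.
    """
    parts = []
    cursor = 0
    for begin, end, replacement in sorted(corrections):
        if begin < cursor:
            raise ValueError("overlapping corrections")
        parts.append(sofa_text[cursor:begin])  # unchanged text before the span
        parts.append(replacement)              # corrected text for the span
        cursor = end
    parts.append(sofa_text[cursor:])           # unchanged tail
    return "".join(parts)


original = "UIMA is nice. Ths sentence has a typo."
corrections = [(14, 38, "This sentence has a typo.")]
corrected = apply_corrections(original, corrections)
```

Existing annotations keep valid offsets because they still refer to the
original text; only consumers that ask for the corrected view see the
replacement.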
I believe the Corpus server should be independent of the other components
and define some kind of remote API for data interchange.
Is there a JSON version of XMI? Hannes, what is your opinion on this?
A separate corpus server sounds good to me. But this server can simply
deliver the default XMI representation of the CASes. I think the
documents have to be preprocessed for annotation on the server side of
the WebGUI anyway. The JS client should not call the corpus server
directly.
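
As a rough sketch of that preprocessing step, the WebGUI server could
parse the XMI it gets from the corpus server and hand a small JSON
document to the JS client. The sample below is a simplified,
hand-written XMI fragment; the "corpus" namespace, the Sentence type
name, and the JSON layout are assumptions for illustration, not an
existing format.

```python
import json
import xml.etree.ElementTree as ET

CAS_NS = "http:///uima/cas.ecore"

# Simplified, hand-written sample; the "corpus" namespace is hypothetical.
SAMPLE_XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:cas="http:///uima/cas.ecore"
         xmlns:corpus="http:///example/corpus.ecore"
         xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
            mimeType="text"
            sofaString="UIMA is nice. Ths sentence has a typo."/>
  <corpus:Sentence xmi:id="2" sofa="1" begin="0" end="13"/>
  <corpus:Sentence xmi:id="3" sofa="1" begin="14" end="38"/>
</xmi:XMI>
"""


def xmi_to_json(xmi_text):
    """Turn an XMI document into a JSON string for the browser client."""
    root = ET.fromstring(xmi_text)
    sofa = root.find(f"{{{CAS_NS}}}Sofa")
    text = sofa.get("sofaString")

    sentences = []
    for element in root:
        # Match on the local name only, whatever namespace the type uses.
        if element.tag.endswith("}Sentence"):
            begin = int(element.get("begin"))
            end = int(element.get("end"))
            sentences.append(
                {"begin": begin, "end": end, "text": text[begin:end]}
            )

    return json.dumps({"text": text, "sentences": sentences})


payload = json.loads(xmi_to_json(SAMPLE_XMI))
```

With this split, the corpus server only ever speaks XMI, and the JSON
shape can change with the WebGUI without touching the corpus server.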
+1
Jörn