On 6/22/11 6:50 PM, Olivier Grisel wrote:
I am ok with switching to UIMA CAS. We might need additional metadata
outside of the CAS annotations though. For instance, if an annotator
fixes a typo in the Sofa itself, we might need to be able to tell
that Sofa1 is subject to being replaced by Sofa2 according to
annotator A1.


I am not sure we should fix such mistakes, since the system will also encounter
them in the real data it needs to process. Fixing typos or otherwise correcting the
text is always difficult once annotations already exist.

Do you feel fixing mistakes in the text is important?

We could also handle this by offering an option to delete "garbage" texts from the corpus.

What other kinds of data do you think we should store outside the CASes?
Also, do you know of a good database for storing CASes? For instance, does
there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
a JDBC CASConsumer + CollectionReader that we could use with Apache
Derby?

I did a couple of tests with HBase, and it was very easy to store 100M of CASes.
Anyway, we do not really need to scale to such huge amounts, so I believe a
NoSQL or relational database would be just fine.

To get started, I believe we should just store each CAS as XMI. At a later stage
we can work on optimizing the CAS storage for our needs, and maybe even work
together with the UIMA team on a more general corpus server. I know several
people who are interested in this.

I believe the corpus server should be independent of the other components
and define some kind of remote API for data interchange.
If we define such an API, the actual storage system can easily be exchanged at
a later point in time.
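
To make the idea a bit more concrete, here is a rough sketch of what such a
storage-independent API could look like on the Java side. All names in it
(CorpusStore, InMemoryCorpusStore, the method names) are made up for
illustration and are not part of UIMA. The CAS is treated as an opaque XMI
string, so no UIMA classes are needed here, and the in-memory map merely
stands in for HBase, CouchDB or a JDBC database.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical storage-agnostic interface; a remote API could expose
// the same three operations over HTTP or any other transport.
interface CorpusStore {
    // Stores (or overwrites) the XMI serialization of one CAS.
    void put(String corpus, String casId, String xmi);

    // Returns the stored XMI, or an empty Optional if nothing is stored.
    Optional<String> get(String corpus, String casId);

    // Removes one CAS, e.g. a "garbage" text; returns true if it existed.
    boolean delete(String corpus, String casId);
}

// In-memory stand-in (an assumption for this sketch). A class backed by
// HBase or a relational database could implement the same interface
// without any change to the callers.
class InMemoryCorpusStore implements CorpusStore {
    // corpus name -> (CAS id -> XMI string)
    private final Map<String, Map<String, String>> corpora = new ConcurrentHashMap<>();

    @Override
    public void put(String corpus, String casId, String xmi) {
        corpora.computeIfAbsent(corpus, k -> new ConcurrentHashMap<>()).put(casId, xmi);
    }

    @Override
    public Optional<String> get(String corpus, String casId) {
        Map<String, String> cases =
                corpora.getOrDefault(corpus, Collections.<String, String>emptyMap());
        return Optional.ofNullable(cases.get(casId));
    }

    @Override
    public boolean delete(String corpus, String casId) {
        Map<String, String> cases = corpora.get(corpus);
        return cases != null && cases.remove(casId) != null;
    }
}
```

Callers would only ever see CorpusStore, so moving from XMI files to an
optimized storage format later on should not affect them.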

Jörn
