Re: OpenNLP Annotations Proposal

Jörn Kottmann Wed, 22 Jun 2011 10:46:23 -0700

On 6/22/11 6:50 PM, Olivier Grisel wrote:

I am ok with switching to UIMA CAS. We might need additional metadata
outside of the CAS annotations though. For instance, if an annotator
fixes a typo in the Sofa itself, we might need to be able to tell
that Sofa1 is subject to being replaced by Sofa2 according to
annotator A1.


I am not sure we should fix such mistakes; the system will also encounter
them in the real data it needs to process. Fixing typos or correcting things
in the text is always difficult when there are already existing annotations.

Do you feel fixing mistakes in the text is important?

We can also handle this by having an option to delete "garbage" texts from
the corpus.


What other kinds of data do you think we should store outside the CASes?

Also do you know of a good database for storing CAS? For instance does
there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
a JDBC CASConsumer + CollectionReader that we could use with Apache
Derby for instance?

I did a couple of tests with HBase and it was very easy to store 100M of
CASes. Anyway, we do not really need to scale to such huge amounts, so I
believe a NoSQL or relational database would be just fine.

To get started, I believe we should just store a CAS as XMI, and at a later
stage we can work on optimizing the CAS storage for our needs and maybe even
work together with the UIMA team on a more general corpus server. I know
several people who are interested in this.

I believe the Corpus server should be independent of the other components
and define some kind of remote API for data interchange.

If we define such an API, the actual storage system can be exchanged easily
at a later point in time.
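
To make the idea concrete, here is a minimal sketch of what a
storage-independent contract could look like. Everything in it is an
assumption for illustration: the names CorpusStore, InMemoryCorpusStore and
CorpusStoreDemo are hypothetical and not part of OpenNLP or UIMA, a CAS is
treated as an opaque XMI byte array, and the remote transport is left out
entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical storage-independent contract; not an existing OpenNLP or UIMA API. */
interface CorpusStore {
    /** Stores (or overwrites) the XMI serialization of one CAS under the given id. */
    void put(String casId, byte[] xmi);

    /** Returns the stored XMI bytes, or an empty Optional if the id is unknown. */
    Optional<byte[]> get(String casId);

    /** Removes a CAS, e.g. a "garbage" text; returns true if something was deleted. */
    boolean delete(String casId);

    /** Returns the ids of all stored CASes in sorted order. */
    Set<String> ids();
}

/** In-memory stand-in; an HBase, CouchDB or JDBC backend could implement the same interface. */
final class InMemoryCorpusStore implements CorpusStore {
    private final Map<String, byte[]> documents = new ConcurrentHashMap<>();

    @Override
    public void put(String casId, byte[] xmi) {
        // Copy defensively so later changes to the caller's array do not leak in.
        documents.put(casId, xmi.clone());
    }

    @Override
    public Optional<byte[]> get(String casId) {
        byte[] xmi = documents.get(casId);
        return xmi == null ? Optional.empty() : Optional.of(xmi.clone());
    }

    @Override
    public boolean delete(String casId) {
        return documents.remove(casId) != null;
    }

    @Override
    public Set<String> ids() {
        return new TreeSet<>(documents.keySet());
    }
}

class CorpusStoreDemo {
    public static void main(String[] args) {
        // Callers depend only on the interface, so the backend can be swapped later.
        CorpusStore store = new InMemoryCorpusStore();
        store.put("doc-1", "<xmi:XMI/>".getBytes(StandardCharsets.UTF_8));
        store.put("doc-2", "<xmi:XMI/>".getBytes(StandardCharsets.UTF_8));
        store.delete("doc-2");
        System.out.println(store.ids());
    }
}
```

Code written against CorpusStore would not need to change if the in-memory
map were later replaced by a database-backed implementation.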

Jörn
