2011/6/22 Jörn Kottmann <[email protected]>:
> On 6/22/11 6:50 PM, Olivier Grisel wrote:
>>
>> I am ok with switching to UIMA CAS. We might need additional metadata
>> outside of the CAS annotations though. For instance, if an annotator
>> fixes a typo in the Sofa itself, we might need to be able to tell
>> that Sofa1 is subject to being replaced by Sofa2 according to
>> annotator A1.
>>
>
> I am not sure if we should fix such mistakes, the system will also encounter
> them in real data it needs to process. Fixing typos, or correcting things in
> the text, is always difficult when there are already existing annotations.
>
> Do you feel fixing mistakes in the text is important?

We can leave that issue as a low-priority discussion for later and just
ignore it for now.

> We can also fix by having an option to delete "garbage" texts from the
> corpus.

Yes, discarding a whole CAS. But if the CAS is document level instead of
sentence level, that might be an issue.

> What other kind of data do you think we should store outside the CASes?

If we ignore the Sofa editing use case, probably nothing.

>> Also do you know of a good database for storing CAS? For instance does
>> there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
>> a JDBC CASConsumer + CollectionReader that we could use with Apache
>> Derby for instance?
>
> I did a couple of tests with HBase and it was very easy to store 100M of
> CASes, anyway we do not really need to scale to such huge amounts, so I
> believe a NoSQL or relational database would be just fine.

I am -1 for HBase as it requires setting up a Hadoop cluster to run. As
we target human annotators, we won't have terabytes of text data anyway
and all data will probably fit in memory in most cases. I was thinking
about using a DB to be able to handle concurrent editing by several
annotators (+ the ability to search the Sofa content) in a simple way.

> To get started I believe we should just store a CAS as XMI and in a later
> stage we can work on optimizing the CAS storage to our needs and maybe
> even work together with the UIMA team on a more general corpus server, I
> know several people who have interest in this.

Alright. Let's use plain XMI files, parsed and loaded in memory at the
beginning of an annotation session.

> I believe the Corpus server should be independent of the other components
> and define some kind of remote API for data interchange.

Is there a JSON version of XMI? Hannes, what is your opinion on this?

> If we define such an API the actual storage system can be interchanged
> easily at a later point in time.

Ok.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
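
To make the "storage-independent API" idea above a bit more concrete, here is a rough sketch of what such an interface could look like, with an in-memory backend as discussed. Nothing here is an existing UIMA or OpenNLP API: `CorpusStore`, `InMemoryCorpusStore` and all method names are hypothetical, the XMI payload is treated as an opaque string, and `search` is only a naive substring match over that string, not a real search of the Sofa content.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class CorpusStore(ABC):
    """Hypothetical storage-independent corpus interface (illustrative names)."""

    @abstractmethod
    def put(self, cas_id: str, xmi: str) -> None:
        """Store or overwrite the serialized CAS for cas_id."""

    @abstractmethod
    def get(self, cas_id: str) -> str:
        """Return the serialized CAS for cas_id (KeyError if missing)."""

    @abstractmethod
    def delete(self, cas_id: str) -> None:
        """Discard a whole CAS, e.g. a "garbage" text."""

    @abstractmethod
    def search(self, text: str) -> List[str]:
        """Return the ids of the CASes whose payload contains text."""


class InMemoryCorpusStore(CorpusStore):
    """Keeps every serialized CAS in a dict; assumes the corpus fits in memory."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def put(self, cas_id: str, xmi: str) -> None:
        self._docs[cas_id] = xmi

    def get(self, cas_id: str) -> str:
        return self._docs[cas_id]

    def delete(self, cas_id: str) -> None:
        # Deleting an unknown id is a no-op.
        self._docs.pop(cas_id, None)

    def search(self, text: str) -> List[str]:
        # Naive substring match over the raw payload, sorted for stable output.
        return sorted(i for i, xmi in self._docs.items() if text in xmi)


if __name__ == "__main__":
    store: CorpusStore = InMemoryCorpusStore()
    store.put("doc1", "<xmi>Paris is the capital of France.</xmi>")
    store.put("doc2", "<xmi>asdf qwerty</xmi>")
    store.delete("doc2")
    print(store.search("Paris"))
```

A database-backed or remote implementation could later replace `InMemoryCorpusStore` without callers changing, as long as it implements the same four methods. Concurrent editing by several annotators is not handled in this sketch.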
