Hello and a happy new year to all,

thank you, Neal, for bringing this up.

Here are some use-cases from my personal experience:

- temporary storage to pass data between UIMA pipelines,
  e.g. one pipeline that does pre-processing, then N pipelines
  which perform additional processing independently, creating
  N different final results. This commonly happens when running
  parameter sweeping experiments.

- read/write working storage for a collaborative annotation system. Different
  users work on text annotation. This comes in two variants:
 
  a) every user has their own copy of the CAS

  b) all users work on the same CAS

  Binary CAS serialization is slightly problematic here because:

  - deleted annotations are not removed from the CAS

  - no new types can be added to the CAS

- read-only storage for annotated data in an application which supports
  searching and exploring annotated data, but does not make any changes
  to the data. In the case of CSniper [1], we query for sentences and then
  access the CAS to load the left and right context of the sentence 
  including annotations. We had to switch to using the binary CAS, because
  loading from XMI was too slow.

- archive storage for data that needs to be kept for historic reference
  such as results from experiments, results from manual annotation efforts,
  forensic results, etc. XMI and a type system stored as XML may still be
  the best option here.

> From what I've seen in the UIMA Oasis Spec Version 1.0, there isn't any
> discussion as to what would be a standard CAS Store.  If someone has more
> information on a UIMA backed store, please let me know.
> 
> Given  this interest, I was curious to ask the dev community:
> 
> What would you like to see in a CAS Store?  What kind of requirements have
> you had in your experience with UIMA, with respect to a CAS Store?

I would consider these as requirements:

- FAST for reading and writing (like binary CAS serialization, 
  not like XMI)

- simple API

- embeddable in Java applications - no separate server or process

- physically remove deleted annotations from a CAS in the storage. At least
  when writing a new version of the CAS, possibly by addressing and deleting
  individual annotations directly in the storage. 

- allow changes to the type system
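To make the "simple API" and "embeddable" points a bit more concrete, here is a minimal sketch of what such an interface might look like. Everything in it is hypothetical: CasStore and InMemoryCasStore are names made up for illustration, not part of UIMA, and the CAS is treated as an opaque byte array that has already been serialized by whatever mechanism is in use. Overwriting an entry replaces the old bytes entirely, which is one way to get rid of deleted annotations when a new version of a CAS is written.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical minimal store API (illustration only, not a UIMA interface). */
interface CasStore {
    /** Stores a serialized CAS; an existing entry with the same id is replaced. */
    void put(String id, byte[] serializedCas);

    /** Returns the serialized CAS, or an empty Optional if the id is unknown. */
    Optional<byte[]> get(String id);

    /** Removes the CAS; returns true if an entry was actually removed. */
    boolean delete(String id);
}

/** In-memory stand-in for an embedded store: no server, no separate process. */
final class InMemoryCasStore implements CasStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    @Override
    public void put(String id, byte[] serializedCas) {
        // Defensive copy so later changes to the caller's array do not leak in.
        blobs.put(id, Arrays.copyOf(serializedCas, serializedCas.length));
    }

    @Override
    public Optional<byte[]> get(String id) {
        // Hand out a copy so callers cannot modify the stored bytes.
        return Optional.ofNullable(blobs.get(id))
                .map(data -> Arrays.copyOf(data, data.length));
    }

    @Override
    public boolean delete(String id) {
        return blobs.remove(id) != null;
    }
}
```

A real implementation would write to disk rather than to a map, but the three operations are probably all a client needs to see.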

> Below is a list of requirements that I have gleaned from this board and my
> own experiences.  Please add or comment on what you think would be the most
> useful.  Please note that I'm not necessarily concerned with implementation
> (e.g., SQL vs NoSQL) at this time.
> 
>     1. Persist new CASes to the store

Delete CASes from the storage.

>     2. Query the store for a single CAS or a group of CASes
>     3. Query the store for a fragment  of a CAS (e.g., a sofa, view, or
> result)

I think storage and query are slightly different issues. If query is supported,
it should probably be similar to the functionality supported by FSIndex. That
would have come in handy in CSniper as well for loading the sentence contexts.
I don't think any query facilities beyond that should be implemented, such as
a Lucene full text search or a "semantic" search (cf. [2]).
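As a rough illustration of the kind of query meant here, the sketch below keeps annotation spans sorted by begin offset (ascending) and end offset (descending), loosely modeled on how an annotation index is ordered, and answers "which spans lie within this window?". Span and SpanIndex are hypothetical names for this example only; this is not the FSIndex API, and a real store would have to do this against the stored data rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for an annotation: a type name plus offsets. */
final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

/** Hypothetical offset-ordered index (illustration only, not FSIndex). */
final class SpanIndex {
    // Begin ascending, then end descending, so larger spans come first.
    private static final Comparator<Span> ORDER = (a, b) -> {
        if (a.begin != b.begin) {
            return Integer.compare(a.begin, b.begin);
        }
        return Integer.compare(b.end, a.end);
    };

    private final List<Span> sorted;

    SpanIndex(List<Span> spans) {
        this.sorted = new ArrayList<>(spans);
        Collections.sort(this.sorted, ORDER);
    }

    /** Returns all spans with begin >= windowBegin and end <= windowEnd. */
    List<Span> coveredBy(int windowBegin, int windowEnd) {
        List<Span> result = new ArrayList<>();
        // Linear scan with early exit; a real index would seek to windowBegin.
        for (Span s : sorted) {
            if (s.begin > windowEnd) {
                break; // sorted by begin, so nothing later can match
            }
            if (s.begin >= windowBegin && s.end <= windowEnd) {
                result.add(s);
            }
        }
        return result;
    }
}
```

For the CSniper case, the window would be the offsets of the sentence found by the query, widened to include the left and right context.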

>     4. Update stored CASes with new results from Analysis Operations -
> possibly the delta only
>     5. Provenance - This is one of our requirements where the ids of the
> CASes are maintained so as to provide evidence for our annotators after
> they've run on down stream analytics.
>     6. Universal identifiers for CASes.

There has been a discussion about this some time back, but to no avail. [3]

Following up on [3], it may be nice to allow storing some key/value meta-data
for the CAS which could carry information like a "name", date, author, etc.
and would be suitable for display in an application before actually accessing
or loading the content of the CAS. If fast selective access to annotations in
CASes is possible, this might not be necessary, as a meta-data annotation
like SourceDocumentInformation within the CAS could be used. 
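
A minimal sketch of the key/value idea, assuming the meta-data is kept in a small catalog next to the CAS content so that it can be listed without loading any CAS. CasMetadataCatalog and the keys used are made up for illustration; nothing like this exists in UIMA as far as this sketch is concerned.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical catalog of per-CAS key/value meta-data (illustration only). */
final class CasMetadataCatalog {
    // Sorted by CAS id so that listings come out in a stable order.
    private final Map<String, Map<String, String>> metadataById = new TreeMap<>();

    /** Sets one meta-data entry for a CAS, e.g. "name", "date" or "author". */
    void set(String casId, String key, String value) {
        metadataById
                .computeIfAbsent(casId, id -> new LinkedHashMap<>())
                .put(key, value);
    }

    /** Returns a read-only view of the meta-data; empty if the id is unknown. */
    Map<String, String> get(String casId) {
        Map<String, String> entries = metadataById.get(casId);
        if (entries == null) {
            return Collections.emptyMap();
        }
        return Collections.unmodifiableMap(entries);
    }

    /** Lists all known CAS ids without touching any CAS content. */
    List<String> ids() {
        return new ArrayList<>(metadataById.keySet());
    }
}
```

An application could then show a list of names and dates first and only load a CAS once the user picks one.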

Cheers,

-- Richard

[1] http://aclweb.org/anthology-new/P/P12/P12-3015.pdf
[2] http://dl.acm.org/citation.cfm?id=1030325
[3] http://markmail.org/message/2zdg5um45x4uc6c6

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD) 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
[email protected] 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------
