Thank you all for your responses!

Based on what I've seen so far, I'd like to summarize what I think everyone
was getting at with the functionality they'd like to see in a CAS Store,
and then I'll ask more questions:

Uses:
    Maintain independence of annotations with respect to independent
pipelines
    Archival Storage
    Temporary storage between independent pipelines
    Modify Type System
    Contain Meta Data about CAS
    Perform fine or coarse grained CRUD operations on a CAS


Functionality:
    Insert / Delete complete CAS(es)
    Insert / Delete fragments of CAS(es) (individual SOFAs or FSes
[annotations])
    Assemble CAS with SOFA and all / some / none feature structures
        - This would help reduce the size of a CAS to its necessary
components before passing it to independent pipelines. It would also
require the construction of valid CASes for use in Analysis Engines,
complete with valid Views

    Update CASes within the store (i.e., inserting annotations):
        - This would allow for adding deltas from AEs
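
To make the list above concrete, here is a minimal in-memory sketch of those operations in plain Java. Everything in it is hypothetical: StoredFs, StoredCas and InMemoryCasStore are illustrative stand-ins, not UIMA classes, and a real store would need views, a type system and real persistence. The "all / some / none" assembly is expressed as a filter predicate over the stored feature structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for a feature structure (annotation); not a UIMA class. */
final class StoredFs {
    final String type;
    final int begin;
    final int end;

    StoredFs(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

/** Hypothetical stand-in for a stored CAS: one sofa plus its feature structures. */
final class StoredCas {
    final String sofaText;
    final List<StoredFs> featureStructures;

    StoredCas(String sofaText, List<StoredFs> featureStructures) {
        this.sofaText = sofaText;
        // Defensive copy so a stored CAS cannot be changed from outside.
        this.featureStructures =
                Collections.unmodifiableList(new ArrayList<>(featureStructures));
    }
}

/** Illustrative store keyed by an assumed string CAS id. */
final class InMemoryCasStore {
    private final Map<String, StoredCas> casesById = new HashMap<>();

    /** Insert (or overwrite) a complete CAS. */
    void insert(String casId, StoredCas cas) {
        casesById.put(casId, cas);
    }

    /** Delete a complete CAS; returns false if nothing was stored under the id. */
    boolean delete(String casId) {
        return casesById.remove(casId) != null;
    }

    /** Update a stored CAS by appending a delta of new feature structures. */
    void addDelta(String casId, List<StoredFs> delta) {
        StoredCas current = require(casId);
        List<StoredFs> merged = new ArrayList<>(current.featureStructures);
        merged.addAll(delta);
        casesById.put(casId, new StoredCas(current.sofaText, merged));
    }

    /** Assemble a CAS with its sofa and only the feature structures the filter keeps. */
    StoredCas assemble(String casId, Predicate<StoredFs> keep) {
        StoredCas current = require(casId);
        List<StoredFs> kept = new ArrayList<>();
        for (StoredFs fs : current.featureStructures) {
            if (keep.test(fs)) {
                kept.add(fs);
            }
        }
        return new StoredCas(current.sofaText, kept);
    }

    private StoredCas require(String casId) {
        StoredCas cas = casesById.get(casId);
        if (cas == null) {
            throw new IllegalArgumentException("No CAS stored under id: " + casId);
        }
        return cas;
    }
}
```

With this shape, assemble(id, fs -> true) returns the sofa with all feature structures, assemble(id, fs -> false) returns the sofa alone, and a type test such as fs -> fs.type.equals("Person") gives the "some" case.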


Some of these might seem redundant, but I hope they give a general
overview.  Does this seem to summarize it well?

Richard and Erik also mentioned a key point: whether CASes should be stored
as serialized binary or XMI.  I understand that XMI is not as fast as
binary, but binary seems to provide little flexibility for fine-grained
queries (e.g., get all annotations of type Person), so there is a tension
between speed and queryability.  What would the trade-off look like between
sending fragments of CASes into a pipeline and reading in a full CAS from
serialized binary, or performing read / write operations on fragments of a
CAS within the Store?

Also, as Richard mentioned, a key part of maintaining a store would be to
also maintain a reliable addressing system for individual CASes and
annotations.  We are using OIDs that we call Feature Structure IDs (FSIDs).
Our current FSIDs are broken down into 4 levels:
CollectionID.CASID.AnnotatorID.AnnotationID, where the CollectionID refers
to a collection of CASes that share some quality, the AnnotatorID
identifies the AE responsible for producing the annotations, and the
AnnotationID identifies the feature structure within a serialized CAS XMI.
This gives us the ability to perform fine / coarse grained CRUD operations
within our CAS store: we can insert / update / delete an annotation, or
create a CAS from a list of FSIDs that contains only the feature structures
necessary for a given Processing Element.
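
As a rough illustration of the four-level addressing, here is a small sketch of an FSID value type in plain Java. The class name Fsid, its methods, and the assumption that every level is a non-negative-looking integer are mine for illustration only; our actual implementation may differ.

```java
/** Hypothetical value type for CollectionID.CASID.AnnotatorID.AnnotationID. */
final class Fsid {
    final long collectionId;
    final long casId;
    final long annotatorId;
    final long annotationId;

    Fsid(long collectionId, long casId, long annotatorId, long annotationId) {
        this.collectionId = collectionId;
        this.casId = casId;
        this.annotatorId = annotatorId;
        this.annotationId = annotationId;
    }

    /** Parses a dotted FSID; numeric levels are an assumption of this sketch. */
    static Fsid parse(String text) {
        // Limit -1 keeps trailing empty parts, so "1.2.3." is rejected below.
        String[] parts = text.split("\\.", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 levels: " + text);
        }
        try {
            return new Fsid(
                    Long.parseLong(parts[0]),
                    Long.parseLong(parts[1]),
                    Long.parseLong(parts[2]),
                    Long.parseLong(parts[3]));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Non-numeric level in: " + text, e);
        }
    }

    /** Coarse-grained address: the prefix shared by everything in one CAS. */
    String casPrefix() {
        return collectionId + "." + casId;
    }

    /** True if both FSIDs address feature structures in the same CAS. */
    boolean sameCas(Fsid other) {
        return collectionId == other.collectionId && casId == other.casId;
    }

    @Override
    public String toString() {
        return collectionId + "." + casId + "." + annotatorId + "." + annotationId;
    }
}
```

The point of the prefix helper is that coarse-grained operations (fetch or delete a whole CAS) and fine-grained ones (a single annotation) can share one identifier scheme, with the coarse case matching on a shorter prefix.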

As for API and Java implementations - would JDBC be sufficient?





From:   Neal R Lewis/Almaden/IBM@IBMUS
To:     [email protected],
Date:   01/08/2013 11:38 AM
Subject:        Requirements / Wish List for CAS Store?





Hello All, and Happy New Year!

We've been working on our own CAS Store for persisting CASes for our
analytics platform.  There has been interest in this topic recently,
specifically:

http://article.gmane.org/gmane.comp.apache.uima.devel/15292

Renaud also discussed a CAS Store module using MongoDB:

http://article.gmane.org/gmane.comp.apache.uima.devel/15429

From what I've seen in the UIMA OASIS Spec Version 1.0, there isn't any
discussion as to what would be a standard CAS Store.  If someone has more
information on a UIMA-backed store, please let me know.

Given this interest, I was curious to ask the dev community:

What would you like to see in a CAS Store?  What kind of requirements have
you had in your experience with UIMA, with respect to a CAS Store?

As was mentioned in the above threads, the impetus for a store seems to be
the need for a way to store CASes that will be used later by a different
analytic pipeline while still maintaining all CAS information.

Below is a list of requirements that I have gleaned from this board and my
own experiences.  Please add or comment on what you think would be the most
useful.  Please note that I'm not necessarily concerned with implementation
(e.g., SQL vs NoSQL) at this time.

    1. Persist new CASes to the store
    2. Query the store for a single CAS or a group of CASes
    3. Query the store for a fragment of a CAS (e.g., a sofa, view, or
result)
    4. Update stored CASes with new results from Analysis Operations -
possibly the delta only
    5. Provenance - This is one of our requirements: the IDs of the
CASes are maintained so as to provide evidence for our annotators after
they've run in downstream analytics.
    6. Universal identifiers for CASes.


I can go into more detail about the above, if anyone is interested.

Please let me know your thoughts!

Thanks!


Neal Lewis
