> Based on what I've seen so far, I'd like to summarize what I think everyone 
> was getting at with the functionality they'd like to see in a CAS Store, and 
> then I'll ask more questions:
> 
> Uses:
>     Maintain independence of annotations with respect to independent pipelines
>     Archival Storage
>     Temporary storage between independent pipelines 
>     Modify Type System 
>     Contain Meta Data about CAS 
>     Perform fine or coarse grained CRUD operations on a CAS 
>         
> 
> Functionality: 
>     Insert / Delete complete CAS(es)
>     Insert / Delete fragments of CAS(es) (individual SOFAs or FSes 
> [annotations])
>     Assemble CAS with SOFA and all / some / none feature structures 
>         - This would help reduce the size of CAS to its necessary components 
> before passing it to independent pipelines. It would also require the 
> construction of valid CASes for use in  Analytic Engines, complete with valid 
> Views
>          
>     Update CASes within the store (i.e., inserting annotations):
>         - This would allow for adding deltas from AEs 
> 
> Some of these might seem redundant, but I hope they give a general overview.  
> Does this seem to summarize it well?

The U from CRUD is missing from the functionality list: there seems to be no 
ability to update existing FSes, e.g. to change a feature value. I believe this 
is also not supported by the current delta-CAS support in UIMA.
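
Just to illustrate what I mean by such an update, against the plain CAS API (the 
type and feature names below are made up, and this of course only covers the 
in-memory CAS, not the store):

    // assuming "cas" is an org.apache.uima.cas.CAS
    Type personType = cas.getTypeSystem().getType("org.example.Person");
    Feature confidence = personType.getFeatureByBaseName("confidence");

    AnnotationFS person =
        (AnnotationFS) cas.getAnnotationIndex(personType).iterator().next();
    person.setFloatValue(confidence, 0.9f); // in-place update of an existing FS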

Other than that it appears like a good set to start with.

> Richard and Erik also mentioned a key point: whether they should be stored as 
> serialized binary or XMI. I understand that XMI is not as fast as binary, 
> but binary seems to provide little flexibility for fine-grained queries 
> (e.g. get all annotations of type Person), and speed is an issue. What 
> would the trade-off look like between sending fragments of CASes into a 
> pipeline and reading in a full CAS from serialized binary, or read / write 
> operations on fragments of a CAS within the Store?

The way you put it, it appears that XMI provides for fine-grained queries while 
the binary CAS does not. However, there is no support for fine-grained access 
in either format (deliberately ignoring that XMI is an XML format and could be 
stored in an XML database providing for fine-grained access).

As far as I understand it, the binary CAS format would even be more suitable for 
implementing fine-grained access, as it could be memory-mapped and annotations 
could be directly accessed on disk (random access) in very much the same way as 
the heaps of the in-memory CAS are accessed. Instead of a single file, one file 
per heap and per index may be more suitable, though. It might be possible, if 
not even straightforward, to do a file-based implementation of the LowLevelCAS 
interface which uses memory-mapping. The in-memory implementation of the 
LowLevelCAS has been folded into CASImpl instead of being a separate class that 
could be plugged into CASImpl, which would provide for different low-level 
storage backends of CASImpl. Somebody more familiar with that part of the code 
would have to comment on whether refactoring the LowLevelCAS implementation out 
of CASImpl and making it pluggable would be feasible.
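
Just to sketch the idea (this is not an existing UIMA API; the per-heap file 
name and the plain int-cell layout are assumptions on my part):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.IntBuffer;
    import java.nio.channels.FileChannel;

    // Read a single cell from a hypothetical per-heap file without
    // deserializing the whole CAS - roughly what ll_getIntValue() does
    // against the in-memory heap.
    int readHeapCell(String heapFile, int addr) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(heapFile, "r");
                FileChannel channel = raf.getChannel()) {
            IntBuffer heap = channel
                    .map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                    .asIntBuffer();
            return heap.get(addr); // random access to one heap cell
        }
    }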

The main problem of the binary CAS is that the type system is locked and cannot 
be changed. It shares this fact with the in-memory CAS, in which the type 
system likewise cannot be augmented after the CAS has been created and locked. 
Marshall suggested some time back relaxing this and allowing compatible changes 
to the type system (new types and features) even after the creation of the CAS.

A secondary problem of the binary CAS is that deleted annotations do not get 
physically removed; they are only removed from the indexes. This is another fact 
it shares with the in-memory CAS. Some form of garbage collection and IDs that 
remain stable across garbage collection runs would be useful, even for the 
in-memory CAS.
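
For illustration, this is the current in-memory behaviour I mean (Token being 
any JCas cover class):

    token.removeFromIndexes(); // the FS is no longer reachable via the indexes...
    // ...but its heap cells are not reclaimed and other FSes may still
    // reference it; a garbage collection pass would have to compact the heap,
    // and FS addresses/IDs would need to stay stable across that.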

The main problem of XMI is that it is slow. 

A secondary problem is that it, being an XML format, does not support certain 
characters. Another secondary problem is that random access to data in the XMI 
file is not possible - it must be streamed.

> Also as Richard mentioned, a key part about maintaining a store would be to 
> also maintain a reliable addressing system for individual CASes and 
> annotations. We are using OIDs that we call Feature Structure ID (FSID). 
> Our current FSIDs are broken down into 4 levels: 
> CollectionID.CASID.AnnotatorID.AnnotationID, where CollectionID refers to a 
> collection of CASes that share some quality, the AnnotatorID identifies the 
> AE responsible for producing the annotations, and the Annotation ID 
> identifies the feature structure within a serialized CAS XMI. This provides 
> us the ability to perform fine / coarse grained CRUD operations within our 
> CAS store - we can insert/remove/update/delete an Annotation, create a CAS 
> from a list of FSIDS that only contain the feature structures that are 
> necessary for any Processing Element. 

The FSID format you suggest encodes a lot of semantics. I think an ID for a CAS 
and an ID which reliably identifies an FS within a CAS would be sufficient to 
start with. Both IDs should be independent, with no particular defined 
concatenation scheme. If the other levels of IDs are required, it should be up 
to the application to define and use them. For example, I might want to use a 
URL as collection ID or not have one at all. The ID as you describe it seems to 
imply that there is a special relation between an annotator and the annotations 
it generates, which may or may not be true or desired.
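
A minimal sketch of what I mean by independent IDs (the class and field names 
are of course not a concrete proposal):

    // Two independent IDs, no defined concatenation scheme between them.
    final class CasId {
        final String value; // e.g. a UUID, or a URL if the application wants one
        CasId(String value) { this.value = value; }
    }

    final class FsId {
        final CasId casId; // the CAS the FS lives in
        final long id;     // stable ID of the FS within that CAS
        FsId(CasId casId, long id) { this.casId = casId; this.id = id; }
    }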

Are there additional requirements hidden in the FSID format? I could imagine:

- ability to get all FSes produced by a certain annotator across all CASes in 
all collections or in a certain collection

- ability to get all CASes in a collection

- ability to get all CASes

Where do you get the annotatorId from? I see no sensible way that the UIMA 
framework can provide such an ID. There is also potential for conflict: consider 
the case where analysis engine A creates an FS and analysis engine B updates a 
primitive feature in that FS. Assuming that primitives do not get an FSID since 
they are not FSes, should the annotatorId of the FS be updated to B or should it 
remain A?

> As for API and Java implementations - would JDBC be sufficient? 

So far I have used JDBC only with SQL databases. I don't believe its API is 
well suited for dealing with CASes and FSes. E.g. a JDBC result set resembles 
tuples from a table, but when we work with FSes, we actually operate on an 
object graph. So the CAS provides more of a JPA-like access mechanism than a 
JDBC-like access mechanism. It also appears to me that JDBC offers much more 
functionality than would be required for a CAS store. How about storage-backed 
implementations of FSIndex and maybe of CAS itself? JDBC also has the issue 
that the query strings are opaque to the compiler and to IDEs, meaning: no 
type safety and no refactoring support.
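
Just to illustrate that last point, this is the kind of query string that 
neither the compiler nor an IDE can check (the table and column names are of 
course made up):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // A hypothetical "annotation" table, queried via plain JDBC.
    void printPersons(Connection con) throws SQLException {
        try (Statement stmt = con.createStatement();
                ResultSet rs = stmt.executeQuery(
                    "SELECT begin_pos, end_pos FROM annotation WHERE type = 'Person'")) {
            while (rs.next()) {
                // a typo in the SQL string or a renamed column only fails at runtime
                System.out.println(rs.getInt("begin_pos") + "-" + rs.getInt("end_pos"));
            }
        }
    }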

A while back, we discussed alternative access methods for the CAS in the 
uimaFIT project. uimaFIT provides convenience methods to access the in-memory 
CAS. Consider this:

for (Token t : JCasUtil.select(jcas, Token.class)) {
    // …
}

The uimaFIT API currently doesn't support predicates on features, for example. 
We considered the UIMA Constraint API too complex to use and came up with an 
"SQLCAS" approach (due to some resemblance to SQL notation, not due to an SQL 
backend or JDBC being used) - the idea being that it should still be plain Java 
and type-safe. Steven Bethard did a prototype implementation which supports 
something like this:

    DocumentAnnotation document =
        CasQuery.from(this.jCas).select(DocumentAnnotation.class).single();
    Iterator<Sentence> sentences =
        CasQuery.from(this.jCas).select(Sentence.class).iterator();
    Collection<Token> tokens =
        CasQuery.from(this.jCas).select(Token.class).coveredBy(sentence);
    Token token =
        CasQuery.from(this.jCas).select(Token.class).matching(annotation).single();
    Chunk chunk = CasQuery.from(this.jCas).select(Chunk.class).zeroOrOne();

More discussion on this topic, different approaches/syntaxes, and a patch for 
uimaFIT can be found in the uimaFIT issue tracker [1].

If implementing an existing standard API was a requirement, JPA 2.0 (in 
particular the criteria API) would probably provide a better level of 
abstraction than JDBC. JDO might be another (possibly better) alternative [2]. 
So far, I have only had a very brief rendezvous with JDO on the Google App 
Engine and quickly dropped it again in favor of JPA because I found the latter 
more suitable for dependency injection frameworks (the @PersistenceContext 
annotation can be used to inject an EntityManager into a class; there is no 
equivalent annotation for JDO). JPA seems to get more "love" from tools and 
vendors, but JDO might be conceptually a better fit, cf. the following comment 
on IBM developerWorks [3] and other interesting comments in the same thread:

  PinakiPoddar commented Jan 23 2011:

  I had the unique honor of participating in both JDO and 
  JPA Expert group. It is unique because the two groups have
  little overlap. The answers to your question on JDO or JPA
  -- the critical difference is about datastore technology
  these two specifications aim to support. JPA is limited to 
  only relational database as data storage, whereas JDO is
  agnostic to data store technology. 
 
  The unequal prominence of these two specifications that have
  similar goals reflects the prevalence of relational database
  as a storage of data as compared to other non-relational
  storage mechanics. However, resurgence of interest in
  non-relational storage systems such as NoSQL may highlight
  the importance of JDO's original aim to support multiple
  data storage technologies.

To put this into perspective, JPA is also supported on Google's App Engine 
datastore, which is backed by the BigTable NoSQL storage engine. Also, the CAS 
is quite similar in structure to an RDBMS.
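
For comparison, a rough sketch of what a query against a store could look like 
through the JPA 2.0 criteria API mentioned above, assuming FSes were mapped to 
entities (PersonAnnotation and its "casId" attribute are purely hypothetical):

    import java.util.List;
    import javax.persistence.EntityManager;
    import javax.persistence.criteria.CriteriaBuilder;
    import javax.persistence.criteria.CriteriaQuery;
    import javax.persistence.criteria.Root;

    // PersonAnnotation is a made-up entity mapping of a Person FS.
    List<PersonAnnotation> findPersons(EntityManager em, String casId) {
        CriteriaBuilder cb = em.getCriteriaBuilder();
        CriteriaQuery<PersonAnnotation> query = cb.createQuery(PersonAnnotation.class);
        Root<PersonAnnotation> person = query.from(PersonAnnotation.class);
        query.select(person).where(cb.equal(person.get("casId"), casId));
        // unlike a JDBC query string, the entity type is checked by the compiler
        return em.createQuery(query).getResultList();
    }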

Cheers,

-- Richard


[1] https://code.google.com/p/uimafit/issues/detail?id=65
[2] http://db.apache.org/jdo/jdo_v_jpa.html
[3] 
https://www.ibm.com/developerworks/mydeveloperworks/blogs/jdo/entry/december_10_2010_2_12_am38?lang=en_us#comment-1295782222672

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD) 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
[email protected] 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------
