Re: Requirements / Wish List for CAS Store?

Neal R Lewis Mon, 14 Jan 2013 10:14:51 -0800

>The way you put it, it appears that XMI provides for fine grained queries 
>while the binary CAS does not. However, there is no support for fine grained 
>access in either format (deliberately ignoring that XMI is an XML format and 
>could be stored in an XML database providing for fine-grained access).

> Are there additional requirements hidden in the FSID format? I could imagine:

> - ability to get all FSes produced by a certain annotator across all CASes in 
> all collections or in a certain collection

> - ability to get all CASes in a collection

> - ability to get all CASes

> Where do you get the annotatorId from? I see no sensible way that the UIMA 
> framework can provide such an ID. There is also a conflict potential. 
> Consider if analysis engine A creates an FS and analysis engine B updates a 
> primitive feature in that FS. Assuming that primitives do not get an FSID 
> since they are not FSes, should the annotatorID of the FS be updated to B or 
> should it remain A?

We currently assign our own annotator IDs.  Annotator IDs are meant to 
distinguish between different UIMA applications, not individual AEs in an 
aggregate or within the same pipeline.  I meant to use the term AnalyticID, 
which is more precise, so I'll use that from now on.  In the strictest sense, 
you can imagine different AnalyticIDs for different PEARs written by separate 
developers for separate annotators, but run on the same CAS.  Now, a conflict 
might occur if they both had the same Type System and annotated the same CAS. 
This would result in a duplicate annotation for that type, but not in a 
duplicate FS, because each FS would have a different FSID associated with it.

So, if only one UIMA application is performing the analysis, then the 
AnalyticID can remain stable throughout operations, perhaps with a default 
value if only one UIMA application is run.

The benefit of having a CAS store goes beyond archiving information: it also 
lets us track a CAS's trajectory through analytics. It allows analytics to be 
developed independently.  When a new analytic is developed (and here, I mean a 
new PEAR that will most likely run in a new JVM), we can run it on an old 
collection of CASes.  If an analytic is small and only needs one or two objects 
from the CAS, then we can reduce message size by retrieving only those objects 
which are necessary.  FSIDs and XMI serialization allow us to do this.

With FSIDs, there is a possibility to query for a particular element or groups 
of elements, or even multiple CASes.  Some queries are more complex than others 
(like getting all annotations from a particular annotator across CASes) but 
still manageable.  I'll try to illustrate with an example from a hypothetical 
deserialized CAS:

The following CAS was queried for an fsid with Collection ID of 3, artifact ID 
of 15, and AnalyticID of 6000.  The 6000 analytic looked for sentences like 
"lvef of 30%" and annotated the sentence and the value of the lvef (ignore for 
now the cop element, which is an element we use to track provenance):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" 
xmlns:cdts="http:///org/test/health/cdts.ecore" 
xmlns:tcas="http:///uima/tcas.ecore" xmi:version="2.0">
<cdts:LVEF begin="449" value="20-25%" cop=".3.15.1000.1" end="490" 
fsid=".3.15.6000.1" sofa="1" xmi:id="13"/>
</xmi:XMI>

This CAS fragment is given what we call a "transient view", or projection, 
during a preprocessing step before it runs through a PEAR in UIMAj (to which we 
externally assign an AnalyticID of 6003). The PEAR looks for the value of the 
LVEF and maps it to a term. The result is filtered for only new objects in the 
CAS and then put back into the store, where the store writes in a new fsid:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" 
xmlns:cdts="http:///org/test/health/cdts.ecore" 
xmlns:tcas="http:///uima/tcas.ecore" xmi:version="2.0">
<cdts:SeverelyDepressedLVEF begin="449" time="2001-12-31T12:00:00" 
value="20-25%" cop=".3.15.6000.1" end="490" fsid=".3.15.6003.1" sofa="1" 
xmi:id="13"/>
</xmi:XMI>

This is what I mean by fine grained queries using FSID.   We can also imagine a 
coarse query for Collection 3, CAS 15 (fsid like '.3.15.%'), which will 
produce a full CAS.
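
To make the fine and coarse grained lookups concrete, here is a minimal Java 
sketch of the four-level FSID. The class name Fsid, its fields, and the 
matches/select helpers are hypothetical and only for illustration; they are 
not part of UIMA, and a real store would likely push the filter into its 
backend. A null argument plays the role of the '%' wildcard in the pattern 
above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical four-level FSID: .CollectionID.CASID.AnalyticID.AnnotationID */
final class Fsid {
    final int collectionId;
    final int casId;
    final int analyticId;
    final int annotationId;

    Fsid(int collectionId, int casId, int analyticId, int annotationId) {
        this.collectionId = collectionId;
        this.casId = casId;
        this.analyticId = analyticId;
        this.annotationId = annotationId;
    }

    /** Parses the dotted form used in the XMI examples, e.g. ".3.15.6000.1". */
    static Fsid parse(String text) {
        // The leading dot produces an empty first element, so five parts are expected.
        String[] parts = text.split("\\.");
        if (parts.length != 5 || !parts[0].isEmpty()) {
            throw new IllegalArgumentException("Not a four-level FSID: " + text);
        }
        return new Fsid(Integer.parseInt(parts[1]), Integer.parseInt(parts[2]),
                Integer.parseInt(parts[3]), Integer.parseInt(parts[4]));
    }

    /** Level-wise match; a null argument acts as a wildcard for that level. */
    boolean matches(Integer collection, Integer cas, Integer analytic) {
        return (collection == null || collection == collectionId)
                && (cas == null || cas == casId)
                && (analytic == null || analytic == analyticId);
    }

    /** Returns the FSIDs that match, in their original order. */
    static List<Fsid> select(List<Fsid> all, Integer collection, Integer cas,
            Integer analytic) {
        List<Fsid> hits = new ArrayList<>();
        for (Fsid id : all) {
            if (id.matches(collection, cas, analytic)) {
                hits.add(id);
            }
        }
        return hits;
    }

    @Override
    public String toString() {
        return "." + collectionId + "." + casId + "." + analyticId + "." + annotationId;
    }
}
```

In this sketch, Fsid.select(all, 3, 15, null) corresponds to the coarse query 
fsid like '.3.15.%', while Fsid.select(all, 3, 15, 6000) narrows it to the 
objects written by one analytic.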

Does this answer some of your questions?

-----Richard Eckart de Castilho <[email protected]> wrote: 
-----To: "<[email protected]>" <[email protected]>
From: Richard Eckart de Castilho <[email protected]>
Date: 01/09/2013 07:49PM
Subject: Re: Requirements / Wish List for CAS Store?

> Based on what I've seen so far, I'd like to summarize what I think everyone 
> was getting at with the functionality they'd like to see in a CAS Store, and 
> then I'll ask more questions:
> 
> Uses:
>     Maintain independence of annotations with respect to independent pipelines
>     Archival Storage
>     Temporary storage between independent pipelines 
>     Modify Type System 
>     Contain Meta Data about CAS 
>     Perform fine or coarse grained CRUD operations on a CAS 
>         
> 
> Functionality: 
>     Insert / Delete complete CAS(es)
>     Insert / Delete fragments of CAS(es) (individual SOFAs or FSes 
> [annotations])
>     Assemble CAS with SOFA and all / some / none feature structures 
>         - This would help reduce the size of CAS to its necessary components 
> before passing it to independent pipelines. It would also require the 
> construction of valid CASes for use in  Analytic Engines, complete with valid 
> Views
>          
>     Update CASes within the store (i.e, inserting annotations):
>         - This would allow for adding deltas from AEs 
> 
> Some of these might seem redundant, but I hope they give a general overview.  
> Does this seem to summarize it well ?

The U from CRUD is missing in the functionalities list: an ability to update 
existing FSes, e.g. changing a feature value, seems to be missing. I believe 
this is also not supported by the current delta-CAS support in UIMA.

Other than that it appears like a good set to start with.

> Richard and Erik also mentioned a key point: whether they should be stored as 
> serialized binary or XMI.  I understand that XMI is not as fast as binary, 
> but the binary seems to provide little flexibility for fine grained 
> queries (e.g. get all annotations of type Person), and speed is an issue. 
>  What would the trade-off look like between sending fragments of CASes into 
> a pipeline and reading in a full CAS from serialized binary, or read / write 
> operations on fragments of a CAS within the Store?   

The way you put it, it appears that XMI provides for fine grained queries while 
the binary CAS does not. However, there is no support for fine grained access 
in either format (deliberately ignoring that XMI is an XML format and could be 
stored in an XML database providing for fine-grained access).

As far as I understand it, the binary CAS format would even be more suitable to 
implement a fine-grained access, as it could be memory-mapped and annotations 
could be directly accessed on disk (random access) in very much the same way as 
the heaps of the in-memory CAS are accessed. Instead of a single file, one file 
per heap and per index may be more suitable, though. It might be possible, if 
not even straight-forward, to do a file-based implementation of the LowLevelCAS 
interface which uses memory-mapping. The in-memory implementation of the 
LowLevelCAS has been folded into CASImpl instead of being a separate class that 
could be plugged into CASImpl, which would provide for different low-level 
storage backends of CASImpl. Somebody more familiar with that part of the code 
would have to comment on whether refactoring the LowLevelCAS implementation out 
of CASImpl and making it pluggable would be feasible or not.
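
To illustrate the memory-mapping idea, here is a rough Java sketch that reads 
one fixed-size record from a mapped file without loading the rest. The record 
layout (three ints: type code, begin, end) and the class MappedHeapSketch are 
assumptions made up for this example. This is not the actual binary CAS format 
and not an implementation of LowLevelCAS; it only shows the kind of random 
access that java.nio mapping makes possible.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedHeapSketch {
    /** Assumed layout: each record is three ints (type code, begin, end). */
    static final int INTS_PER_RECORD = 3;

    /** Writes the records as consecutive ints through a read-write mapping. */
    static void write(Path file, int[][] records) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long bytes = (long) records.length * INTS_PER_RECORD * Integer.BYTES;
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
            IntBuffer heap = map.asIntBuffer();
            for (int[] record : records) {
                heap.put(record, 0, INTS_PER_RECORD);
            }
            map.force(); // flush the mapped region to the file
        }
    }

    /** Reads one record by index; only the touched pages need to be loaded. */
    static int[] read(Path file, int index) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            IntBuffer heap =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
            int base = index * INTS_PER_RECORD;
            return new int[] { heap.get(base), heap.get(base + 1), heap.get(base + 2) };
        }
    }

    /** Round trip through a temporary file; returns the second record. */
    static int[] demo() {
        try {
            Path file = Files.createTempFile("heap-sketch", ".bin");
            file.toFile().deleteOnExit();
            write(file, new int[][] { { 1, 0, 5 }, { 2, 449, 490 } });
            return read(file, 1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With one file per heap and per index, as suggested above, each file could be 
mapped separately in the same way.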

The main problem of the binary CAS is that the type system is locked and cannot 
be changed. It shares this fact with the in-memory CAS, in which the type 
system likewise cannot be augmented after the CAS has been created and locked. 
Marshal suggested some time back to relax this and allow compatible changes to 
the type system to be made (new types and features) even after the creation of 
the CAS.

A secondary problem of the binary CAS is that deleted annotations do not get 
physically removed, they are only removed from the indexes. This is another fact 
it shares with the in-memory CAS. Some form of garbage collection and IDs that 
remain stable across garbage collection runs would be useful, even for the 
in-memory CAS.
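
One way to picture IDs that survive garbage collection is an indirection table 
from ID to heap address. The following Java sketch is purely illustrative: the 
class CompactingStoreSketch and its methods are hypothetical, and feature 
structures are stood in for by plain strings. Nothing here reflects how CASImpl 
works today.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CompactingStoreSketch {
    // Heap slots; null marks a slot whose entry was deleted but not yet reclaimed.
    private final List<String> heap = new ArrayList<>();
    // Stable ID -> current heap address.
    private final Map<Integer, Integer> idToAddress = new HashMap<>();
    private int nextId = 1;

    /** Appends an entry and returns its stable ID. */
    int add(String fs) {
        heap.add(fs);
        idToAddress.put(nextId, heap.size() - 1);
        return nextId++;
    }

    /** Logical delete: the slot stays allocated until compact() runs. */
    void delete(int id) {
        Integer address = idToAddress.remove(id);
        if (address != null) {
            heap.set(address, null);
        }
    }

    /** Looks up an entry by stable ID, or returns null if it was deleted. */
    String get(int id) {
        Integer address = idToAddress.get(id);
        return address == null ? null : heap.get(address);
    }

    int heapSize() {
        return heap.size();
    }

    /** Physically removes deleted slots; IDs stay valid, only addresses change. */
    void compact() {
        List<String> live = new ArrayList<>();
        Map<Integer, Integer> moved = new HashMap<>(); // old address -> new address
        for (int old = 0; old < heap.size(); old++) {
            if (heap.get(old) != null) {
                moved.put(old, live.size());
                live.add(heap.get(old));
            }
        }
        for (Map.Entry<Integer, Integer> entry : idToAddress.entrySet()) {
            entry.setValue(moved.get(entry.getValue()));
        }
        heap.clear();
        heap.addAll(live);
    }
}
```

The cost of this design is one extra lookup per access, which may or may not 
be acceptable for the in-memory CAS.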

The main problem of XMI is that it is slow. 

A secondary problem is that it, being an XML format, does not support certain 
characters. Another secondary problem is that random access to data in the XMI 
file is not possible - it must be streamed.
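
As a concrete illustration of the character restriction: XML 1.0 only allows 
the code points listed in its Char production, so most control characters 
cannot be written to an XMI file at all. A small Java check along these lines 
(the class and method names are made up for this example) can find the first 
offending position in a document text before serialization:

```java
final class XmlCharCheck {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first char XML 1.0 cannot represent, or -1. */
    static int firstInvalidIndex(String text) {
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return -1;
    }
}
```

What to do with such characters (drop, replace, or escape them in an 
application-specific way) is a separate decision that this sketch does not 
make.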

> Also as Richard mentioned, a key part about maintaining a store would be to 
> also maintain a reliable addressing system for individual CASes and 
> annotations.  We are using OIDs  that we call Feature Structure ID (FSID). 
> Our current  FSIDs  are broken down into 4 levels: 
> CollectionID.CASID.AnnotatorID.AnnotationID, where CollectionID refers to a 
> collection of CASes that share some quality, the AnnotatorID identifies the 
> AE responsible for producing the annotations, and the Annotation ID 
> identifies the feature structure within a serialized CAS XMI. This provides 
> us the ability to perform fine / coarse grained CRUD operations within our 
> CAS store - we can insert/remove/update/delete an Annotation, create a CAS 
> from a list of FSIDS that only contain the feature structures that are 
> necessary for any Processing Element. 

The FSID format you suggest carries many semantics. I think an ID for a CAS and 
an ID which reliably identifies an FS within a CAS would be sufficient to start 
with. Both IDs should be independent with no particular defined concatenation 
scheme. If the other levels of IDs are required, it should be up to an 
application to define and use them. For example, I might want to use a URL as 
collection ID or not have one at all. The ID as you put it seems to imply that 
there is a special relation between an annotator and the annotations it 
generates, which may or may not be true or desired.

Are there additional requirements hidden in the FSID format? I could imagine:

- ability to get all FSes produced by a certain annotator across all CASes in 
all collections or in a certain collection

- ability to get all CASes in a collection

- ability to get all CASes

Where do you get the annotatorId from? I see no sensible way that the UIMA 
framework can provide such an ID. There is also a conflict potential. Consider 
if analysis engine A creates an FS and analysis engine B updates a primitive 
feature in that FS. Assuming that primitives do not get an FSID since they are 
not FSes, should the annotatorID of the FS be updated to B or should it remain 
A?

> As for API and Java implementations -  would a JDBC be sufficient? 

So far I have used JDBC only with SQL databases. I don't believe its API is 
well suited for dealing with CASes and FSes. E.g. a JDBC result set resembles 
tuples from a table, but when we work with FSes, we actually operate on an 
object graph. So the CAS provides more of a JPA-like access mechanism than a 
JDBC-like access mechanism. It appears to me that JDBC offers much more 
functionality than would be required for a CAS store. How about storage-backed 
implementations of FSIndex and maybe of CAS itself? JDBC also has the issue 
that the query strings are opaque to the compiler and to IDEs, meaning: no 
type safety and no refactoring support.

A while back, we discussed alternative access methods for the CAS in the 
uimaFIT project. uimaFIT provides convenience methods to access the in-memory 
CAS. Consider this:

for (Token t : JCasUtil.select(jcas, Token.class)) {
    // ...
}

The uimaFIT API currently doesn't support predicates on features, for example. 
We considered the UIMA Constraint API too complex to use and came up with an 
"SQLCAS" approach (named for some resemblance to SQL notation, not because an 
SQL backend or JDBC is used); it was still meant to be Java and type-safe. 
Steven Bethard did a prototype implementation which supports something like 
this:

    DocumentAnnotation document =
        CasQuery.from(this.jCas).select(DocumentAnnotation.class).single();
    Iterator<Sentence> sentences =
        CasQuery.from(this.jCas).select(Sentence.class).iterator();
    Collection<Token> tokens =
        CasQuery.from(this.jCas).select(Token.class).coveredBy(sentence);
    Token token =
        CasQuery.from(this.jCas).select(Token.class).matching(annotation).single();
    Chunk chunk = CasQuery.from(this.jCas).select(Chunk.class).zeroOrOne();

More discussion on this topic, different approaches/syntaxes, and a patch for 
uimaFIT can be found in the uimaFIT issue tracker [1].

If implementing an existing standard API were a requirement, JPA 2.0 (in 
particular the Criteria API) would probably provide a better level of 
abstraction than JDBC. JDO might be another (possibly better) alternative [2]. 
So far, I have had only a very brief rendezvous with JDO on the Google App 
Engine and quickly dropped it again in favor of JPA because I found the latter 
to be more suitable for dependency injection frameworks (the 
@PersistenceContext annotation can be used to inject an EntityManager into a 
class; there is no equivalent annotation for JDO). JPA seems to get more "love" 
from tools and vendors, but JDO might be conceptually a better fit, cf. the 
following comment on IBM developerWorks [3] and other interesting comments in 
the same thread:

  PinakiPoddar commented Jan 23 2011:

  I had the unique honor of participating in both JDO and 
  JPA Expert group. It is unique because the two groups have
  little overlap. The answers to your question on JDO or JPA
  -- the critical difference is about datastore technology
  these two specifications aim to support. JPA is limited to 
  only relational database as data storage, whereas JDO is
  agnostic to data store technology. 

  The unequal prominence of these two specifications that have
  similar goals reflects the prevalence of relational database
  as a storage of data as compared to other non-relational
  storage mechanics. However, resurgence of interest in
  non-relational storage systems such as NoSQL may highlight
  the importance of JDO's original aim to support multiple
  data storage technologies.

To put this in perspective, JPA is also supported on Google's App Engine 
datastore, which is backed by the BigTable NoSQL storage engine. Also, the CAS 
is quite similar in structure to an RDBMS. 

Cheers,

-- Richard

[1] https://code.google.com/p/uimafit/issues/detail?id=65
[2] http://db.apache.org/jdo/jdo_v_jpa.html
[3] 
https://www.ibm.com/developerworks/mydeveloperworks/blogs/jdo/entry/december_10_2010_2_12_am38?lang=en_us#comment-1295782222672

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD) 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
[email protected] 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------
