Re: Requirements / Wish List for CAS Store?

Marshall Schor Wed, 09 Jan 2013 20:27:50 -0800

On 1/9/2013 10:48 PM, Richard Eckart de Castilho wrote:
>> Based on what I've seen so far, I'd like to summarize what I think everyone 
>> was getting at with the functionality they'd like to see in a CAS Store, and 
>> then I'll ask more questions:
>>
>> Uses:
>>     Maintain independence of annotations with respect to independent 
>> pipelines
>>     Archival Storage
>>     Temporary storage between independent pipelines 
>>     Modify Type System 
>>     Contain Meta Data about CAS 
>>     Perform fine or coarse grained CRUD operations on a CAS 
>>         
>>
>> Functionality: 
>>     Insert / Delete complete CAS(es)
>>     Insert / Delete fragments of CAS(es) (individual SOFAs or FSes 
>> [annotations])
>>     Assemble CAS with SOFA and all / some / none feature structures 
>>         - This would help reduce the size of CAS to its necessary components 
>> before passing it to independent pipelines. It would also require the 
>> construction of valid CASes for use in Analysis Engines, complete with 
>> valid Views
>>          
>>     Update CASes within the store (i.e., inserting annotations):
>>         - This would allow for adding deltas from AEs 
>>
>> Some of these might seem redundant, but I hope they give a general overview. 
>>  Does this seem to summarize it well ?
> The U from CRUD is missing in the functionalities list: an ability to update 
> existing FSes, e.g. changing a feature value, seems to be missing. I believe 
> this is also not supported by the current delta-CAS support in UIMA.
The delta-CAS, in addition to sending back newly created feature structures,
also records modifications of any slots in feature structures that were
pre-existing, so I think this is supported.


-Marshall
>
> Other than that it appears like a good set to start with.
>
>> Richard and Erik also mentioned a key point: whether they should be stored 
>> as Serialized binary or XMI.  I understand that XMI is not as fast as 
>> binary, but the binary seems to provide little flexibility in making for 
>> fine grained queries (e.g. get all annotations of type Person), but the 
>> speed is an issue.  What would the trade off look like between sending 
>> fragments of CASes into a pipeline and reading in a full CAS from 
>> serialized binary, or read / write operations of fragments of  CAS within 
>> the Store?   
> The way you put it, it appears that XMI provides for fine grained queries 
> while the binary CAS does not. However, there is no support for fine grained 
> access in either format (deliberately ignoring that XMI is an XML format and 
> could be stored in an XML database providing for fine-grained access).
>
> As far as I understand it, the binary CAS format would even be more suitable 
> to implement a fine-grained access, as it could be memory-mapped and 
> annotations could be directly accessed on disk (random access) in very much 
> the same way as the heaps of the in-memory CAS are accessed. Instead of a 
> single file, one file per heap and per index may be more suitable, though. It 
> might be possible, if not even straight-forward, to do a file-based 
> implementation of the LowLevelCAS interface which uses memory-mapping. The 
> in-memory implementation of the LowLevelCAS has been folded into CASImpl 
> instead of being a separate class that could be plugged into CASImpl, which 
> would provide for different low-level storage backends of CASImpl. Somebody 
> more familiar with that part of the code would have to comment if refactoring 
> out the LowLevelCAS implementation from CASImpl and making it pluggable would 
> be feasible or not.
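
A minimal sketch of the memory-mapping idea above, using only java.nio. The class name MappedIntHeap and the flat layout of 32-bit slots are assumptions made for illustration; this is not UIMA's binary CAS format and it does not implement the LowLevelCAS interface.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical file-backed heap of 32-bit slots (not a UIMA class).
final class MappedIntHeap {
    private final MappedByteBuffer buffer;

    // Maps slotCount * 4 bytes of the file read/write.
    // The mapping remains valid after the channel is closed.
    MappedIntHeap(Path file, int slotCount) {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4L * slotCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Random access: read the slot at the given heap address.
    int get(int addr) {
        return buffer.getInt(addr * 4);
    }

    // Random access: overwrite the slot at the given heap address.
    void set(int addr, int value) {
        buffer.putInt(addr * 4, value);
    }

    // Forces pending changes out to the underlying file.
    void flush() {
        buffer.force();
    }
}
```

With one such file per heap, reading a feature value would be an offset lookup rather than a full deserialization, which is the property the paragraph above is after.
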
>
> The main problem of the binary CAS is that the type system is locked and 
> cannot be changed. It shares this fact with the in-memory CAS, in which the 
> type system likewise cannot be augmented after the CAS has been created and 
> locked. Marshall suggested some time back to relax this and allow compatible 
> changes to the type system to be made (new types and features) even after the 
> creation of the CAS.
>
> A secondary problem of the binary CAS is that deleted annotations do not get 
> physically removed, they are only removed from the indexes. This is another 
> fact it shares with the in-memory CAS. Some form of garbage collection and 
> IDs that remain stable across garbage collection runs would be useful, even 
> for the in-memory CAS.
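
One way to picture garbage collection with stable IDs is an indirection table from ID to heap address, so that compaction may move entries without changing their IDs. The sketch below is a toy under that assumption; CompactingStore and its String payloads are hypothetical and do not reflect how CASImpl manages its heaps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical store: stable IDs map to heap addresses, so compaction can move entries.
final class CompactingStore {
    private final List<String> heap = new ArrayList<>();          // null marks a deleted entry
    private final Map<Integer, Integer> idToAddr = new HashMap<>();
    private int nextId = 1;

    // Appends a (non-null) entry and returns its stable ID.
    int add(String fs) {
        heap.add(fs);
        idToAddr.put(nextId, heap.size() - 1);
        return nextId++;
    }

    // Logical delete: the slot is only marked, as with index removal in the CAS.
    void delete(int id) {
        Integer addr = idToAddr.remove(id);
        if (addr != null) {
            heap.set(addr, null);
        }
    }

    String get(int id) {
        Integer addr = idToAddr.get(id);
        return addr == null ? null : heap.get(addr);
    }

    int heapSize() {
        return heap.size();
    }

    // Physically drops deleted entries and rewrites addresses; IDs stay unchanged.
    void compact() {
        Map<Integer, Integer> oldToNew = new HashMap<>();
        List<String> live = new ArrayList<>();
        for (int addr = 0; addr < heap.size(); addr++) {
            String fs = heap.get(addr);
            if (fs != null) {
                oldToNew.put(addr, live.size());
                live.add(fs);
            }
        }
        heap.clear();
        heap.addAll(live);
        idToAddr.replaceAll((id, addr) -> oldToNew.get(addr));
    }
}
```
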
>
> The main problem of XMI is that it is slow. 
>
> A secondary problem is that it, being an XML format, does not support certain 
> characters. Another secondary problem is that random access to data in the 
> XMI file is not possible - it must be streamed.
>
>> Also as Richard mentioned, a key part about maintaining a store would be to 
>> also maintain a reliable addressing system for individual CASes and 
>> annotations.  We are using OIDs  that we call Feature Structure ID (FSID). 
>> Our current  FSIDs  are broken down into 4 levels: 
>> CollectionID.CASID.AnnotatorID.AnnotationID, where CollectionID refers to a 
>> collection of CASes that share some quality, the AnnotatorID identifies the 
>> AE responsible for producing the annotations, and the Annotation ID 
>> identifies the feature structure within a serialized CAS XMI. This provides 
>> us the ability to perform fine / coarse grained CRUD operations within our 
>> CAS store - we can insert/remove/update/delete an Annotation, create a CAS 
>> from a list of FSIDS that only contain the feature structures that are 
>> necessary for any Processing Element. 
> The FSID format you suggest has many semantics. I think an ID for a CAS and 
> an ID which reliably identifies an FS within a CAS would be sufficient to start 
> with. Both IDs should be independent with no particular defined concatenation 
> scheme. If the other levels of IDs are required, it should be up to an 
> application to define and use them. For example, I might want to use an URL 
> as collection ID or not have one at all. The ID as you put it seems to imply 
> that there is a special relation between an annotator and the annotations it 
> generates, which may or may not be true or desired.
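
The two-ID proposal above can be sketched as follows, with any further grouping left to the application. FsRef and CollectionIndex are hypothetical names chosen for this illustration, and the use of a String CAS ID and an int FS ID is an assumption, not something the thread settles.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical reference to one feature structure: two independent IDs,
// with no defined concatenation scheme between them.
final class FsRef {
    final String casId;
    final int fsId;

    FsRef(String casId, int fsId) {
        this.casId = Objects.requireNonNull(casId);
        this.fsId = fsId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FsRef)) {
            return false;
        }
        FsRef other = (FsRef) o;
        return fsId == other.fsId && casId.equals(other.casId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(casId, fsId);
    }
}

// Application-defined grouping kept outside the ID itself.
// The key can be a URL or anything else, or the index can be left unused.
final class CollectionIndex {
    private final Map<String, Set<String>> casIdsByCollection = new HashMap<>();

    void assign(String collectionKey, String casId) {
        casIdsByCollection
                .computeIfAbsent(collectionKey, k -> new LinkedHashSet<>())
                .add(casId);
    }

    // Returns all CAS IDs assigned to the collection, or an empty set.
    Set<String> casIds(String collectionKey) {
        return casIdsByCollection.getOrDefault(collectionKey, Collections.emptySet());
    }
}
```
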
>
> Are there additional requirements hidden in the FSID format? I could imagine:
>
> - ability to get all FSes produced by a certain annotator across all CASes in 
> all collections or in a certain collection
>
> - ability to get all CASes in a collection
>
> - ability to get all CASes
>
> Where do you get the annotatorId from? I see no sensible way that the UIMA 
> framework can provide such an ID. There is also a conflict potential. 
> Consider if analysis engine A creates an FS and analysis engine B updates a 
> primitive feature in that FS. Assuming that primitives do not get an FSID 
> since they are not FSes, should the annotatorID of the FS be updated to B or 
> should it remain A?
>
>> As for API and Java implementations -  would a JDBC be sufficient? 
> So far I have used JDBC only with SQL databases. I don't believe its API is 
> well suited for dealing with CASes and FSes. E.g. a JDBC result set resembles 
> tuples from a table, but when we work with FSes, we actually operate on an 
> object graph. So the CAS provides more of a JPA-like access mechanism than a 
> JDBC-like access mechanism. It appears to me that JDBC offers much more 
> functionality than would be required for a CAS store. How about 
> storage-backed implementations of FSIndex and maybe of CAS itself? JDBC also 
> has the issue that the query strings are opaque to the compiler and to 
> IDEs, meaning: no type safety and no refactoring support.
>
> A while back, we discussed alternative access methods for the CAS in the 
> uimaFIT project. uimaFIT provides convenience methods to access the in-memory 
> CAS. Consider this:
>
> for (Token t : JCasUtil.select(jcas, Token.class)) {
>  …
> }
>
> The uimaFIT API currently doesn't support predicates on features, for example. 
> We considered the UIMA Constraint API too complex to use and came up with an 
> "SQLCAS" approach (due to some resemblance to SQL notation, not due to an SQL 
> backend or JDBC being used) - it should still have been Java and type safe. 
> Steven Bethard did a prototype implementation which supports something like 
> this:
>
>     DocumentAnnotation document = CasQuery.from(this.jCas).select(DocumentAnnotation.class).single();
>     Iterator<Sentence> sentences = CasQuery.from(this.jCas).select(Sentence.class).iterator();
>     Collection<Token> tokens = CasQuery.from(this.jCas).select(Token.class).coveredBy(sentence);
>     Token token = CasQuery.from(this.jCas).select(Token.class).matching(annotation).single();
>     Chunk chunk = CasQuery.from(this.jCas).select(Chunk.class).zeroOrOne();
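
To make the type-safety point concrete, here is a self-contained sketch of a fluent, compiler-checked query with a predicate on a feature. It is not the CasQuery prototype or any uimaFIT API: TypedQuery, DemoSpan, DemoToken and DemoSentence are hypothetical stand-ins, and a plain List takes the place of the CAS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Stand-in annotation types; real code would use JCas cover classes instead.
class DemoSpan {
    final int begin;
    final int end;
    DemoSpan(int begin, int end) { this.begin = begin; this.end = end; }
}

final class DemoToken extends DemoSpan {
    final String pos;
    DemoToken(int begin, int end, String pos) { super(begin, end); this.pos = pos; }
}

final class DemoSentence extends DemoSpan {
    DemoSentence(int begin, int end) { super(begin, end); }
}

// Hypothetical fluent query: feature names and types are checked by the compiler.
final class TypedQuery<T extends DemoSpan> {
    private final List<T> items;

    private TypedQuery(List<T> items) { this.items = items; }

    // Keeps only the entries that are instances of the requested type.
    static <T extends DemoSpan> TypedQuery<T> select(List<? extends DemoSpan> all, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (DemoSpan s : all) {
            if (type.isInstance(s)) {
                out.add(type.cast(s));
            }
        }
        return new TypedQuery<>(out);
    }

    // Arbitrary predicate on the typed element, e.g. on a feature value.
    TypedQuery<T> where(Predicate<? super T> predicate) {
        List<T> out = new ArrayList<>();
        for (T item : items) {
            if (predicate.test(item)) {
                out.add(item);
            }
        }
        return new TypedQuery<>(out);
    }

    // Keeps elements whose offsets lie inside the covering span.
    TypedQuery<T> coveredBy(DemoSpan cover) {
        return where(s -> s.begin >= cover.begin && s.end <= cover.end);
    }

    List<T> list() { return items; }

    // Returns the only match, or fails if there is not exactly one.
    T single() {
        if (items.size() != 1) {
            throw new IllegalStateException("expected exactly one match, got " + items.size());
        }
        return items.get(0);
    }
}
```

Renaming the `pos` field in an IDE would update every `where(t -> t.pos...)` call, which is exactly what a query string passed through JDBC cannot offer.
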
>
> More discussion on this topic, different approaches/syntaxes, and a patch for 
> uimaFIT can be found in the uimaFIT issue tracker [1].
>
> If implementing an existing standard API was a requirement, JPA 2.0 (in 
> particular the criteria API) would probably provide a better level of 
> abstraction than JDBC. JDO might be another (possibly better) alternative 
> [2]. So far, I only had a very brief rendezvous with JDO on the Google App 
> Engine and quickly dropped it again in favor of JPA because I found the 
> latter to be more suitable for dependency injection frameworks 
> (@PersistenceContext annotation can be used to inject an EntityManager into a 
> class, no equivalent annotation for JDO). JPA seems to get more "love" from 
> tools and vendors, but JDO might be conceptually a better fit, cf. the 
> following comment on IBM developerWorks [3] and other interesting comments 
> in the same thread:
>
>   PinakiPoddar commented Jan 23 2011:
>
>   I had the unique honor of participating in both JDO and 
>   JPA Expert group. It is unique because the two groups have
>   little overlap. The answers to your question on JDO or JPA
>   -- the critical difference is about datastore technology
>   these two specifications aim to support. JPA is limited to 
>   only relational database as data storage, whereas JDO is
>   agnostic to data store technology. 
>  
>   The unequal prominence of these two specifications that have
>   similar goals reflects the prevalence of relational database
>   as a storage of data as compared to other non-relational
>   storage mechanics. However, resurgence of interest in
>   non-relational storage systems such as NoSQL may highlight
>   the importance of JDO's original aim to support multiple
>   data storage technologies.
>
> To put this in perspective, JPA is also supported on Google's App Engine 
> datastore, which is backed by the BigTable NoSQL storage engine. Also, the CAS 
> is quite similar in structure to an RDBMS. 
>
> Cheers,
>
> -- Richard
>
>
> [1] https://code.google.com/p/uimafit/issues/detail?id=65
> [2] http://db.apache.org/jdo/jdo_v_jpa.html
> [3] 
> https://www.ibm.com/developerworks/mydeveloperworks/blogs/jdo/entry/december_10_2010_2_12_am38?lang=en_us#comment-1295782222672
>
