On 1/9/2013 10:48 PM, Richard Eckart de Castilho wrote:
>> Based on what I've seen so far, I'd like to summarize what I think everyone
>> was getting at with the functionality they'd like to see in a CAS Store, and
>> then I'll ask more questions:
>>
>> Uses:
>> Maintain independence of annotations with respect to independent
>> pipelines
>> Archival Storage
>> Temporary storage between independent pipelines
>> Modify Type System
>> Contain Meta Data about CAS
>> Perform fine or coarse grained CRUD operations on a CAS
>>
>>
>> Functionality:
>> Insert / Delete complete CAS(es)
>> Insert / Delete fragments of CAS(es) (individual SOFAs or FSes
>> [annotations])
>> Assemble CAS with SOFA and all / some / none feature structures
>> - This would help reduce the size of CAS to its necessary components
>> before passing it to independent pipelines. It would also require the
>> construction of valid CASes for use in Analytic Engines, complete with
>> valid Views
>>
>> Update CASes within the store (i.e., inserting annotations):
>> - This would allow for adding deltas from AEs
>>
>> Some of these might seem redundant, but I hope they give a general overview.
>> Does this seem to summarize it well ?
> The U from CRUD is missing in the functionalities list: an ability to update
> existing FSes, e.g. changing a feature value, seems to be missing. I believe
> this is also not supported by the current delta-CAS support in UIMA.
The delta-CAS, in addition to sending back newly created feature structures,
also records modifications of any slots in feature structures that were
pre-existing, so I think this is supported.
-Marshall
>
> Other than that it appears like a good set to start with.
>
>> Richard and Erik also mentioned a key point: whether they should be stored
>> as serialized binary or XMI. I understand that XMI is not as fast as
>> binary, but binary seems to provide little flexibility for
>> fine-grained queries (e.g. get all annotations of type Person), yet
>> speed is an issue. What would the trade-off look like between sending
>> fragments of CASes into a pipeline and reading in a full CAS from
>> serialized binary, or read / write operations on fragments of a CAS within
>> the Store?
> The way you put it, it appears that XMI provides for fine grained queries
> while the binary CAS does not. However, there is no support for fine-grained
> access in either format (deliberately ignoring that XMI is an XML format and
> could be stored in an XML database providing for fine-grained access).
>
> As far as I understand it, the binary CAS format would be even more suitable
> for implementing fine-grained access, as it could be memory-mapped and
> annotations could be directly accessed on disk (random access) in very much
> the same way as the heaps of the in-memory CAS are accessed. Instead of a
> single file, one file per heap and per index may be more suitable, though. It
> might be possible, if not even straightforward, to do a file-based
> implementation of the LowLevelCAS interface which uses memory-mapping. The
> in-memory implementation of the LowLevelCAS has been folded into CASImpl
> instead of being a separate class that could be plugged into CASImpl, which
> would provide for different low-level storage backends of CASImpl. Somebody
> more familiar with that part of the code would have to comment if refactoring
> out the LowLevelCAS implementation from CASImpl and making it pluggable would
> be feasible or not.
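To make the memory-mapping idea concrete, here is a minimal sketch of a
file-backed heap of 32-bit cells with random access. It uses only java.nio
and is not UIMA code: the class MappedIntHeap and its methods are invented
for illustration, and it assumes one heap per file with a fixed cell count.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical file-backed int heap; names are illustrative, not part of UIMA.
final class MappedIntHeap implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer bytes;
    private final IntBuffer cells;

    MappedIntHeap(Path file, int cellCount) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        long byteSize = (long) cellCount * Integer.BYTES;
        // Grow the file to the required size before mapping it.
        if (channel.size() < byteSize) {
            channel.write(ByteBuffer.allocate(1), byteSize - 1);
        }
        bytes = channel.map(FileChannel.MapMode.READ_WRITE, 0, byteSize);
        cells = bytes.asIntBuffer();
    }

    // Random access by cell address, with no streaming of the file.
    int get(int address) {
        return cells.get(address);
    }

    void set(int address, int value) {
        cells.put(address, value);
    }

    @Override
    public void close() throws IOException {
        bytes.force();   // flush modified pages to the file
        channel.close();
    }
}
```

Opening the same file again and calling get(address) returns the value
written earlier, without reading the rest of the file. A real backend would
also need the string heap, the indexes, and a type system description.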
>
> The main problem of the binary CAS is that the type system is locked and
> cannot be changed. It shares this fact with the in-memory CAS, in which the
> type system likewise cannot be augmented after the CAS has been created and
> locked. Marshall suggested some time back to relax this and allow compatible
> changes to the type system to be made (new types and features) even after the
> creation of the CAS.
>
> A secondary problem of the binary CAS is that deleted annotations do not get
> physically removed; they are only removed from the indexes. This is another
> fact it shares with the in-memory CAS. Some form of garbage collection and
> IDs that remain stable across garbage collection runs would be useful, even
> for the in-memory CAS.
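A toy sketch of what "garbage collection with stable IDs" could look like:
deleting only drops the entry from the lookup table, and a later compaction
pass moves the surviving entries while their IDs stay the same. The class
CompactingStore is hypothetical and uses strings as stand-ins for feature
structures; it says nothing about how the UIMA heaps are actually laid out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical store: stable IDs map to physical addresses that may move.
final class CompactingStore {
    private List<String> heap = new ArrayList<>();   // payload per address
    private List<Long> owner = new ArrayList<>();    // stable ID per address, null if deleted
    private final Map<Long, Integer> addressOf = new HashMap<>();
    private long nextId = 1;

    long add(String payload) {
        long id = nextId++;
        heap.add(payload);
        owner.add(id);
        addressOf.put(id, heap.size() - 1);
        return id;
    }

    // Logical delete: the payload stays on the heap until compact() runs.
    void delete(long id) {
        Integer address = addressOf.remove(id);
        if (address != null) {
            owner.set(address, null);
        }
    }

    String get(long id) {
        Integer address = addressOf.get(id);
        return address == null ? null : heap.get(address);
    }

    int physicalSize() {
        return heap.size();
    }

    // Copy live entries to a fresh heap and repoint their IDs.
    void compact() {
        List<String> newHeap = new ArrayList<>();
        List<Long> newOwner = new ArrayList<>();
        for (int address = 0; address < heap.size(); address++) {
            Long id = owner.get(address);
            if (id == null) {
                continue;   // deleted entry, drop it
            }
            newHeap.add(heap.get(address));
            newOwner.add(id);
            addressOf.put(id, newHeap.size() - 1);
        }
        heap = newHeap;
        owner = newOwner;
    }
}
```

The extra indirection (ID to address) is the price for IDs that survive
compaction; callers never hold a raw address.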
>
> The main problem of XMI is that it is slow.
>
> A secondary problem is that it, being an XML format, does not support certain
> characters. Another secondary problem is that random access to data in the
> XMI file is not possible - it must be streamed.
>
>> Also as Richard mentioned, a key part about maintaining a store would be to
>> also maintain a reliable addressing system for individual CASes and
>> annotations. We are using OIDs that we call Feature Structure ID (FSID).
>> Our current FSIDs are broken down into 4 levels:
>> CollectionID.CASID.AnnotatorID.AnnotationID, where CollectionID refers to a
>> collection of CASes that share some quality, the AnnotatorID identifies the
>> AE responsible for producing the annotations, and the Annotation ID
>> identifies the feature structure within a serialized CAS XMI. This provides
>> us the ability to perform fine / coarse grained CRUD operations within our
>> CAS store - we can insert/remove/update/delete an Annotation, or create a CAS
>> from a list of FSIDs that contains only the feature structures that are
>> necessary for any Processing Element.
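For illustration, a minimal sketch of parsing such a four-level FSID and
using it for a coarse-grained selection. The class Fsid, its field names,
and the assumption that every level is a non-empty string without dots are
mine; the actual FSID implementation is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class for CollectionID.CASID.AnnotatorID.AnnotationID.
final class Fsid {
    final String collectionId;
    final String casId;
    final String annotatorId;
    final String annotationId;

    Fsid(String collectionId, String casId, String annotatorId, String annotationId) {
        this.collectionId = collectionId;
        this.casId = casId;
        this.annotatorId = annotatorId;
        this.annotationId = annotationId;
    }

    // Assumes exactly four non-empty levels separated by dots.
    static Fsid parse(String text) {
        String[] parts = text.split("\\.", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 dot-separated levels: " + text);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("empty level in: " + text);
            }
        }
        return new Fsid(parts[0], parts[1], parts[2], parts[3]);
    }

    // Coarse-grained selection: all FSIDs that belong to one CAS.
    static List<Fsid> inCas(List<Fsid> all, String collectionId, String casId) {
        List<Fsid> result = new ArrayList<>();
        for (Fsid id : all) {
            if (id.collectionId.equals(collectionId) && id.casId.equals(casId)) {
                result.add(id);
            }
        }
        return result;
    }

    @Override
    public String toString() {
        return collectionId + "." + casId + "." + annotatorId + "." + annotationId;
    }
}
```

Note that a level containing a dot (a URL used as collection ID, say) does
not survive this split, so a dotted scheme constrains what each level may
contain.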
> The FSID format you suggest carries a lot of semantics. I think an ID for a
> CAS and an ID which reliably identifies an FS within a CAS would be
> sufficient to start with. Both IDs should be independent with no particular
> defined concatenation scheme. If the other levels of IDs are required, it
> should be up to an application to define and use them. For example, I might
> want to use a URL
> as collection ID or not have one at all. The ID as you put it seems to imply
> that there is a special relation between an annotator and the annotations it
> generates, which may or may not be true or desired.
>
> Are there additional requirements hidden in the FSID format? I could imagine:
>
> - ability to get all FSes produced by a certain annotator across all CASes in
> all collections or in a certain collection
>
> - ability to get all CASes in a collection
>
> - ability to get all CASes
>
> Where do you get the annotatorId from? I see no sensible way that the UIMA
> framework can provide such an ID. There is also potential for conflict.
> Consider if analysis engine A creates an FS and analysis engine B updates a
> primitive feature in that FS. Assuming that primitives do not get an FSID
> since they are not FSes, should the annotatorID of the FS be updated to B or
> should it remain A?
>
>> As for API and Java implementations - would a JDBC be sufficient?
> So far I have used JDBC only with SQL databases. I don't believe its API is
> well suited for dealing with CASes and FSes. E.g. a JDBC result set resembles
> tuples from a table, but when we work with FSes, we actually operate on an
> object graph. So the CAS provides more of a JPA-like access mechanism than a
> JDBC-like access mechanism. It appears to me that JDBC offers much more
> functionality than would be required for a CAS store. How about
> storage-backed implementations of FSIndex and maybe of CAS itself? JDBC also
> has the issue that the query strings are opaque to the compiler and to
> IDEs, meaning: no type safety and no refactoring support.
>
> A while back, we discussed alternative access methods for the CAS in the
> uimaFIT project. uimaFIT provides convenience methods to access the in-memory
> CAS. Consider this:
>
> for (Token t : JCasUtil.select(jcas, Token.class)) {
> …
> }
>
> The uimaFIT API currently doesn't support predicates on features, for example.
> We considered the UIMA Constraint API too complex to use and came up with an
> "SQLCAS" approach (due to some resemblance to SQL notation, not due to an SQL
> backend or JDBC being used) - it was still meant to be Java and type-safe.
> Steven Bethard did a prototype implementation which supports something like
> this:
>
> DocumentAnnotation document =
>     CasQuery.from(this.jCas).select(DocumentAnnotation.class).single();
> Iterator<Sentence> sentences =
>     CasQuery.from(this.jCas).select(Sentence.class).iterator();
> Collection<Token> tokens =
>     CasQuery.from(this.jCas).select(Token.class).coveredBy(sentence);
> Token token =
>     CasQuery.from(this.jCas).select(Token.class).matching(annotation).single();
> Chunk chunk = CasQuery.from(this.jCas).select(Chunk.class).zeroOrOne();
>
> More discussion on this topic, different approaches/syntaxes, and a patch for
> uimaFIT can be found in the uimaFIT issue tracker [1].
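To show the general shape of such a fluent, type-safe query, here is a small
self-contained sketch over a plain Java collection. It is not the prototype
from the issue tracker and does not touch the CAS: TypedQuery, where(), and
list() are names I made up, and only single() and zeroOrOne() borrow their
names from the snippet above. A feature predicate is expressed as an ordinary
java.util.function.Predicate, so the compiler and the IDE can check it.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical fluent query over an in-memory collection of mixed objects.
final class TypedQuery<T> {
    private final List<T> matches;

    private TypedQuery(List<T> matches) {
        this.matches = matches;
    }

    // Keep only the elements that are instances of the requested type.
    static <T> TypedQuery<T> select(Collection<?> source, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (Object o : source) {
            if (type.isInstance(o)) {
                out.add(type.cast(o));
            }
        }
        return new TypedQuery<>(out);
    }

    // Narrow the result with a compile-time checked predicate.
    TypedQuery<T> where(Predicate<? super T> predicate) {
        List<T> out = new ArrayList<>();
        for (T t : matches) {
            if (predicate.test(t)) {
                out.add(t);
            }
        }
        return new TypedQuery<>(out);
    }

    List<T> list() {
        return new ArrayList<>(matches);
    }

    // Exactly one match, otherwise an exception.
    T single() {
        if (matches.size() != 1) {
            throw new IllegalStateException("expected one match, found " + matches.size());
        }
        return matches.get(0);
    }

    // At most one match, otherwise an exception.
    Optional<T> zeroOrOne() {
        if (matches.size() > 1) {
            throw new IllegalStateException("expected at most one match, found " + matches.size());
        }
        return matches.isEmpty() ? Optional.empty() : Optional.of(matches.get(0));
    }
}
```

Usage looks like TypedQuery.select(items, String.class).where(s ->
s.startsWith("b")).single(). Renaming a method used in the predicate is then
an ordinary refactoring, which is exactly what a query string cannot offer.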
>
> If implementing an existing standard API was a requirement, JPA 2.0 (in
> particular the criteria API) would probably provide a better level of
> abstraction than JDBC. JDO might be another (possibly better) alternative
> [2]. So far, I only had a very brief rendezvous with JDO on the Google App
> Engine and quickly dropped it again in favor of JPA because I found the
> latter to be more suitable for dependency injection frameworks (the
> @PersistenceContext annotation can be used to inject an EntityManager into a
> class; there is no equivalent annotation for JDO). JPA seems to get more
> "love" from tools and vendors, but JDO might be conceptually a better fit,
> cf. the following comment on IBM developerWorks [3] and other interesting
> comments
> in the same thread:
>
> PinakiPoddar commented Jan 23 2011:
>
> I had the unique honor of participating in both JDO and
> JPA Expert group. It is unique because the two groups have
> little overlap. The answers to your question on JDO or JPA
> -- the critical difference is about datastore technology
> these two specifications aim to support. JPA is limited to
> only relational database as data storage, whereas JDO is
> agnostic to data store technology.
>
> The unequal prominence of these two specifications that have
> similar goals reflects the prevalence of relational database
> as a storage of data as compared to other non-relational
> storage mechanics. However, resurgence of interest in
> non-relational storage systems such as NoSQL may highlight
> the importance of JDO's original aim to support multiple
> data storage technologies.
>
> To put this in perspective, JPA is also supported on Google's App Engine
> datastore, which is backed by the BigTable NoSQL storage engine. Also, the
> CAS is quite similar in structure to an RDBMS.
>
> Cheers,
>
> -- Richard
>
>
> [1] https://code.google.com/p/uimafit/issues/detail?id=65
> [2] http://db.apache.org/jdo/jdo_v_jpa.html
> [3] https://www.ibm.com/developerworks/mydeveloperworks/blogs/jdo/entry/december_10_2010_2_12_am38?lang=en_us#comment-1295782222672
>