> The way you put it, it appears that XMI provides for fine grained queries
> while the binary CAS does not. However, there is no support for fine grained
> access in either format (deliberately ignoring that XMI is an XML format and
> could be stored in an XML database providing for fine-grained access).
> Are there additional requirements hidden in the FSID format? I could imagine:
> - ability to get all FSes produced by a certain annotator across all CASes in
>   all collections or in a certain collection
> - ability to get all CASes in a collection
> - ability to get all CASes
> Where do you get the annotatorId from? I see no sensible way that the UIMA
> framework can provide such an ID. There is also a conflict potential.
> Consider if analysis engine A creates an FS and analysis engine B updates a
> primitive feature in that FS. Assuming that primitives do not get an FSID
> since they are not FSes, should the annotatorID of the FS be updated to B or
> should it remain A?

We currently assign our own annotator IDs. Annotator IDs are meant to distinguish between different UIMA applications, not individual AEs in an aggregate or within the same pipeline. I meant to use the term AnalyticID, which is more precise, so I'll use that from now on.

In the strictest sense, you can imagine different AnalyticIDs for different PEARs written by separate developers for separate annotators, but run on the same CAS. Now, a conflict might occur if they both had the same type system and annotated the same CAS. This would result in a duplicate annotation for that type, but not in a duplicate FS, because each FS would have a different FSID associated with it. So, if only one UIMA application is processing the CAS, the AnalyticID can remain stable throughout operations, perhaps with a default value if only one UIMA application is run.

The benefit of having a CAS store goes beyond archiving information: it also lets us track a CAS's trajectory through analytics. It allows analytics to be developed independently of each other. When a new analytic is developed (and here, I mean a new PEAR that will most likely run in a new JVM), we can run it on an old collection of CASes.
If an analytic is small and only needs one or two objects from the CAS, then we can reduce message size by retrieving only those objects that are necessary. FSIDs and XMI serialization allow us to do this. With FSIDs, it is possible to query for a particular element, for groups of elements, or even across multiple CASes. Some queries are more complex than others (like getting all annotations from a particular annotator across CASes) but still manageable.

I'll try to illustrate with an example from a hypothetical deserialized CAS. The following CAS was queried for an fsid with a Collection ID of 3, an artifact ID of 15, and an AnalyticID of 6000. The 6000 analytic looked for sentences like "lvef of 30%" and annotated the sentence and the value of the lvef (ignore for now the cop element; that is an element we use to track provenance):

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
             xmlns:cas="http:///uima/cas.ecore"
             xmlns:cdts="http:///org/test/health/cdts.ecore"
             xmlns:tcas="http:///uima/tcas.ecore"
             xmi:version="2.0">
      <cdts:LVEF begin="449" value="20-25%" cop=".3.15.1000.1" end="490" fsid=".3.15.6000.1" sofa="1" xmi:id="13"/>
    </xmi:XMI>

This CAS fragment is given what we call a "transient view" or projection during a preprocessing step before running through a PEAR in UIMAj (to which we externally assign an AnalyticID of 6003). That PEAR looks for the value of the LVEF, maps it to a term, and filters for only new objects in the CAS. The result is then put back into the store, where the store writes in a new fsid:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
             xmlns:cas="http:///uima/cas.ecore"
             xmlns:cdts="http:///org/test/health/cdts.ecore"
             xmlns:tcas="http:///uima/tcas.ecore"
             xmi:version="2.0">
      <cdts:SeverelyDepressedLVEF begin="449" time="2001-12-31T12:00:00" value="20-25%" cop=".3.15.6000.1" end="490" fsid=".3.15.6003.1" sofa="1" xmi:id="13"/>
    </xmi:XMI>

This is what I mean by fine
grained queries using FSID. We can also imagine a coarse query for Collection 3, CAS 15 (fsid like '.3.15.%'), which will produce a full CAS.

Does this answer some of your questions?

-----Richard Eckart de Castilho <[email protected]> wrote: -----
To: "<[email protected]>" <[email protected]>
From: Richard Eckart de Castilho <[email protected]>
Date: 01/09/2013 07:49PM
Subject: Re: Requirements / Wish List for CAS Store?

> Based on what I've seen so far, I'd like to summarize what I think everyone
> was getting at with the functionality they'd like to see in a CAS Store, and
> then I'll ask more questions:
>
> Uses:
> Maintain independence of annotations with respect to independent pipelines
> Archival Storage
> Temporary storage between independent pipelines
> Modify Type System
> Contain Meta Data about CAS
> Perform fine or coarse grained CRUD operations on a CAS
>
>
> Functionality:
> Insert / Delete complete CAS(es)
> Insert / Delete fragments of CAS(es) (individual SOFAs or FSes
> [annotations])
> Assemble CAS with SOFA and all / some / none feature structures
> - This would help reduce the size of CAS to its necessary components
> before passing it to independent pipelines. It would also require the
> construction of valid CASes for use in Analytic Engines, complete with valid
> Views
>
> Update CASes within the store (i.e, inserting annotations):
> - This would allow for adding deltas from AEs
>
> Some of these might seem redundant, but I hope they give a general overview.
> Does this seem to summarize it well ?

The U from CRUD is missing from the functionality list: there is no ability to update existing FSes, e.g. by changing a feature value. I believe this is also not supported by the current delta-CAS support in UIMA. Other than that, it appears to be a good set to start with.

> Richard and Erik also mentioned a key point: whether they should be stored as
> Serialized binary or XMI.
> I understand that XMI is not as fast as binary,
> but the binary seems to provide little flexibility in making for fine grained
> queries (e.g. get all annotations of type Person), but the speed is an issue.
> What would the trade off look like between sending fragments of CASes into
> a pipeline and reading in a full CAS from serialized binary, or read / write
> operations of fragments of CAS within the Store?

The way you put it, it appears that XMI provides for fine grained queries while the binary CAS does not. However, there is no support for fine grained access in either format (deliberately ignoring that XMI is an XML format and could be stored in an XML database providing for fine-grained access).

As far as I understand it, the binary CAS format would be even more suitable for implementing fine-grained access, as it could be memory-mapped and annotations could be accessed directly on disk (random access) in very much the same way as the heaps of the in-memory CAS are accessed. Instead of a single file, one file per heap and per index may be more suitable, though. It might be possible, if not even straightforward, to do a file-based implementation of the LowLevelCAS interface which uses memory-mapping. The in-memory implementation of the LowLevelCAS has been folded into CASImpl instead of being a separate class that could be plugged into CASImpl, which would have allowed for different low-level storage backends of CASImpl. Somebody more familiar with that part of the code would have to comment on whether refactoring the LowLevelCAS implementation out of CASImpl and making it pluggable would be feasible.

The main problem of the binary CAS is that the type system is locked and cannot be changed. It shares this fact with the in-memory CAS, in which the type system likewise cannot be augmented after the CAS has been created and locked.
Marshal suggested some time back to relax this and allow compatible changes to the type system (new types and features) even after the creation of the CAS.

A secondary problem of the binary CAS is that deleted annotations do not get physically removed; they are only removed from the indexes. This is another fact it shares with the in-memory CAS. Some form of garbage collection and IDs that remain stable across garbage collection runs would be useful, even for the in-memory CAS.

The main problem of XMI is that it is slow. A secondary problem is that, being an XML format, it does not support certain characters. Another secondary problem is that random access to data in the XMI file is not possible: it must be streamed.

> Also as Richard mentioned, a key part about maintaining a store would be to
> also maintain a reliable addressing system for individual CASes and
> annotations. We are using OIDs that we call Feature Structure ID (FSID).
> Our current FSIDs are broken down into 4 levels:
> CollectionID.CASID.AnnotatorID.AnnotationID, where CollectionID refers to a
> collection of CASes that share some quality, the AnnotatorID identifies the
> AE responsible for producing the annotations, and the Annotation ID
> identifies the feature structure within a serialized CAS XMI. This provides
> us the ability to perform fine / coarse grained CRUD operations within our
> CAS store - we can insert/remove/update/delete an Annotation, create a CAS
> from a list of FSIDS that only contain the feature structures that are
> necessary for any Processing Element.

The FSID format you suggest carries a lot of semantics. I think an ID for a CAS and an ID which reliably identifies an FS within a CAS would be sufficient to start with. Both IDs should be independent, with no particular defined concatenation scheme. If the other levels of IDs are required, it should be up to an application to define and use them.
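
If an application does define such a composite ID itself, handling it could look roughly like the sketch below. It assumes the four-level form quoted above with a leading dot and integer levels (as in ".3.15.6000.1"). The class FsidSketch and its methods are hypothetical names made up for illustration; nothing here is an existing UIMA API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of UIMA: an application-defined composite ID
// of the form ".collection.cas.analytic.annotation", e.g. ".3.15.6000.1".
final class FsidSketch {
    final long collectionId;
    final long casId;
    final long analyticId;
    final long annotationId;

    private FsidSketch(long collectionId, long casId, long analyticId, long annotationId) {
        this.collectionId = collectionId;
        this.casId = casId;
        this.analyticId = analyticId;
        this.annotationId = annotationId;
    }

    // Parses a string such as ".3.15.6000.1" into its four levels.
    // Throws IllegalArgumentException for any other shape.
    static FsidSketch parse(String fsid) {
        if (fsid == null || !fsid.startsWith(".")) {
            throw new IllegalArgumentException("FSID must start with '.': " + fsid);
        }
        String[] parts = fsid.substring(1).split("\\.", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("FSID must have 4 levels: " + fsid);
        }
        return new FsidSketch(
                Long.parseLong(parts[0]), Long.parseLong(parts[1]),
                Long.parseLong(parts[2]), Long.parseLong(parts[3]));
    }

    // Coarse selection: every FSID that belongs to one CAS in one collection.
    // This is an in-memory analogue of a query like "fsid like '.3.15.%'".
    static List<String> selectByCas(List<String> fsids, long collectionId, long casId) {
        List<String> result = new ArrayList<>();
        for (String s : fsids) {
            FsidSketch id = parse(s);
            if (id.collectionId == collectionId && id.casId == casId) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList(".3.15.1000.1", ".3.15.6000.1", ".3.16.6000.1");
        System.out.println(selectByCas(stored, 3, 15));
    }
}
```

Comparing parsed levels rather than raw string prefixes avoids a prefix such as ".3.15" accidentally matching ".3.150"; a real store would of course push this selection down into its query layer.
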
For example, I might want to use a URL as collection ID or not have one at all. The ID as you put it seems to imply that there is a special relation between an annotator and the annotations it generates, which may or may not be true or desired.

Are there additional requirements hidden in the FSID format? I could imagine:

- ability to get all FSes produced by a certain annotator across all CASes in all collections or in a certain collection
- ability to get all CASes in a collection
- ability to get all CASes

Where do you get the annotatorId from? I see no sensible way that the UIMA framework can provide such an ID. There is also a conflict potential. Consider if analysis engine A creates an FS and analysis engine B updates a primitive feature in that FS. Assuming that primitives do not get an FSID since they are not FSes, should the annotatorID of the FS be updated to B or should it remain A?

> As for API and Java implementations - would a JDBC be sufficient?

So far I have used JDBC only with SQL databases. I don't believe its API is well suited for dealing with CASes and FSes. E.g. a JDBC result set resembles tuples from a table, but when we work with FSes, we actually operate on an object graph. So the CAS provides more of a JPA-like access mechanism than a JDBC-like access mechanism. It appears to me that JDBC offers much more functionality than would be required for a CAS store. How about storage-backed implementations of FSIndex and maybe of CAS itself?

JDBC also has the issue that the query strings are opaque to the compiler and to IDEs, meaning: no type safety and no refactoring support.

A while back, we discussed alternative access methods for the CAS in the uimaFIT project. uimaFIT provides convenience methods to access the in-memory CAS. Consider this:

    for (Token t : JCasUtil.select(jcas, Token.class)) {
      …
    }

The uimaFIT API currently doesn't support predicates on features, for example.
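
To make the missing piece concrete, here is a rough sketch of what a typed select with a predicate on a feature could look like. It is plain Java with stand-in classes: Annotation, Token, Sentence and selectWhere are all hypothetical and deliberately not the uimaFIT or UIMA API. It only illustrates the shape of such a call, which stays type safe and visible to the compiler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class PredicateSelectSketch {
    // Stand-in for an annotation base type (hypothetical, not org.apache.uima).
    static class Annotation {
        final int begin;
        final int end;
        Annotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Stand-in token type with a single feature, the part-of-speech tag.
    static class Token extends Annotation {
        final String pos;
        Token(int begin, int end, String pos) {
            super(begin, end);
            this.pos = pos;
        }
    }

    static class Sentence extends Annotation {
        Sentence(int begin, int end) {
            super(begin, end);
        }
    }

    // Returns all annotations of the requested type that satisfy the
    // condition, in the order in which they appear in the given list.
    static <T extends Annotation> List<T> selectWhere(
            List<? extends Annotation> index, Class<T> type, Predicate<? super T> condition) {
        List<T> result = new ArrayList<>();
        for (Annotation a : index) {
            if (type.isInstance(a)) {
                T typed = type.cast(a);
                if (condition.test(typed)) {
                    result.add(typed);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Annotation> index = Arrays.asList(
                new Sentence(0, 11), new Token(0, 5, "NN"), new Token(6, 11, "VB"));
        // The predicate is ordinary Java, so the compiler checks the feature access.
        for (Token tok : selectWhere(index, Token.class, x -> x.pos.equals("NN"))) {
            System.out.println(tok.begin + "-" + tok.end + " " + tok.pos);
        }
    }
}
```

Because the condition is a lambda over the concrete type, renaming the pos field in an IDE would update the query as well, which is exactly what string-based queries cannot offer.
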
We considered the UIMA Constraint API too complex to use and came up with an "SQLCAS" approach (due to some resemblance to SQL notation, not due to an SQL backend or JDBC being used) - it should still have been Java and type safe. Steven Bethard did a prototype implementation which supports something like this:

    DocumentAnnotation document = CasQuery.from(this.jCas).select(DocumentAnnotation.class).single();
    Iterator<Sentence> sentences = CasQuery.from(this.jCas).select(Sentence.class).iterator();
    Collection<Token> tokens = CasQuery.from(this.jCas).select(Token.class).coveredBy(sentence);
    Token token = CasQuery.from(this.jCas).select(Token.class).matching(annotation).single();
    Chunk chunk = CasQuery.from(this.jCas).select(Chunk.class).zeroOrOne();

More discussion on this topic, different approaches/syntaxes, and a patch for uimaFIT can be found in the uimaFIT issue tracker [1].

If implementing an existing standard API was a requirement, JPA 2.0 (in particular the criteria API) would probably provide a better level of abstraction than JDBC. JDO might be another (possibly better) alternative [2]. So far, I only had a very brief rendezvous with JDO on the Google App Engine and quickly dropped it again in favor of JPA because I found the latter to be more suitable for dependency injection frameworks (the @PersistenceContext annotation can be used to inject an EntityManager into a class; there is no equivalent annotation for JDO). JPA seems to get more "love" from tools and vendors, but JDO might be conceptually a better fit, cf. the following comment on IBM developerWorks [3] and other interesting comments in the same thread:

PinakiPoddar commented Jan 23 2011:

I had the unique honor of participating in both JDO and JPA Expert group. It is unique because the two groups have little overlap. The answers to your question on JDO or JPA -- the critical difference is about datastore technology these two specifications aim to support.
JPA is limited to only relational database as data storage, whereas JDO is agnostic to data store technology. The unequal prominence of these two specifications that have similar goals reflects the prevalence of relational database as a storage of data as compared to other non-relational storage mechanics. However, resurgence of interest in non-relational storage systems such as NoSQL may highlight the importance of JDO's original aim to support multiple data storage technologies.

To put this in perspective, JPA is also supported on Google's App Engine datastore, which is backed by the BigTable NoSQL storage engine. Also, the CAS is quite similar in structure to an RDBMS.

Cheers,

-- Richard

[1] https://code.google.com/p/uimafit/issues/detail?id=65
[2] http://db.apache.org/jdo/jdo_v_jpa.html
[3] https://www.ibm.com/developerworks/mydeveloperworks/blogs/jdo/entry/december_10_2010_2_12_am38?lang=en_us#comment-1295782222672

--
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD)
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
[email protected]
www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC)
www.werc.tu-darmstadt.de
-------------------------------------------------------------------
