> Based on what I've seen so far, I'd like to summarize what I think everyone
> was getting at with the functionality they'd like to see in a CAS Store, and
> then I'll ask more questions:
>
> Uses:
> Maintain independence of annotations with respect to independent pipelines
> Archival Storage
> Temporary storage between independent pipelines
> Modify Type System
> Contain Metadata about a CAS
> Perform fine or coarse grained CRUD operations on a CAS
>
>
> Functionality:
> Insert / Delete complete CAS(es)
> Insert / Delete fragments of CAS(es) (individual SOFAs or FSes
> [annotations])
> Assemble CAS with SOFA and all / some / none feature structures
> - This would help reduce the size of CAS to its necessary components
> before passing it to independent pipelines. It would also require the
> construction of valid CASes for use in Analytic Engines, complete with valid
> Views
>
> Update CASes within the store (i.e., inserting annotations):
> - This would allow for adding deltas from AEs
>
> Some of these might seem redundant, but I hope they give a general overview.
> Does this seem to summarize it well?
The U from CRUD is missing from the functionality list: the ability to update
existing FSes, e.g. changing a feature value. I believe this is also not
supported by the current delta-CAS support in UIMA.
Other than that it appears like a good set to start with.
> Richard and Erik also mentioned a key point: whether CASes should be stored
> as serialized binary or XMI. I understand that XMI is not as fast as binary,
> but binary seems to provide little flexibility for fine-grained queries
> (e.g. get all annotations of type Person), and speed is an issue.
> What would the trade-off look like between sending fragments of CASes into
> a pipeline and reading in a full CAS from serialized binary, or read / write
> operations on fragments of a CAS within the Store?
The way you put it, it appears that XMI provides for fine grained queries while
the binary CAS does not. However, there is no support for fine grained access
in either format (deliberately ignoring that XMI is an XML format and could be
stored in an XML database providing for fine-grained access).
As far as I understand it, the binary CAS format would actually be more
suitable for implementing fine-grained access, as it could be memory-mapped and
annotations could be directly accessed on disk (random access) in very much the
same way as the heaps of the in-memory CAS are accessed. Instead of a single
file, one file
per heap and per index may be more suitable, though. It might be possible, if
not even straightforward, to do a file-based implementation of the LowLevelCAS
interface which uses memory-mapping. The in-memory implementation of the
LowLevelCAS has been folded into CASImpl instead of being a separate class that
could be plugged in, which would have provided for different low-level storage
backends of CASImpl. Somebody more familiar with that part of the code would
have to comment on whether refactoring the LowLevelCAS implementation out of
CASImpl and making it pluggable would be feasible or not.
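To illustrate the memory-mapping idea, here is a minimal, self-contained sketch
(plain java.nio, no UIMA code; the "heap slots" and their meaning are invented
for illustration) of how an int heap persisted to a file could be randomly
accessed on disk without streaming the whole file:

```java
import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedHeapSketch {
    public static void main(String[] args) throws IOException {
        Path heapFile = Files.createTempFile("cas-heap", ".bin");

        // Write a toy "heap" of ints to disk, loosely analogous to a CAS heap.
        try (FileChannel ch = FileChannel.open(heapFile,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            IntBuffer heap = map.asIntBuffer();
            // Pretend slot 100 holds a type code and slot 101 a feature value.
            heap.put(100, 42);
            heap.put(101, 7);
            map.force(); // flush the mapped region to disk
        }

        // Re-open the file and read arbitrary slots via random access.
        try (FileChannel ch = FileChannel.open(heapFile, StandardOpenOption.READ)) {
            IntBuffer heap =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, 4096).asIntBuffer();
            System.out.println(heap.get(100)); // prints 42
            System.out.println(heap.get(101)); // prints 7
        }
        Files.delete(heapFile);
    }
}
```

A real implementation would of course need to handle growth of the heap,
multiple heaps, and the index files, but the access pattern would be the same.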
The main problem of the binary CAS is that the type system is locked and cannot
be changed. It shares this fact with the in-memory CAS, in which the type
system likewise cannot be augmented after the CAS has been created and locked.
Marshall suggested some time back to relax this and allow compatible changes to
the type system to be made (new types and features) even after the creation of
the CAS.
A secondary problem of the binary CAS is that deleted annotations do not get
physically removed; they are only removed from the indexes. This is another fact
it shares with the in-memory CAS. Some form of garbage collection and IDs that
remain stable across garbage collection runs would be useful, even for the
in-memory CAS.
The main problem of XMI is that it is slow.
A secondary problem is that XMI, being an XML format, cannot represent certain
characters (e.g. most control characters). Another secondary problem is that
random access to data in the XMI file is not possible - it must be streamed.
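To make the character restriction concrete, here is a small stand-alone check
(plain Java, not UIMA code) implementing the XML 1.0 Char production; a CAS
containing a string with, say, U+0000 cannot round-trip through XMI:

```java
// Validity of characters per the XML 1.0 "Char" production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
public class XmlCharCheck {
    static boolean isValidXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlChar('A'));  // prints true
        System.out.println(isValidXmlChar(0x00)); // prints false (NUL)
        System.out.println(isValidXmlChar(0x1B)); // prints false (ESC)
    }
}
```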
> Also as Richard mentioned, a key part of maintaining a store would be to
> also maintain a reliable addressing system for individual CASes and
> annotations. We are using OIDs that we call Feature Structure IDs (FSIDs).
> Our current FSIDs are broken down into 4 levels:
> CollectionID.CASID.AnnotatorID.AnnotationID, where CollectionID refers to a
> collection of CASes that share some quality, the AnnotatorID identifies the
> AE responsible for producing the annotations, and the Annotation ID
> identifies the feature structure within a serialized CAS XMI. This provides
> us with the ability to perform fine- or coarse-grained CRUD operations within
> our CAS store - we can insert/remove/update/delete an Annotation, or create
> a CAS from a list of FSIDs that contains only the feature structures that
> are necessary for any Processing Element.
The FSID format you suggest carries a lot of semantics. I think an ID for a CAS
and an ID which reliably identifies an FS within a CAS would be sufficient to
start with. Both IDs should be independent, with no particular defined
concatenation scheme. If the other levels of IDs are required, it should be up
to an application to define and use them. For example, I might want to use a
URL as collection ID, or not have one at all. The ID as you put it seems to imply that
there is a special relation between an annotator and the annotations it
generates, which may or may not be true or desired.
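For illustration, a minimal sketch of such an address (hypothetical class and
field names, not part of UIMA) that pairs two independent, opaque IDs without
prescribing any concatenation scheme could look like this:

```java
// Hypothetical sketch: a CAS ID plus an FS ID, kept as independent values.
public final class FsAddress {
    private final String casId; // opaque; an application may use a UUID, a URL, ...
    private final long fsId;    // identifies the FS within that CAS

    public FsAddress(String casId, long fsId) {
        this.casId = casId;
        this.fsId = fsId;
    }

    public String getCasId() { return casId; }
    public long getFsId() { return fsId; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FsAddress)) return false;
        FsAddress other = (FsAddress) o;
        return fsId == other.fsId && casId.equals(other.casId);
    }

    @Override
    public int hashCode() { return 31 * casId.hashCode() + Long.hashCode(fsId); }

    public static void main(String[] args) {
        FsAddress a = new FsAddress("urn:example:cas/0001", 42L);
        FsAddress b = new FsAddress("urn:example:cas/0001", 42L);
        System.out.println(a.equals(b)); // prints true
        System.out.println(a.getFsId()); // prints 42
    }
}
```

Collection or annotator scoping would then live in application-level metadata
rather than in the ID itself.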
Are there additional requirements hidden in the FSID format? I could imagine:
- ability to get all FSes produced by a certain annotator across all CASes in
all collections or in a certain collection
- ability to get all CASes in a collection
- ability to get all CASes
Where do you get the annotatorID from? I see no sensible way for the UIMA
framework to provide such an ID. There is also potential for conflict: consider
that analysis engine A creates an FS and analysis engine B updates a primitive
feature in that FS. Assuming that primitives do not get an FSID since they are
not FSes, should the annotatorID of the FS be updated to B, or should it remain
A?
> As for API and Java implementations - would JDBC be sufficient?
So far I have used JDBC only with SQL databases. I don't believe its API is
well suited for dealing with CASes and FSes. E.g. a JDBC result set resembles
tuples from a table, but when we work with FSes, we actually operate on an
object graph. So the CAS provides more of a JPA-like access mechanism than a
JDBC-like one. It also appears that JDBC offers much more functionality than
would be required for a CAS store. How about storage-backed implementations of
FSIndex and maybe of CAS itself? JDBC also has the issue that the query strings
are opaque to the compiler and to IDEs, meaning: no type safety and no
refactoring support.
A while back, we discussed alternative access methods for the CAS in the
uimaFIT project. uimaFIT provides convenience methods to access the in-memory
CAS. Consider this:
    for (Token t : JCasUtil.select(jcas, Token.class)) {
        ...
    }
The uimaFIT API currently doesn't support predicates on features, for example.
We considered the UIMA Constraint API too complex to use and came up with an
"SQLCAS" approach (due to some resemblance to SQL notation, not due to an SQL
backend or JDBC being used) - it was still meant to be plain Java and type-safe.
Steven Bethard did a prototype implementation which supports something like
this:
    DocumentAnnotation document =
        CasQuery.from(this.jCas).select(DocumentAnnotation.class).single();

    Iterator<Sentence> sentences =
        CasQuery.from(this.jCas).select(Sentence.class).iterator();

    Collection<Token> tokens =
        CasQuery.from(this.jCas).select(Token.class).coveredBy(sentence);

    Token token =
        CasQuery.from(this.jCas).select(Token.class).matching(annotation).single();

    Chunk chunk = CasQuery.from(this.jCas).select(Chunk.class).zeroOrOne();
More discussion on this topic, different approaches/syntaxes, and a patch for
uimaFIT can be found in the uimaFIT issue tracker [1].
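The core mechanics of such a fluent, type-safe select can be sketched without
any UIMA dependency at all (toy code, invented names; a real SQLCAS would of
course operate on FS indexes rather than a plain collection):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy fluent query: pick out items of a given class from a mixed collection,
// optionally filtered by a predicate, with the result statically typed.
public class FluentSelect<T> {
    private final List<T> matches;

    private FluentSelect(List<T> matches) { this.matches = matches; }

    static <T> FluentSelect<T> from(Collection<?> source, Class<T> type) {
        return new FluentSelect<>(source.stream()
            .filter(type::isInstance)
            .map(type::cast)
            .collect(Collectors.toList()));
    }

    FluentSelect<T> where(Predicate<T> p) {
        return new FluentSelect<>(
            matches.stream().filter(p).collect(Collectors.toList()));
    }

    List<T> list() { return matches; }

    T single() {
        if (matches.size() != 1) {
            throw new IllegalStateException("expected exactly one match");
        }
        return matches.get(0);
    }

    public static void main(String[] args) {
        List<Object> items = Arrays.asList("token-a", 42, "token-b");
        System.out.println(FluentSelect.from(items, String.class).list().size());
        System.out.println(FluentSelect.from(items, Integer.class).single());
    }
}
```

The compiler checks both the element type and the predicate, which is exactly
what the string-based JDBC queries cannot offer.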
If implementing an existing standard API were a requirement, JPA 2.0 (in
particular the Criteria API) would probably provide a better level of
abstraction than JDBC. JDO might be another (possibly better) alternative [2].
So far, I have only had a very brief rendezvous with JDO on the Google App
Engine and quickly dropped it again in favor of JPA, because I found the latter
more suitable for dependency injection frameworks (the @PersistenceContext
annotation can be used to inject an EntityManager into a class; there is no
equivalent annotation for JDO). JPA seems to get more "love" from tools and
vendors, but JDO might be conceptually a better fit, cf. the following comment
on IBM developerWorks [3] and other interesting comments in the same thread:
PinakiPoddar commented Jan 23 2011:
I had the unique honor of participating in both JDO and
JPA Expert group. It is unique because the two groups have
little overlap. The answers to your question on JDO or JPA
-- the critical difference is about datastore technology
these two specifications aim to support. JPA is limited to
only relational database as data storage, whereas JDO is
agnostic to data store technology.
The unequal prominence of these two specifications that have
similar goals reflects the prevalence of relational database
as a storage of data as compared to other non-relational
storage mechanics. However, resurgence of interest in
non-relational storage systems such as NoSQL may highlight
the importance of JDO's original aim to support multiple
data storage technologies.
To put this in perspective, JPA is also supported on Google's App Engine
datastore, which is backed by the BigTable NoSQL storage engine. Also, the CAS
is quite similar in structure to an RDBMS.
Cheers,
-- Richard
[1] https://code.google.com/p/uimafit/issues/detail?id=65
[2] http://db.apache.org/jdo/jdo_v_jpa.html
[3]
https://www.ibm.com/developerworks/mydeveloperworks/blogs/jdo/entry/december_10_2010_2_12_am38?lang=en_us#comment-1295782222672
--
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD)
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
[email protected]
www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------