Re: Alternate CAS implementation

Scott Cyphers Mon, 13 Apr 2015 08:54:06 -0700

On 04/10/2015 01:49 PM, Nick Hill wrote:

Quoting Eddie Epstein <[email protected]>:
Nick's work, as I understand it, was initially motivated by some UIMA
limitations and performance issues. One was the inability for multiple
threads to concurrently access a CAS read-only (this issue was fixed in
2.6.0, see https://issues.apache.org/jira/browse/UIMA-3674).
A second was a performance issue for an annotator that intensively accessed CAS feature structures. As Scott Cyphers mentions above, the current UIMA implementation has performance problems or functional limitations for other
usage scenarios as well; seems unlikely that Nick's implementation would
address Scott's scenario. It would be useful to hear how Nick's
implementation performs for others.
Those things are true, but the primary motivation was a fully threadsafe CAS and wanting to test my suspicion that having a "heap implemented within a heap" shouldn't logically be needed.
@Scott it would be good to see if the object-based impl helps with your situation. Presumably you were using the built-in XXArray type feature structures? Populating these from an existing array in the new impl involves a single System.arraycopy(), whereas the current impl does a bunch more stuff.
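
As a rough illustration of the bulk-copy point, here is a minimal plain-Java sketch. It is not UIMA code: the class and method names are hypothetical, and the element-wise loop only stands in loosely for an implementation that does per-element work.

```java
import java.util.Arrays;

// Hypothetical sketch, not part of UIMA: contrasts one bulk copy with a
// per-element copy of the same double[] data.
class ArrayCopySketch {

    // Bulk copy: a single System.arraycopy call moves the whole array.
    public static double[] bulkCopy(double[] src) {
        double[] dst = new double[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    // Element-wise copy: one store per element, which is where extra
    // per-element work would accumulate in a heavier implementation.
    public static double[] elementWiseCopy(double[] src) {
        double[] dst = new double[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        double[] values = {1.5, 2.5, 3.5};
        System.out.println(Arrays.toString(bulkCopy(values)));
        System.out.println(Arrays.toString(elementWiseCopy(values)));
    }
}
```

Both methods produce the same contents; the difference is only in how many operations the caller pays for.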

I originally used DoubleArray but was computing on double[] so there was a lot of copying going on, plus the associated allocation and garbage collection overhead. Now I am using double[] directly. If I were using a JNI library that wrapped C vectors, I would be wanting to use them directly, without intermediate copying. I am using external resources to do this, trying to keep my API JCas-like. For example, I use the CAS as a key in the resource to get to the non-CAS structures, which are collections of partial AnnotationFS implementations.
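
The "CAS as a key" arrangement could look roughly like the following. This is only a sketch under assumptions: the class name, the generic key type standing in for the CAS, and the choice of a synchronized WeakHashMap are all hypothetical and are not taken from the actual resource described above.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

// Hypothetical side table: maps a key object (standing in for a CAS) to
// data kept outside it, such as a raw double[].
class SideTableSketch<K, V> {

    // Assumption: weak keys, so an entry can be collected once its key is
    // no longer referenced elsewhere; the wrapper serializes access.
    private final Map<K, V> table =
            Collections.synchronizedMap(new WeakHashMap<>());
    private final Supplier<V> factory;

    public SideTableSketch(Supplier<V> factory) {
        this.factory = factory;
    }

    // Returns the structure for this key, creating it on first use.
    public V forKey(K key) {
        return table.computeIfAbsent(key, k -> factory.get());
    }
}
```

A caller would hold one such table per resource and call `forKey(cas)` wherever it would otherwise read an array-valued feature, working on the returned double[] in place.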

Marshall's rework of UIMA indexes has exposed a number of implementation
bugs, mostly around edge cases. But it also exposed a number of annotators
that were violating the UIMA contract by modifying key values of an FS
without first removing the FS from the index. It has been difficult to
harden UIMA and protect users against themselves for a UIMA which only
allows a single thread read/write CAS access. I am concerned about opening
the CAS read/write to multiple concurrent threads.
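
The hazard of changing a key value while an object is still indexed can be shown with an analogy in plain Java. The sketch below uses java.util.TreeSet rather than UIMA indexes, and the Span class is a hypothetical stand-in for a feature structure; it makes no claim about how UIMA indexes behave internally.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Analogy only: a sorted set plays the role of an index, and Span.begin
// plays the role of a key feature.
class IndexKeySketch {

    static final class Span {
        int begin;
        Span(int begin) { this.begin = begin; }
    }

    static TreeSet<Span> newIndex() {
        return new TreeSet<>(Comparator.comparingInt((Span s) -> s.begin));
    }

    // Violates the contract: the key changes while the object is indexed,
    // so the lookup walks the tree using the new key and misses the node.
    public static boolean mutateInPlace() {
        TreeSet<Span> index = newIndex();
        Span a = new Span(1), b = new Span(5), c = new Span(9);
        index.add(b);
        index.add(a);
        index.add(c);
        a.begin = 20;
        return index.contains(a);
    }

    // Follows the contract: remove, modify the key, then add back.
    public static boolean removeMutateReAdd() {
        TreeSet<Span> index = newIndex();
        Span a = new Span(1), b = new Span(5), c = new Span(9);
        index.add(b);
        index.add(a);
        index.add(c);
        index.remove(a);
        a.begin = 20;
        index.add(a);
        return index.contains(a) && index.last() == a;
    }
}
```

In the first method the object is still physically in the set but can no longer be found; in the second the set stays consistent.
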
Accessing the CAS from multiple threads is not something that anyone would need to do if they did not want/need to! In fact, the fact that pooling CASes is no longer needed should make things generally much simpler for users. I.e. not having to worry about when to release a CAS or accessing/corrupting a CAS which had previously been 'released'. Additionally, users can get themselves into much more trouble with the CAS heaps and LL API, so that reasoning would also support deprecation/removal of those?
I feel that a threadsafe CAS opens up all sorts of possibilities (I can elaborate if necessary), but would also be interested to hear others' views on this.

It's generally better to isolate the locking to where it's needed, which is one reason Sun switched from their initial thread-safe collection implementation to the newer unsynchronized collections, letting applications decide where synchronization was needed when it was needed.
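
A minimal sketch of that idea, using an unsynchronized ArrayList (the synchronized counterpart from the original JDK collections would be Vector). The class and method names are hypothetical; the point is only that the caller pays for locking where data is actually shared.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the same unsynchronized collection used with and
// without application-level locking.
class LockingSketch {

    // Single-threaded use: no lock is taken at all.
    public static int sumLocal(int n) {
        List<Integer> local = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            local.add(i);
        }
        int sum = 0;
        for (int v : local) {
            sum += v;
        }
        return sum;
    }

    // Shared use: the application chooses the lock and its scope.
    public static int fillShared(int threads, int perThread) {
        List<Integer> shared = new ArrayList<>();
        Object lock = new Object();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    synchronized (lock) {
                        shared.add(i);
                    }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join(); // join makes the workers' writes visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared.size();
    }
}
```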

The same can be said of serialization. Currently, as far as I can tell, feature structs in the CAS are serialized during indexing and deserialized during use. This makes it easy to pass a CAS from one process to another, but systems where that is the exception incur significant overhead, both in terms of the serialization itself, and the constraints that support the serialization, such as the type system. I think there are large systems that depend on the serialization and wouldn't want to jeopardize their operation (although one of them is probably very Jeopardized!)
