On 04/10/2015 01:49 PM, Nick Hill wrote:
Quoting Eddie Epstein <[email protected]>:

Nick's work, as I understand it, was initially motivated by some UIMA
limitations and performance issues. One was the inability for multiple
threads to concurrently access a CAS read-only (this issue was fixed in
2.6.0, see https://issues.apache.org/jira/browse/UIMA-3674).

A second was a performance issue for an annotator that intensively accessed
CAS feature structures. As Scott Cyphers mentions above, the current UIMA
implementation has performance problems or functional limitations for other
usage scenarios as well; it seems unlikely that Nick's implementation would
address Scott's scenario. It would be useful to hear how Nick's
implementation performs for others.

Those things are true, but the primary motivation was to get a fully thread-safe CAS and to test my suspicion that a "heap implemented within a heap" shouldn't logically be needed.

@Scott it would be good to see if the object-based impl helps with your situation. Presumably you were using the built-in XXArray-type feature structures? Populating these from an existing array in the new impl involves a single System.arraycopy() call, whereas the current impl does a bunch more stuff.
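
To make the single-copy point concrete, here is a minimal sketch in plain Java. DoubleArraySketch and copyFromArray are hypothetical names made up for illustration, not classes or methods from either UIMA implementation; the only real API used is System.arraycopy.

```java
// Hypothetical stand-in for an object-based array feature structure.
// This is NOT a UIMA class; it only illustrates the bulk-copy idea.
final class DoubleArraySketch {
    // The values live directly in an ordinary Java array.
    private final double[] values;

    DoubleArraySketch(int length) {
        this.values = new double[length];
    }

    // Populate from an existing array with one System.arraycopy call,
    // instead of setting elements one at a time.
    void copyFromArray(double[] src, int srcOffset, int destOffset, int length) {
        System.arraycopy(src, srcOffset, values, destOffset, length);
    }

    double get(int i) {
        return values[i];
    }

    int size() {
        return values.length;
    }
}
```

Note that this still copies once; avoiding the copy entirely would need the feature structure to wrap the caller's array.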

I originally used DoubleArray but was computing on double[] so there was a lot of copying going on, plus the associated allocation and garbage collection overhead. Now I am using double[] directly. If I were using a JNI library that wrapped C vectors, I would be wanting to use them directly, without intermediate copying. I am using external resources to do this, trying to keep my API JCas-like. For example, I use the CAS as a key in the resource to get to the non-CAS structures, which are collections of partial AnnotationFS implementations.
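
A rough sketch of the "CAS as a key" arrangement, under stated assumptions: PerCasVectors and vectorFor are hypothetical names, the key type is left generic so the example does not depend on UIMA, and the weak-keyed map is one possible choice for letting entries go away with the CAS, not necessarily what the resource described above does.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical external resource holding non-CAS data per CAS.
// K stands in for the CAS type; nothing here is UIMA API.
final class PerCasVectors<K> {
    // Weak keys: an entry becomes collectable once the key object is
    // no longer strongly referenced elsewhere. Lookup uses the key's
    // equals/hashCode, so this assumes identity-based equality on K.
    private final Map<K, double[]> byCas =
            Collections.synchronizedMap(new WeakHashMap<>());

    // Return the raw double[] for this key, allocating it on first use.
    // Callers compute on the array directly, with no intermediate copy.
    double[] vectorFor(K cas, int length) {
        return byCas.computeIfAbsent(cas, k -> new double[length]);
    }
}
```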

Marshall's rework of UIMA indexes has exposed a number of implementation
bugs, mostly around edge cases. But it also exposed a number of annotators
that were violating the UIMA contract by modifying key values of an FS
without first removing the FS from the index. It has been difficult to
harden UIMA and protect users against themselves for a UIMA which only
allows a single thread read/write CAS access. I am concerned about opening
the CAS read/write to multiple concurrent threads.
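
The contract violation mentioned above has a close analogue in plain Java, sketched here with java.util.TreeSet. This is only an analogy: Item and its begin field are hypothetical, and the code says nothing about how UIMA indexes are actually implemented.

```java
import java.util.Comparator;
import java.util.TreeSet;

final class KeyMutationSketch {
    // Hypothetical stand-in for a feature structure with one key feature.
    static final class Item {
        int begin;
        Item(int begin) { this.begin = begin; }
    }

    // A sorted "index" ordered by the key feature.
    static TreeSet<Item> newIndex() {
        return new TreeSet<>(Comparator.comparingInt((Item i) -> i.begin));
    }

    // Violation: the key changes while the item is still in the sorted set,
    // so a later lookup searches the wrong branch of the tree.
    static boolean mutateInPlace() {
        TreeSet<Item> index = newIndex();
        Item other = new Item(2);
        Item item = new Item(1);
        index.add(other);
        index.add(item);
        item.begin = 10;              // key modified without removing first
        return index.contains(item);  // lookup no longer finds the item
    }

    // Contract-respecting sequence: remove, modify the key, add back.
    static boolean removeModifyAdd() {
        TreeSet<Item> index = newIndex();
        Item other = new Item(2);
        Item item = new Item(1);
        index.add(other);
        index.add(item);
        index.remove(item);
        item.begin = 10;
        index.add(item);
        return index.contains(item);
    }
}
```

With a single thread the second sequence is easy to follow; with several writers the remove, modify and add steps would also have to be atomic as a group.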

Nobody would need to access the CAS from multiple threads unless they wanted or needed to! In fact, no longer needing to pool CASes should make things generally much simpler for users, i.e. they would not have to worry about when to release a CAS, or about accessing or corrupting a CAS that had previously been 'released'. Additionally, users can get themselves into much more trouble with the CAS heaps and the LL API, so wouldn't that reasoning also support deprecating or removing those?

I feel that a threadsafe CAS opens up all sorts of possibilities (I can elaborate if necessary), but would also be interested to hear others' views on this.

It's generally better to isolate the locking to where it's needed, which is one reason Sun switched from their initial thread-safe collection implementation to the newer unsynchronized collections, letting applications decide where synchronization was needed when it was needed.
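
As a small illustration of application-chosen locking, the sketch below keeps an unsynchronized ArrayList and takes one lock per batch, where a synchronized collection would lock on every call. ExternalLockingSketch and its methods are hypothetical names for this example only.

```java
import java.util.ArrayList;
import java.util.List;

final class ExternalLockingSketch {
    // Unsynchronized collection; the application decides where to lock.
    private final List<Integer> items = new ArrayList<>();
    private final Object lock = new Object();

    // One lock acquisition covers the whole batch of adds.
    void addBatch(int start, int count) {
        synchronized (lock) {
            for (int i = 0; i < count; i++) {
                items.add(start + i);
            }
        }
    }

    int size() {
        synchronized (lock) {
            return items.size();
        }
    }

    // Two writer threads, each adding 1000 items under the shared lock.
    static int runTwoWriters() {
        ExternalLockingSketch sketch = new ExternalLockingSketch();
        Thread a = new Thread(() -> sketch.addBatch(0, 1000));
        Thread b = new Thread(() -> sketch.addBatch(1000, 1000));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sketch.size();
    }
}
```

Code that only ever touches the list from one thread can skip the lock altogether, which is the cost argument being made above.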

The same can be said of serialization. Currently, as far as I can tell, feature structs in the CAS are serialized during indexing and deserialized during use. This makes it easy to pass a CAS from one process to another, but systems where that is the exception incur significant overhead, both in terms of the serialization itself, and the constraints that support the serialization, such as the type system. I think there are large systems that depend on the serialization and wouldn't want to jeopardize their operation (although one of them is probably very Jeopardized!)
