On 04/10/2015 01:49 PM, Nick Hill wrote:
Quoting Eddie Epstein <[email protected]>:
Nick's work, as I understand it, was initially motivated by some UIMA
limitations and performance issues. One was the inability for multiple
threads to concurrently access a CAS read-only (this issue was fixed in
2.6.0, see https://issues.apache.org/jira/browse/UIMA-3674).
A second was a performance issue for an annotator that intensively
accessed CAS feature structures. As Scott Cyphers mentions above, the
current UIMA implementation has performance problems or functional
limitations for other usage scenarios as well; it seems unlikely that
Nick's implementation would address Scott's scenario. It would be useful
to hear how Nick's implementation performs for others.
Those things are true, but the primary motivation was a fully
threadsafe CAS and wanting to test my suspicion that having a "heap
implemented within a heap" shouldn't logically be needed.
@Scott it would be good to see if the object-based impl helps with
your situation. Presumably you were using the built-in XXArray type
feature structures? Populating these from an existing array in the new
impl involves a single System.arraycopy() whereas the current impl
does a bunch more stuff.
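
To make the single-copy point concrete, here is a minimal sketch in plain Java. DoubleArrayHolder and copyFromArray are hypothetical names standing in for an object-based array feature structure; they are not taken from either UIMA implementation. The only real API used is System.arraycopy.

```java
// Hypothetical stand-in for an array-valued feature structure (not a UIMA class).
final class DoubleArrayHolder {
    private final double[] values;

    DoubleArrayHolder(int length) {
        this.values = new double[length];
    }

    // Bulk-populate from an existing array with one System.arraycopy call.
    void copyFromArray(double[] src, int srcOffset, int destOffset, int length) {
        System.arraycopy(src, srcOffset, values, destOffset, length);
    }

    double get(int i) {
        return values[i];
    }

    int size() {
        return values.length;
    }

    public static void main(String[] args) {
        double[] computed = {1.5, 2.5, 3.5};
        DoubleArrayHolder holder = new DoubleArrayHolder(computed.length);
        holder.copyFromArray(computed, 0, 0, computed.length);
        System.out.println(holder.size() + " values copied");
    }
}
```

Note that this is still one copy per array; it does not remove the copy for code that computes on double[] directly.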
I originally used DoubleArray but was computing on double[] so there was
a lot of copying going on, plus the associated allocation and garbage
collection overhead. Now I am using double[] directly. If I were using
a JNI library that wrapped C vectors, I would be wanting to use them
directly, without intermediate copying. I am using external resources
to do this, trying to keep my API JCas-like. For example, I use the CAS
as a key in the resource to get to the non-CAS structures, which are
collections of partial AnnotationFS implementations.
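
Roughly, the shape of that resource is something like the sketch below. This is an illustration only, not my actual code: SideTable is a hypothetical name, the type parameter K stands in for the CAS, and the WeakHashMap is an assumption so that entries can disappear once a key is no longer referenced. It is single-threaded as written.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical external resource: maps a CAS-like key to non-CAS data.
final class SideTable<K> {
    // Weak keys: an entry becomes collectable once its key is unreachable.
    private final Map<K, List<double[]>> vectorsByCas =
            new WeakHashMap<K, List<double[]>>();

    // Store the array itself; no intermediate copy is made.
    void add(K cas, double[] vector) {
        vectorsByCas.computeIfAbsent(cas, k -> new ArrayList<>()).add(vector);
    }

    // Return the vectors recorded for this key, or an empty list.
    List<double[]> get(K cas) {
        List<double[]> found = vectorsByCas.get(cas);
        return found == null ? Collections.<double[]>emptyList() : found;
    }
}
```

The annotator then works on the double[] it gets back, with nothing copied in or out of the CAS.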
Marshall's rework of UIMA indexes has exposed a number of implementation
bugs, mostly around edge cases. But it also exposed a number of
annotators that were violating the UIMA contract by modifying key values
of an FS without first removing the FS from the index. It has been
difficult to harden UIMA and protect users against themselves for a UIMA
which only allows a single thread read/write CAS access. I am concerned
about opening the CAS read/write to multiple concurrent threads.
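
The contract violation described above can be reproduced with any sorted collection. The sketch below uses java.util.TreeSet as an analogy for a sorted index; Span and its begin field are hypothetical stand-ins for a feature structure and its key feature, and none of this is UIMA code.

```java
import java.util.Comparator;
import java.util.TreeSet;

final class IndexContractDemo {
    // Hypothetical stand-in for a feature structure with one key feature.
    static final class Span {
        int begin;

        Span(int begin) {
            this.begin = begin;
        }
    }

    // A sorted "index" ordered by the mutable key.
    static TreeSet<Span> newIndex() {
        return new TreeSet<>(Comparator.comparingInt((Span s) -> s.begin));
    }

    // Violation: the key changes while the element is still in the index,
    // so the element sits at a position that no longer matches its key.
    static boolean mutateInPlace() {
        TreeSet<Span> index = newIndex();
        Span a = new Span(10);
        index.add(a);
        index.add(new Span(20));
        index.add(new Span(30));
        a.begin = 40;
        return index.contains(a);
    }

    // Contract-respecting sequence: remove, modify the key, add back.
    static boolean removeMutateAdd() {
        TreeSet<Span> index = newIndex();
        Span a = new Span(10);
        index.add(a);
        index.add(new Span(20));
        index.add(new Span(30));
        index.remove(a);
        a.begin = 40;
        index.add(a);
        return index.contains(a) && index.last() == a;
    }
}
```

Here mutateInPlace() loses track of the element even with a single thread, while removeMutateAdd() keeps the index consistent.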
Accessing the CAS from multiple threads is not something that anyone
would need to do if they did not want/need to! In fact, no longer
needing to pool CASes should make things generally much simpler for
users, i.e. not having to worry about when to release a CAS or about
accessing/corrupting a CAS which had previously been 'released'.
Additionally, users can get themselves into much more trouble with the
CAS heaps and LL API, so that reasoning would also support
deprecation/removal of those?
I feel that a threadsafe CAS opens up all sorts of possibilities (I
can elaborate if necessary), but would also be interested to hear
others' views on this.
It's generally better to isolate the locking to where it's needed, which
is one reason Sun switched from their initial thread-safe collection
implementation to the newer unsynchronized collections, letting
applications decide where synchronization was needed when it was needed.
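
As a sketch of what I mean by isolating the locking (LocalizedLocking and collect are made-up names for illustration): each worker fills an unsynchronized ArrayList that only it can see, and the lock is taken once per worker, at the single point where results are shared.

```java
import java.util.ArrayList;
import java.util.List;

final class LocalizedLocking {
    // Each worker builds its results without locking, then merges once.
    static List<Integer> collect(int threads, int perThread) {
        final List<Integer> merged = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            Thread worker = new Thread(() -> {
                // Thread-confined list: no synchronization needed here.
                List<Integer> local = new ArrayList<>(perThread);
                for (int i = 0; i < perThread; i++) {
                    local.add(base + i);
                }
                // The only shared access, guarded by one lock.
                synchronized (merged) {
                    merged.addAll(local);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return merged;
    }
}
```

With a collection that synchronizes every call, the per-element adds would each pay for a lock that is only needed at the merge.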
The same can be said of serialization. Currently, as far as I can tell,
feature structs in the CAS are serialized during indexing and
deserialized during use. This makes it easy to pass a CAS from one
process to another, but systems where that is the exception incur
significant overhead, both in terms of the serialization itself, and the
constraints that support the serialization, such as the type system. I
think there are large systems that depend on the serialization and
wouldn't want to jeopardize their operation (although one of them is
probably very Jeopardized!)