Hi all, I work with Marshall and Eddie and have been using UIMA for some time but am new to the mailing list.

As an experiment, I re-implemented the (Java) CAS internals so that each feature structure corresponds to a single Java object instead of using the custom "heaps" (monolithic arrays), and so that indices are built from standard Java SDK (concurrent) collection classes.
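
To make the idea concrete, here is a minimal sketch of the general approach. It is not code from the prototype: the class names (SketchAnnotation, SketchIndex, SketchDemo) are hypothetical, the use of ConcurrentSkipListSet is only an assumption about which concurrent collection might back a sorted index, and the ordering (begin ascending, end descending) is loosely modeled on an annotation index.

```java
import java.util.Iterator;
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical feature structure: one plain Java object, features are fields.
class SketchAnnotation {
    private static final AtomicInteger NEXT_ID = new AtomicInteger();
    final int id = NEXT_ID.getAndIncrement(); // tie-breaker so distinct FSs never compare as equal
    final int begin;
    final int end;
    volatile String label; // mutable feature; volatile so updates are visible to other threads

    SketchAnnotation(int begin, int end, String label) {
        this.begin = begin;
        this.end = end;
        this.label = label;
    }
}

// Hypothetical sorted index built on a standard concurrent collection.
class SketchIndex {
    // begin ascending, then end descending, then creation order
    private static final Comparator<SketchAnnotation> ORDER = (a, b) -> {
        int c = Integer.compare(a.begin, b.begin);
        if (c != 0) return c;
        c = Integer.compare(b.end, a.end);
        if (c != 0) return c;
        return Integer.compare(a.id, b.id);
    };

    private final ConcurrentSkipListSet<SketchAnnotation> set =
            new ConcurrentSkipListSet<>(ORDER);

    void add(SketchAnnotation fs) { set.add(fs); }
    boolean remove(SketchAnnotation fs) { return set.remove(fs); } // object becomes GC-eligible once unreferenced
    int size() { return set.size(); }
    Iterator<SketchAnnotation> iterator() { return set.iterator(); }
}

class SketchDemo {
    // Two threads add feature structures to the same index without external locking.
    static int addFromTwoThreads() {
        SketchIndex index = new SketchIndex();
        Runnable writer = () -> {
            for (int i = 0; i < 1000; i++) {
                index.add(new SketchAnnotation(i, i + 5, "tok"));
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println(addFromTwoThreads()); // prints 2000
    }
}
```

In this sketch no heap array is allocated or resized, and the two writer threads need no explicit synchronization because the backing collection handles concurrent inserts.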

The original motivation was to make the CAS thread-safe, but I think there are other benefits, the biggest of which may be the reduction/simplification of the codebase.

This new impl should be fully compatible with all of the existing CAS APIs, with a few exceptions (see below); i.e., in most cases it can be a drop-in replacement for uima-core.jar. Existing JCas cover classes can be used but must be recompiled. I also included a "compatibility layer" for the low-level CAS API so that existing usage of it should still work, but removing the heaps of course obviates the need for that API.

Summary of advantages:
- Drastic simplification of code - most proprietary data structure impls removed, many other classes removed, index/index repo impls are about 25% of the size of the heap versions (good for future enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work on the same CAS concurrently - reading, writing and iterating over feature structures. Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed-size CAS backing arrays, no large up-front memory cost for CASes - pooling them should no longer be necessary
- Unlike the current heap impl, when a FS is removed from CAS indices its space is actually freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists) "is" the feature structure
- Significantly better performance (speed) for many use-cases, especially where there is heavy access of CAS data
- Usage of standard Java data structure classes means it can benefit more "for free" from ongoing improvements in the Java SDK and from hardware optimizations targeted at these classes


Functionality not yet supported:
- Binary serialization/deserialization
- C/C++ framework (requires binary serialization)
- "Delta" CAS related function including CAS markers
- Index "auto protection" (recent 2.7 feature)

- Snapshot iterators currently return regular iterators (but all iterators are safe to use concurrently with modification)
- Multiple classloaders haven't been tested
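
As a rough illustration of the iterator point above, the following sketch contrasts a live iterator with a snapshot on a standard concurrent collection. It is not the prototype's code: IteratorSketch is a hypothetical name, Integer elements stand in for feature structures, and ConcurrentSkipListSet is an assumed backing collection. Its iterators are weakly consistent, so modifying the set during traversal does not throw ConcurrentModificationException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

class IteratorSketch {
    // Returns { elements visited, index size afterwards, snapshot size }.
    static int[] run() {
        ConcurrentSkipListSet<Integer> index = new ConcurrentSkipListSet<>();
        for (int i = 0; i < 10; i++) {
            index.add(i);
        }

        // A true snapshot is a copy taken up front; later changes do not affect it.
        List<Integer> snapshot = new ArrayList<>(index);

        int visited = 0;
        for (Integer fs : index) {
            // Remove the element just returned while still iterating.
            // With a plain TreeSet this pattern would typically fail with
            // ConcurrentModificationException on the next step.
            index.remove(fs);
            visited++;
        }
        return new int[] { visited, index.size(), snapshot.size() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // prints [10, 0, 10]
    }
}
```

The live iterator completes safely but reflects the collection as it changes, whereas the copied list keeps the original contents; that difference is what a dedicated snapshot iterator would need to provide.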

There are also various other small loose ends and some cleanup to do.


I was hoping to see if there's interest from the community in taking this further, maybe even as a replacement for the current impl in a future version of uima-core.

I'm not sure of the best way to share the code, but it would be great to have a branch in the shared SCM repo where the current prototype could be reviewed and collaboratively evolved to fill the remaining gaps.

Would welcome any comments or questions!

Thanks,
Nick
