Nick Hill created UIMA-4329:
-------------------------------
Summary: Object-based CAS implementation proposal/prototype
Key: UIMA-4329
URL: https://issues.apache.org/jira/browse/UIMA-4329
Project: UIMA
Issue Type: Brainstorming
Components: Core Java Framework
Reporter: Nick Hill
Priority: Minor
I have been experimenting with a simplified CAS implementation where each
feature structure is an object and the indices are based on standard Java SDK
concurrent collection classes. This replaces the complex custom array-based
heaps and index implementations.
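As a rough illustration of the idea (not the prototype's actual code), a feature structure can be an ordinary Java object and an index can be a thin wrapper around a JDK concurrent collection. The class names SketchAnnotation and SketchAnnotationIndex are hypothetical, and the ordering (begin ascending, then end descending) is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical feature structure: a plain object instead of slots in a heap array.
final class SketchAnnotation {
    private static final AtomicLong NEXT_ID = new AtomicLong();

    // Unique id used only as a tie-breaker, so two distinct objects never compare as equal.
    final long id = NEXT_ID.incrementAndGet();
    final String type;
    final int begin;
    final int end;

    SketchAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical sorted index backed by a standard JDK concurrent collection.
final class SketchAnnotationIndex {
    // Assumed ordering for this sketch: begin ascending, end descending, then id.
    private static final Comparator<SketchAnnotation> ORDER =
        Comparator.comparingInt((SketchAnnotation a) -> a.begin)
            .thenComparing(Comparator.comparingInt((SketchAnnotation a) -> a.end).reversed())
            .thenComparingLong(a -> a.id);

    private final ConcurrentSkipListSet<SketchAnnotation> set =
        new ConcurrentSkipListSet<>(ORDER);

    boolean add(SketchAnnotation fs) {
        return set.add(fs);
    }

    // After removal the index no longer references the object, so it can be
    // garbage collected once nothing else refers to it.
    boolean remove(SketchAnnotation fs) {
        return set.remove(fs);
    }

    // Copies the current contents, in index order, into a list.
    List<SketchAnnotation> snapshot() {
        return new ArrayList<>(set);
    }
}
```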
The primary motivation was to make the CAS thread-safe so that multiple
annotators could process one concurrently, but I think there are a number of
other benefits.
Summary of advantages:
- Drastic simplification of code - most proprietary data structure impls
removed, many other classes removed, index/index repo impls are about 25% of
the size of the heap versions (good for future enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work on the
same CAS concurrently - reading, writing and iterating over feature structures.
Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS backing arrays,
no large up-front memory cost for CASes - pooling them should no longer be
necessary
- Unlike the current heap impl, when a FS is removed from CAS indices its
space is actually freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists) "is" the
feature structure
- Significantly better performance (speed) for many use-cases, especially where
there is heavy access of CAS data
- Usage of standard Java data structure classes means it can benefit more "for
free" from ongoing improvements in the Java SDK and from hardware optimizations
targeted at these classes
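To make the thread-safety point concrete, here is a minimal sketch of two logically independent "annotators" writing to and iterating over one shared index at the same time. It uses only JDK classes; ConcurrentAnnotatorSketch and the string stand-ins for feature structures are hypothetical and are not taken from the prototype.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConcurrentAnnotatorSketch {

    // Runs two annotators against one shared index and returns the final index size.
    static int runTwoAnnotators(int perAnnotator) {
        // Stand-in for a CAS index: a concurrent set of feature structure labels.
        Set<String> index = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<?> tokens = pool.submit(() -> annotate(index, "Token", perAnnotator));
            Future<?> entities = pool.submit(() -> annotate(index, "Entity", perAnnotator));
            tokens.get();   // rethrows any failure from the annotator thread
            entities.get();
        } catch (Exception e) {
            throw new IllegalStateException("annotator failed", e);
        } finally {
            pool.shutdown();
        }
        return index.size();
    }

    private static void annotate(Set<String> index, String type, int count) {
        for (int i = 0; i < count; i++) {
            index.add(type + "#" + i);
            // Iterate while the other thread may be writing. The iterators of the
            // JDK concurrent collections are weakly consistent, so this does not
            // throw ConcurrentModificationException.
            for (String fs : index) {
                if (fs.isEmpty()) {
                    throw new IllegalStateException("unexpected empty entry");
                }
            }
        }
    }
}
```

With a plain HashSet in place of the concurrent set, the same loop would be unsafe and could fail with ConcurrentModificationException or corrupt the set.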
I was hoping to see if there's interest from the community in taking this
further, maybe even as a replacement for the current impl in a future version
of uima-core. There has already been some discussion on the mailing list under
the subject "Alternate CAS implementation".
I'm attaching the current prototype, which should support most existing UIMA
functionality with the exception of:
- Binary serialization/deserialization
- C/C++ framework (requires binary serialization)
- "Delta" CAS related functionality, including CAS markers
- Index "auto protection" (recent 2.7 feature)
Note I don't mean to imply these things can't be supported, just that they
aren't yet.
Where these things aren't used it should be possible to try out the attached
uima-core.jar as a drop-in replacement with existing apps/frameworks. An
important caveat though is that any existing JCas cover classes will need
recompiling with the new jar (but not re-JCasGenning).
I'll also attach the code. I started by basically ripping out the CAS heaps, so
there's a lot of code which is just commented out (e.g. in CASImpl.java). Lots
of cleanup/tidy-up is still needed, and there are various places which still need
fixing for thread safety (e.g. synchronization around some existing
create-on-first-access logic; this is separate from the indices though). But
those things shouldn't affect existing usage. A convention I followed was not
to rename modified classes (e.g. CASImpl), but where an equivalent impl was
created from scratch I did give it a new name starting with "CC" (e.g.
FeatureStructureImpl is now CCFeatureStructure). The "CC" stands for "concurrent
CAS". I have kept it in sync with the latest compatible changes in the
uima-core stream, apart from those related to the non-impl'd functions
mentioned above.
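For the create-on-first-access spots mentioned above, one common way to make them thread-safe is to let ConcurrentHashMap.computeIfAbsent do the locking. This is only a sketch of that general technique; LazyViewRegistrySketch and its methods are hypothetical names and do not correspond to anything in CASImpl.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

final class LazyViewRegistrySketch {
    private final ConcurrentHashMap<String, List<String>> views = new ConcurrentHashMap<>();
    private final AtomicInteger creations = new AtomicInteger();

    // Returns the entry for the given name, creating it on first access.
    // ConcurrentHashMap.computeIfAbsent is atomic, so the factory runs at most
    // once per key even if several threads ask for the same name at once.
    List<String> getOrCreateView(String name) {
        return views.computeIfAbsent(name, n -> {
            creations.incrementAndGet();
            return new CopyOnWriteArrayList<>();
        });
    }

    // Number of times an entry was actually created (for illustration only).
    int creationCount() {
        return creations.get();
    }
}
```

The unsafe pattern this replaces is the usual "if (map.get(k) == null) map.put(k, create())", where two threads can both see null and each create their own instance.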
Most of the "valid" unit tests pass. Some are tied to the internals and no
longer apply; many don't compile because they use binary serialization and/or
delta CAS related classes, which I removed for the time being. A few others I had
to generalize a bit because, for example, they assumed a specific order in places
where the order should be arbitrary, or for similar reasons.
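As an example of the kind of generalization meant here, a test that compared results as ordered lists can compare them as multisets instead. The helper below is a hypothetical sketch, not code from the attached test changes.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

final class UnorderedAssertSketch {

    // True if both collections hold the same elements with the same multiplicities,
    // regardless of iteration order.
    static <T> boolean sameElements(Collection<T> expected, Collection<T> actual) {
        Map<T, Integer> counts = new HashMap<>();
        for (T e : expected) {
            counts.merge(e, 1, Integer::sum);
        }
        for (T a : actual) {
            Integer c = counts.get(a);
            if (c == null) {
                return false; // element not expected, or seen too many times
            }
            if (c == 1) {
                counts.remove(a);
            } else {
                counts.put(a, c - 1);
            }
        }
        return counts.isEmpty(); // anything left over was expected but missing
    }
}
```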
md5 checksums:
{{4fd19b5f804fe8d505f697240c8e0366 *uima-core.jar}}
{{51826aa44111b7f6e1fa307393eda8f4 *uimaj-core_obj.tar.gz}}
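To check a downloaded attachment against the checksums above, something along these lines should work using only the JDK. Md5CheckSketch is a hypothetical helper name, and the file is assumed to be in the current directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5CheckSketch {

    // Returns the MD5 digest of the given bytes as lowercase hex.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Reads the whole file into memory, which is fine for a jar-sized download.
    static String md5Hex(Path file) {
        try {
            return md5Hex(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compares the file's digest with an expected hex string, ignoring case.
    static boolean matches(Path file, String expectedHex) {
        return expectedHex.equalsIgnoreCase(md5Hex(file));
    }
}
```

For example, Md5CheckSketch.matches(java.nio.file.Paths.get("uima-core.jar"), "4fd19b5f804fe8d505f697240c8e0366") should return true for an intact copy of the jar.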