Hi all, I work with Marshall and Eddie and have been using UIMA for
some time but am new to the mailing list.
As an experiment, I re-implemented the (Java) CAS internals such that
each feature structure corresponds to a single Java object instead of
using the custom "heaps" (monolithic arrays), and indices are built
from standard Java SDK (concurrent) collection classes.
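
To make the idea concrete, here is a minimal sketch of the general
approach. It is not the prototype's code: the class names
(ObjectCasSketch, Annotation, SortedIndex) and the ordering rule are
assumptions chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

public class ObjectCasSketch {

    /** Hypothetical feature structure: one ordinary Java object, no heap offsets. */
    static final class Annotation {
        private static final AtomicLong NEXT_ID = new AtomicLong();
        final long id = NEXT_ID.getAndIncrement(); // tie-breaker so equal spans can coexist
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical sorted index backed by a standard concurrent collection. */
    static final class SortedIndex {
        // Assumed ordering: begin ascending, then longer spans first, then creation order.
        private static final Comparator<Annotation> ORDER = (a, b) -> {
            if (a.begin != b.begin) return Integer.compare(a.begin, b.begin);
            if (a.end != b.end) return Integer.compare(b.end, a.end);
            return Long.compare(a.id, b.id);
        };

        private final ConcurrentSkipListSet<Annotation> entries =
                new ConcurrentSkipListSet<>(ORDER);

        void add(Annotation fs) { entries.add(fs); }

        // After removal, the object is garbage-collectable once nothing else references it.
        void remove(Annotation fs) { entries.remove(fs); }

        List<String> describe() {
            List<String> out = new ArrayList<>();
            for (Annotation a : entries) {
                out.add(a.type + "[" + a.begin + "," + a.end + "]");
            }
            return out;
        }
    }

    /** Adds three feature structures, removes one, and lists what remains in index order. */
    static List<String> demo() {
        SortedIndex index = new SortedIndex();
        Annotation firstToken = new Annotation("Token", 0, 5);
        index.add(new Annotation("Token", 6, 11));
        index.add(firstToken);
        index.add(new Annotation("Sentence", 0, 11));
        index.remove(firstToken);
        return index.describe();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [Sentence[0,11], Token[6,11]]
    }
}
```

The per-object id is only there because a set-based index would
otherwise treat two annotations with the same span as duplicates.
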
The original motivation was to make the CAS threadsafe but I think
there are other benefits, the biggest of which may be
reduction/simplification of the codebase.
This new impl should be fully compatible with all of the existing CAS
APIs, with a few exceptions (see below), i.e. in most cases it can be
a drop-in replacement for uima-core.jar. Existing JCas cover classes
can be used but must be recompiled. I also included a "compatibility
layer" for the low level CAS API so that existing usage of it should
still work, but removing the heaps of course obviates the need for it.
Summary of advantages:
- Drastic simplification of code - most proprietary data structure
impls removed, many other classes removed, index/index repo impls are
about 25% of the size of the heap versions (good for future
enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work
on the same CAS concurrently - reading, writing and iterating over
feature structures. Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS backing
arrays, no large up-front memory cost for CASes - pooling them should
no longer be necessary
- Unlike the current heap impl, when an FS is removed from CAS indices
its space is actually freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists)
"is" the feature structure
- Significantly better performance (speed) for many use-cases,
especially where there is heavy access of CAS data
- Usage of standard Java data structure classes means it can benefit
more "for free" from ongoing improvements in the Java SDK and from
hardware optimizations targeted at these classes
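
As a rough illustration of the thread-safety point, the sketch below
has two stand-in "annotators" adding to one shared index at the same
time. It assumes the index is backed by a java.util.concurrent
collection; the class and method names are hypothetical, and plain
integers stand in for feature structures.

```java
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentAnnotatorsSketch {

    /** Runs two hypothetical annotators against one shared index; returns its final size. */
    static int runAnnotators(int perAnnotator) {
        ConcurrentSkipListSet<Integer> sharedIndex = new ConcurrentSkipListSet<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Annotator A adds even values, annotator B adds odd values, with no external locking.
        pool.execute(() -> {
            for (int i = 0; i < perAnnotator; i++) sharedIndex.add(2 * i);
        });
        pool.execute(() -> {
            for (int i = 0; i < perAnnotator; i++) sharedIndex.add(2 * i + 1);
        });

        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("annotators did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sharedIndex.size();
    }

    public static void main(String[] args) {
        System.out.println(runAnnotators(10_000)); // prints 20000
    }
}
```

No additions are lost, because the collection itself handles the
concurrent writes.
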
Functionality not yet supported:
- Binary serialization/deserialization
- C/C++ framework (requires binary serialization)
- "Delta" CAS related functionality, including CAS markers
- Index "auto protection" (recent 2.7 feature)
- Snapshot iterators currently return regular iterators (but all
iterators are safe to use concurrently with modification)
- Multiple classloaders haven't been tested
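
On the iterator point above: the JDK concurrent collections provide
weakly consistent iterators, which is presumably what makes this
possible. A minimal sketch, assuming a ConcurrentSkipListSet-backed
index (the class and method names are hypothetical):

```java
import java.util.concurrent.ConcurrentSkipListSet;

public class WeaklyConsistentIterationSketch {

    /** Removes every even element while iterating the same set; returns how many remain. */
    static int removeEvensWhileIterating(int n) {
        ConcurrentSkipListSet<Integer> index = new ConcurrentSkipListSet<>();
        for (int i = 0; i < n; i++) {
            index.add(i);
        }
        for (Integer fs : index) {
            if (fs % 2 == 0) {
                // Modifying during iteration is allowed here; a java.util.TreeSet
                // would typically throw ConcurrentModificationException instead.
                index.remove(fs);
            }
        }
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println(removeEvensWhileIterating(10)); // prints 5
    }
}
```

A true snapshot iterator would additionally guarantee that the
iteration never sees later changes, which a weakly consistent iterator
does not promise.
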
There are also various other small loose ends and cleanup to do.
I was hoping to see if there's interest from the community in taking
this further, maybe even as a replacement for the current impl in a
future version of uima-core.
I'm not sure of the best way to share the code, but it would be great
to have a branch in the shared SCM repo where the current prototype
could be reviewed and collaboratively evolved to fill the remaining
gaps.
Would welcome any comments or questions!
Thanks,
Nick