Hi Nick, sounds like a very interesting direction you are moving in here.
From my point of view, it would be nice if the UIMA framework could be configured to produce either this new kind of CAS or the old one without having to exchange a JAR, either statically at initialization time or even dynamically at runtime. That would, e.g., make it easy to run test cases against both implementations.

A branch doesn't sound as attractive to me, as I think it increases the risk of making changes specific to this new kind of CAS that are incompatible with the old one.

Having to recompile the JCas classes is a bit of a blocker for me, but I remember that Marshall was contemplating a way to generate JCas classes at runtime, so this might just be a temporary blocker.

Also, in my context, we tend to rely quite heavily on binary serialization in all its forms, from the CasCompleteSerializer up to the recent binary forms (specifically form 6). In one context, we also rely heavily on CAS addresses serving as unique identifiers of feature structures in the CAS. Does your implementation provide any stable feature structure IDs, preferably ones that are part of the system and not actually declared as features?

Cheers,

-- Richard

On 01.04.2015, at 08:03, Nick Hill <[email protected]> wrote:

> Hi all, I work with Marshall and Eddie and have been using UIMA for some time
> but am new to the mailing list.
>
> As an experiment, I re-implemented the (Java) CAS internals such that each
> feature structure corresponds to a single Java object instead of using the
> custom "heaps" (monolithic arrays), and indices are built from standard Java
> SDK (concurrent) collection classes.
>
> The original motivation was to make the CAS thread-safe, but I think there are
> other benefits, the biggest of which may be reduction/simplification of the
> codebase.
>
> This new impl should be fully compatible with all of the existing CAS APIs,
> with a few exceptions (see below), i.e. in most cases it can be a drop-in
> replacement for uima-core.jar. Existing JCas cover classes can be used but
> must be recompiled. I also included a "compatibility layer" for the low-level
> CAS API so that existing usage of it should still work, but removing the
> heaps of course obviates the need for it.
>
> Summary of advantages:
> - Drastic simplification of code - most proprietary data structure impls
>   removed, many other classes removed, index/index repo impls are about 25% of
>   the size of the heap versions (good for future enhancements/maintainability)
> - Thread safety - multiple logically independent annotators can work on the
>   same CAS concurrently - reading, writing and iterating over feature
>   structures. Opens up a lot of parallelism possibilities
> - No need for heap resizing or wasted space in fixed-size CAS backing arrays,
>   no large up-front memory cost for CASes - pooling them should no longer be
>   necessary
> - Unlike the current heap impl, when an FS is removed from CAS indices its
>   space is actually freed (can be GC'd)
> - Unification of CAS and JCas - cover class instance (if it exists) "is" the
>   feature structure
> - Significantly better performance (speed) for many use cases, especially
>   where there is heavy access of CAS data
> - Usage of standard Java data structure classes means it can benefit more
>   "for free" from ongoing improvements in the Java SDK and from hardware
>   optimizations targeted at these classes
>
> Functionality not yet supported:
> - Binary serialization/deserialization
> - C/C++ framework (requires binary serialization)
> - "Delta" CAS related function including CAS markers
> - Index "auto protection" (recent 2.7 feature)
> - Snapshot iterators currently return regular iterators (but all iterators
>   are safe to use concurrently with modification)
> - Multiple classloaders haven't been tested
>
> There are also various other small loose ends and cleanup to do.
>
> I was hoping to see if there's interest from the community in taking this
> further, maybe even as a replacement for the current impl in a future version
> of uima-core.
>
> I'm not sure of the best way to share the code, but it would be great to have
> a branch in the shared SCM repo where the current prototype could be reviewed
> and collaboratively evolved to fill the remaining gaps.
>
> Would welcome any comments or questions!
>
> Thanks,
> Nick
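
To make the question about stable IDs concrete, here is a minimal Java sketch of the kind of design described in the quoted message: one plain Java object per feature structure, an index built on a standard concurrent collection, and a system-assigned ID that is not a declared feature. Every name in it (SketchCas, Annotation, create, idsInOrder) is hypothetical; none of it is UIMA API or taken from Nick's prototype, and the sort order is chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names are UIMA API.
final class SketchCas {

    // One plain Java object per feature structure instead of slots in a heap array.
    static final class Annotation {
        final int id;     // system-assigned, not a declared feature
        final int begin;
        final int end;

        Annotation(int id, int begin, int end) {
            this.id = id;
            this.begin = begin;
            this.end = end;
        }
    }

    // Source of IDs that stay stable for the lifetime of this SketchCas.
    private final AtomicInteger nextId = new AtomicInteger(1);

    // Sorted index on a standard concurrent collection. The ordering
    // (begin ascending, end descending, id as tie-breaker) is an assumption
    // made for this example.
    private final ConcurrentSkipListSet<Annotation> index =
        new ConcurrentSkipListSet<>(
            Comparator.comparingInt((Annotation a) -> a.begin)
                .thenComparing(Comparator.comparingInt((Annotation a) -> a.end).reversed())
                .thenComparingInt(a -> a.id));

    // Creates a feature structure, assigns its ID and adds it to the index.
    Annotation create(int begin, int end) {
        Annotation a = new Annotation(nextId.getAndIncrement(), begin, end);
        index.add(a);
        return a;
    }

    // Once removed and no longer referenced, the object can be garbage collected.
    boolean remove(Annotation a) {
        return index.remove(a);
    }

    int size() {
        return index.size();
    }

    // Skip-list iterators are weakly consistent, so iterating while other
    // threads modify the index does not throw ConcurrentModificationException.
    List<Integer> idsInOrder() {
        List<Integer> ids = new ArrayList<>();
        for (Annotation a : index) {
            ids.add(a.id);
        }
        return ids;
    }

    public static void main(String[] args) throws InterruptedException {
        SketchCas cas = new SketchCas();
        // Two writers add feature structures to the same index concurrently.
        Runnable writer = () -> {
            for (int i = 0; i < 1000; i++) {
                cas.create(i, i + 1);
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(cas.size());
    }
}
```

In this sketch the ID comes from a counter rather than from a position in a heap array, so it stays valid after other feature structures are removed. Whether such an ID could also survive serialization is exactly the open question raised above and is not addressed here.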
