Re: Alternate CAS implementation

Nick Hill Wed, 01 Apr 2015 16:30:46 -0700

Thanks Richard, have put some responses inline below.



Quoting Richard Eckart de Castilho <[email protected]>:

Hi Nick,

sounds like a very interesting direction you are moving in here.
From my point of view, it would be nice if it was possible to configure the UIMA framework to produce either this new kind of CAS or the old one without having to exchange a JAR - doing so statically at initialization time or even dynamically at runtime. E.g. to allow easily running test cases against both implementations.

When you say "produce", there shouldn't be any visible difference in anything output or persisted, the impl is just how the CAS is stored internally in memory while processing is happening.

It won't be possible to switch the impl being used at runtime. There are classes for example with the same names but different impls (e.g. CASImpl). I know this isn't ideal for tests/comparisons between the two impls but quite a lot of things are currently tightly-coupled to the heap internals and so switching a jar doesn't seem too big a price to pay given no other code changes are needed.

A branch doesn't sound as attractive to me as I think it increases the risk of making changes specific to this new kind of CAS that are incompatible with the old one.

It's really not a new kind of CAS. Again "externally" there shouldn't be visible differences and so the only kind of changes that would apply here would be to public interfaces, for example some new method in the CAS interface - it would need to be implemented by the new impl too while both existed.

My branch suggestion was more just to ask what might be a good way to share the code for others to look at, try out, and contribute to.

Having to recompile the JCas classes is a bit of a blocker to me - but I remember that Marshall was contemplating a way to generate JCas classes at runtime, so this might just be a temporary blocker.

When I say recompile, I don't mean regenerate using JCasGen, just recompile .class files from the existing jcas .java files. I would expect that you would typically only be using one version (other than for comparison purposes - to validate functional equivalence and/or compare performance), and so this isn't something that would need to be done often.

Also, in my context, we tend to rely quite heavily on binary serialization - all kinds thereof, starting with the CasCompleteSerializer up to the recent binary forms (specifically 6).

I didn't mean to suggest that it wouldn't be possible to support this, just that the work hasn't been done yet. As mentioned this started really just as an experiment. I'd expect that a simpler binary format would be possible that wasn't tied to the specific heap-based impl and maybe closer in impl to how java's native serialization works, but of course we would want to support the existing formats too.

In one context, we also rely heavily on CAS addresses serving as unique identifiers of feature structures in the CAS. Does your implementation provide any stable feature structure IDs, preferably ones that are part of the system and not actually declared as features?

Yes, there are various cases where an 'equivalent' of an FS address is required (for example if the LL API is being used). In this case the id gets allocated on the fly and will subsequently be unique to that FS within the CAS. In many cases an FS might never have such an ID allocated (it's not really part of the non-LL "public" APIs), but you can always 'request' one.
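To make the "allocated on the fly" idea concrete, here is a minimal sketch of lazy ID allocation. It is not the prototype's code: SketchCas, SketchFs, allocateId, hasId and getId are all hypothetical names, and the use of an atomic counter per CAS is an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the CAS: it only hands out IDs on request.
final class SketchCas {
    private final AtomicInteger nextId = new AtomicInteger(1);

    int allocateId() {
        return nextId.getAndIncrement();
    }
}

// Hypothetical stand-in for a feature structure held as a plain Java object.
final class SketchFs {
    private final SketchCas cas;
    // 0 means "no ID requested yet"; most objects may stay in this state forever.
    private final AtomicInteger id = new AtomicInteger(0);

    SketchFs(SketchCas cas) {
        this.cas = cas;
    }

    boolean hasId() {
        return id.get() != 0;
    }

    // Allocates an ID on first request; later calls return the same value.
    int getId() {
        int current = id.get();
        if (current != 0) {
            return current;
        }
        int candidate = cas.allocateId();
        // If another thread won the race, keep its ID; the candidate is simply skipped.
        return id.compareAndSet(0, candidate) ? candidate : id.get();
    }

    public static void main(String[] args) {
        SketchCas cas = new SketchCas();
        SketchFs fs = new SketchFs(cas);
        System.out.println(fs.hasId());               // no ID until one is requested
        System.out.println(fs.getId() == fs.getId()); // stable once allocated
    }
}
```

In this sketch the ID is stable for the lifetime of the object within one CAS, but the values can have gaps and say nothing about memory layout, unlike a heap address.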

Cheers,

-- Richard

On 01.04.2015, at 08:03, Nick Hill <[email protected]> wrote:
Hi all, I work with Marshall and Eddie and have been using UIMA for some time but am new to the mailing list.
As an experiment, I re-implemented the (java) CAS internals such that each feature structure corresponds to a single java object instead of using the custom "heaps" (monolithic arrays), and indices are built from standard java SDK (concurrent) collection classes.
The original motivation was to make the CAS threadsafe but I think there are other benefits, the biggest of which may be reduction/simplification of the codebase.
This new impl should be fully compatible with all of the existing CAS APIs, with a few exceptions (see below). i.e. in most cases it can be a drop-in replacement for uima-core.jar. Existing JCas cover classes can be used but must be recompiled. I also included a "compatibility layer" for the low level CAS API so that existing usage of it should still work, but removing the heaps of course obviates the need for it.
Summary of advantages:
- Drastic simplification of code - most proprietary data structure impls removed, many other classes removed, index/index repo impls are about 25% of the size of the heap versions (good for future enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work on the same CAS concurrently - reading, writing and iterating over feature structures. Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS backing arrays, no large up-front memory cost for CASes - pooling them should no longer be necessary
- Unlike the current heap impl, when a FS is removed from CAS indices its space is actually freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists) "is" the feature structure
- Significantly better performance (speed) for many use-cases, especially where there is heavy access of CAS data
- Usage of standard Java data structure classes means it can benefit more "for free" from ongoing improvements in the java SDK and from hardware optimizations targeted at these classes
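As a rough illustration of the object-per-FS and concurrent-index points above, the following sketch builds a sorted index on java.util.concurrent.ConcurrentSkipListSet. SpanFs, SpanIndex and fillFromTwoThreads are hypothetical names, not classes from the prototype, and the ordering (begin ascending, longer spans first, then insertion sequence as a tie-breaker) is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical feature structure: one plain Java object per annotation-like span.
final class SpanFs {
    private static final AtomicLong SEQ = new AtomicLong();
    final int begin;
    final int end;
    // Tie-breaker so two spans with equal offsets can both live in the index.
    final long seq = SEQ.incrementAndGet();

    SpanFs(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical sorted index backed by a standard concurrent collection.
final class SpanIndex {
    private static final Comparator<SpanFs> ORDER =
        Comparator.comparingInt((SpanFs fs) -> fs.begin)
            .thenComparing(Comparator.comparingInt((SpanFs fs) -> fs.end).reversed())
            .thenComparingLong(fs -> fs.seq);

    private final ConcurrentSkipListSet<SpanFs> entries = new ConcurrentSkipListSet<>(ORDER);

    void add(SpanFs fs) { entries.add(fs); }

    // After removal the index holds no reference, so the object can be GC'd
    // once nothing else refers to it.
    boolean remove(SpanFs fs) { return entries.remove(fs); }

    int size() { return entries.size(); }

    List<SpanFs> inOrder() { return new ArrayList<>(entries); }

    // Two writers add to the same index at once, with no external locking.
    static SpanIndex fillFromTwoThreads(int perThread) {
        SpanIndex index = new SpanIndex();
        Runnable writer = () -> {
            for (int i = 0; i < perThread; i++) {
                index.add(new SpanFs(i, i + 1));
            }
        };
        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(fillFromTwoThreads(1000).size());
    }
}
```

There is no backing array to size up front or grow later in a structure like this; memory use follows the number of live objects.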
Functionality not yet supported:
- Binary serialization/deserialization
- C/C++ framework (requires binary serialization)
- "Delta" CAS related function including CAS markers
- Index "auto protection" (recent 2.7 feature)
- Snapshot iterators currently return regular iterators (but all iterators are safe to use concurrently with modification)
- Multiple classloaders haven't been tested
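On the snapshot iterator item, the difference can be sketched with a plain JDK collection. This is only an illustration of iterator behaviour in java.util.concurrent, under the assumption that the indices behave like a ConcurrentSkipListSet; IteratorSketch and its methods are hypothetical names, and copying into a list is just one possible way to get a snapshot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

final class IteratorSketch {

    // Regular iteration over a concurrent set is weakly consistent: removing
    // elements while iterating does not throw ConcurrentModificationException.
    static int removeEvensWhileIterating(ConcurrentSkipListSet<Integer> index) {
        int visited = 0;
        for (Integer value : index) {
            visited++;
            if (value % 2 == 0) {
                index.remove(value);
            }
        }
        return visited;
    }

    // A snapshot is a point-in-time copy: later changes to the index are not seen.
    static List<Integer> snapshot(ConcurrentSkipListSet<Integer> index) {
        return new ArrayList<>(index);
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> index = new ConcurrentSkipListSet<>();
        for (int i = 0; i < 10; i++) {
            index.add(i);
        }
        List<Integer> before = snapshot(index);
        removeEvensWhileIterating(index);
        System.out.println(before.size() + " in snapshot, " + index.size() + " left in index");
    }
}
```

A weakly consistent iterator may or may not reflect changes made after it was created, which is why a true snapshot needs a copy (or some equivalent) taken at a known point.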

There are also various other small loose ends and cleanup to do.
I was hoping to see if there's interest from the community in taking this further, maybe even as a replacement for the current impl in a future version of uima-core.
I'm not sure of the best way to share the code, but it would be great to have a branch in the shared SCM repo where the current prototype could be reviewed and collaboratively evolved to fill the remaining gaps.
Would welcome any comments or questions!

Thanks,
Nick
