Re: Alternate CAS implementation

Nick Hill Thu, 02 Apr 2015 12:49:53 -0700

Thanks Richard, more replies below...

Quoting Richard Eckart de Castilho <[email protected]>:

Hi Nick,

On 02.04.2015, at 01:37, Nick Hill <[email protected]> wrote:
From my point of view, it would be nice if it was possible to configure the UIMA framework to produce either this new kind of CAS or the old one without having to exchange a JAR - doing so statically at initialization time or even dynamically at runtime. E.g. to allow easily running test cases against both implementations.
When you say "produce", there shouldn't be any visible difference in anything output or persisted, the impl is just how the CAS is stored internally in memory while processing is happening.
It won't be possible to switch the impl being used at runtime. There are classes for example with the same names but different impls (e.g. CASImpl). I know this isn't ideal for tests/comparisons between the two impls but quite a lot of things are currently tightly-coupled to the heap internals and so switching a jar doesn't seem too big a price to pay given no other code changes are needed.
What do you plan to be the ultimate goal of this experiment? Is it to support different CAS implementations or is it to replace the existing CAS implementation with a totally different one?
Most things in UIMA are created through factories (not the CAS so far). So theoretically, one could replace most classes by custom classes by reconfiguring the framework to use different factory classes or having the factories produce different implementations. Can you imagine that as well for the CAS?

For users the implementation shouldn't matter. They shouldn't observe any functional difference and therefore shouldn't really care if the impl changes underneath. All consuming code should work as-is, with the exception of code which accesses 'internals' directly - but I'd see this as analogous to accessing private fields in some java SDK class, which breaks when those fields change in a newer SDK version.

As such I don't think it would make sense (or be very practical from a maintenance pov) to support two implementations concurrently or to have a factory.

Does it mean that the UIMA-C++ implementation is going to be discontinued officially?

No, just to clarify no agreements or plans have been made. I just wanted to initiate a discussion around this as a possible idea. If we were to pursue this alternate implementation, I don't know of any reason why the C++ impl would be discontinued. I had just listed C++ AEs as one of the things which don't yet work with my current prototype.

Having to recompile the JCas classes is a bit of a blocker to me - but I remember that Marshall was contemplating about a way to generate JCas classes at runtime, so this might just be a temporary blocker.
When I say recompile, I don't mean regenerate using JCasGen, just recompile .class files from the existing jcas .java files. I would expect that you would typically only be using one version (other than for comparison purposes - to validate functional equivalence and/or compare performance), and so this isn't something that would need to be done often.
Compiled JCas classes tend to be shipped as part of frameworks. This means that it will not be possible to switch to a new CAS impl just by replacing a JAR. It will also mean that components from different UIMA-based frameworks cannot be mixed and matched anymore unless some broker like UIMA-AS is used.

The current JCas cover class format is quite old and tightly-coupled to the heap-based CAS internals. Saying that all new versions of UIMA must be binary-compatible with these therefore imposes a (somewhat crippling) restriction on possible internal improvements. You might say that the current JCas classes break standard abstraction/encapsulation principles if the expectation is they will be forever forwards binary-compatible.

It would not be hard on the UIMA side to move to a simpler and more abstract JCas cover class format that should avoid this problem in future, but the actual move to such a format would be even more disruptive than requiring a recompilation (would require a re-JCasGen), and would have the same issues you mention above.

I managed to make this object-based impl at least source-compatible with existing jcas cover classes, by 'converting' the impl of methods called that were intended to make CAS heap changes to actually be manipulating the FS objects directly.

In one context, we also rely heavily on CAS addresses serving as unique identifiers of feature structures in the CAS. Does your implementation provide any stable feature structure IDs, preferably ones that are part of the system and not actually declared as features?
Yes, there are various cases where an 'equivalent' of an FS address is required (for example if the LL API is being used). In this case the id gets allocated on the fly and will subsequently be unique to that FS within the CAS. In many cases an FS might never have such an ID allocated (it's not really part of the non-LL "public" APIs), but you can always 'request' one.
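
To illustrate the on-the-fly allocation described above, here is a minimal sketch. It is not code from the prototype: SketchFs, SketchCas, idOf and lookup are hypothetical names invented for this example, and the single lock is just the simplest way to keep the sketch correct.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; these are not the prototype's actual classes.
final class SketchFs {
    // 0 means no ID has been requested for this feature structure yet.
    // Only read or written while holding the owning SketchCas lock.
    int id = 0;
}

final class SketchCas {
    private int nextId = 1;

    // Assumption: strong references keep the sketch short, but they pin every FS
    // that has been given an ID. A real implementation would need to avoid that
    // so that removed feature structures can still be garbage collected.
    private final Map<Integer, SketchFs> byId = new HashMap<>();

    /** Returns the ID of the given FS, allocating one on the first request. */
    synchronized int idOf(SketchFs fs) {
        if (fs.id == 0) {
            fs.id = nextId++;
            byId.put(fs.id, fs);
        }
        return fs.id;
    }

    /** Resolves an ID back to its FS, or null if that ID was never handed out. */
    synchronized SketchFs lookup(int id) {
        return byId.get(id);
    }
}
```

An FS that nobody asks about never gets an ID; once requested, the ID stays the same for that FS within its CAS.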
I imagine that IDs would be necessary to implement stuff like delta-CAS later on too.
Are any of the changes so far in any way related to potentially allowing additions to the type system at runtime?

Not directly related; my goal was just to make the implementation functionally equivalent but threadsafe (and simpler, faster). But it's possible (not certain) this new impl may impose fewer barriers to enabling such capability.

What would be the incentive/benefit for the developer of a UIMA-based framework/applications or for the users of such frameworks/applications to switch to the new implementation?

That was the "summary of advantages" I had in the original email, I've included it again below. The primary "external" benefits I think are the CAS being thread-safe and faster to manipulate. I understand that many users/developers might not care about these things, just as they likely wouldn't care about the code footprint or complexity of the internals, but it also shouldn't adversely impact them to "upgrade" to a new UIMA version based on this implementation.

I feel that not being able to have more than one thread work on a CAS at the same time is a major limitation, especially given modern systems typically have many CPU cores.

- Drastic simplification of code - most proprietary data structure impls removed, many other classes removed, index/index repo impls are about 25% of the size of the heap versions (good for future enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work on the same CAS concurrently - reading, writing and iterating over feature structures. Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS backing arrays, no large up-front memory cost for CASes - pooling them should no longer be necessary
- Unlike the current heap impl, when a FS is removed from CAS indices its space is actually freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists) "is" the feature structure
- Significantly better performance (speed) for many use-cases, especially where there is heavy access of CAS data
- Usage of standard Java data structure classes means it can benefit more "for free" from ongoing improvements in the java SDK and from hardware optimizations targeted at these classes
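
As a rough illustration of the thread-safety and "standard Java data structure" points in the list above, the following sketch shows two independent "annotators" adding to one shared, sorted index at the same time. It is only a sketch under assumptions: SketchAnnotation, SketchIndex and SketchDemo are made-up names, and backing the index with java.util.concurrent.ConcurrentSkipListSet is my choice for the example, not a statement about what the prototype uses.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for an annotation-like feature structure held as a plain object.
final class SketchAnnotation {
    final String type;
    final int begin;
    final int end;

    SketchAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical sorted index built on a standard concurrent collection.
final class SketchIndex {
    // Simplification: entries that compare equal (same begin, end and type)
    // are treated as duplicates by the set.
    private static final Comparator<SketchAnnotation> ORDER =
            Comparator.comparingInt((SketchAnnotation a) -> a.begin)
                    .thenComparingInt(a -> a.end)
                    .thenComparing(a -> a.type);

    private final ConcurrentSkipListSet<SketchAnnotation> entries =
            new ConcurrentSkipListSet<>(ORDER);

    void add(SketchAnnotation a) { entries.add(a); }

    // Once removed here and no longer referenced elsewhere, the object can be GC'd.
    boolean remove(SketchAnnotation a) { return entries.remove(a); }

    int size() { return entries.size(); }

    // Iteration is weakly consistent, so it is safe while other threads add or remove.
    Iterable<SketchAnnotation> all() { return entries; }
}

final class SketchDemo {
    /** Runs two writers against one shared index and returns it once both are done. */
    static SketchIndex runConcurrently() {
        SketchIndex index = new SketchIndex();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> {
            for (int i = 0; i < 1000; i++) {
                index.add(new SketchAnnotation("Token", i, i + 1));
            }
        });
        pool.submit(() -> {
            for (int i = 0; i < 1000; i += 10) {
                index.add(new SketchAnnotation("Sentence", i, i + 10));
            }
        });
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently().size()); // prints 1100
    }
}
```

Neither writer needs to know about the other, and no explicit locking appears in the annotator code; the collection takes care of it.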


Cheers,

-- Richard
