Re: Alternate CAS implementation

Nick Hill Sun, 05 Apr 2015 16:57:06 -0700

First I just want to emphasize that this proposal doesn't necessarily have full endorsement yet from Marshall, Eddie, et al., so I won't comment on the strategic roadmap questions. I'm just attempting to make the case for a direction which to me seems very natural.

Logically UIMA is an object graph with some fixed type system, rooted in one or more collections with different ordering/uniqueness rules. These are all things Java provides out of the box, with a highly evolved heap/GC engine and numerous powerful SDK collection classes. Thus I would argue the simplest and most "obvious" way to implement the UIMA specification would be a thin layer on top of these existing constructs, and there would need to be quite a compelling reason to deviate significantly from this.
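
To make the "thin layer" idea concrete, here is a minimal sketch of a feature structure as a plain Java object held in a sorted SDK collection. It is only an illustration under stated assumptions: ObjectGraphCasSketch, Annotation, View, OFFSET_ORDER, addToIndexes and labelsInOrder are hypothetical names made up for this example, not UIMA APIs and not the code being proposed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: none of these names are UIMA APIs.
final class ObjectGraphCasSketch {

    // A feature structure as an ordinary Java object, managed by the JVM's GC.
    static final class Annotation {
        final int begin;
        final int end;
        final String label;

        Annotation(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    // Ordering rule for a sorted index: begin ascending, then longer spans
    // first, then label as a tie-breaker.
    static final Comparator<Annotation> OFFSET_ORDER = (a, b) -> {
        if (a.begin != b.begin) return Integer.compare(a.begin, b.begin);
        if (a.end != b.end) return Integer.compare(b.end, a.end);
        return a.label.compareTo(b.label);
    };

    // A view rooted in a standard SDK collection instead of a custom heap.
    static final class View {
        private final TreeSet<Annotation> sortedIndex = new TreeSet<>(OFFSET_ORDER);

        void addToIndexes(Annotation a) {
            sortedIndex.add(a);
        }

        List<String> labelsInOrder() {
            List<String> labels = new ArrayList<>();
            for (Annotation a : sortedIndex) {
                labels.add(a.label);
            }
            return labels;
        }
    }

    public static void main(String[] args) {
        View view = new View();
        view.addToIndexes(new Annotation(6, 11, "tok2"));
        view.addToIndexes(new Annotation(0, 5, "tok1"));
        view.addToIndexes(new Annotation(0, 20, "sentence"));
        System.out.println(view.labelsInOrder()); // [sentence, tok1, tok2]
    }
}
```

In this sketch the ordering and uniqueness rules live entirely in the comparator, and a TreeSet treats elements that compare as equal as duplicates. A different rule set would simply mean a different comparator or collection class.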

I completely understand that in the past such reasons existed, but it's now 10 years later and JVMs and hardware have moved on considerably. I'm fairly certain that those reasons are no longer valid today, and yet we are still paying the cost of significant complexity for something with more limitations than we would have otherwise. Carving out static chunks of heap and doing a proprietary form of "memory management" within them prevents the JVM from optimizing how/where this data is stored and collected. This is very much working against how the JVM was designed to be used.

The standard UIMA (non-JCas) CAS APIs allow type-system-agnostic CAS manipulation, so my understanding is that the only reason for performing low-level or direct heap access is to get better performance out of the current array-based impl. I was assuming such usage would be rare among users of UIMA in general, but I could be wrong, and it's useful to know people out there like you are doing it. I'd argue though that the fact that it is even necessary is another reason for changing the approach (given that a goal of UIMA is to minimize the effort required by NLP developers). Could you elaborate on your usage of internal APIs?

Regarding serialization formats, each of them is just a well-defined serial representation of a CAS, so should not be affected by the runtime implementation. I do understand that the current binary formats derive from the CAS array internals, but it doesn't mean that this new impl couldn't read/write that same format. I expect here specifically there may be a relative performance impact because of the 'reconstruction' of the heaps that would be needed, however:
- In a way, keeping the CAS in this form in memory could be seen as optimizing for speed of this specific binary format at the expense of slower and less flexible runtime CAS access
- Alternative binary serialization mechanisms (and formats) could also be used, similar to standard Java object serialization, which I expect would be just as fast (although not default Java serialization, which is very inefficient)
- I'd question in any case whether this alone should dictate the overall architecture choice
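
As a rough illustration of a serial format being independent of the in-memory representation, the sketch below writes a list of plain objects to a compact byte layout and reads it back. The layout (a count, then begin, end and label per entry) and the names SerialFormatSketch, Span, write and read are assumptions invented for this example. It is not one of the existing UIMA binary formats.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an invented byte layout, not a UIMA format.
final class SerialFormatSketch {

    // In-memory form: an ordinary object, unrelated to the byte layout.
    static final class Span {
        final int begin;
        final int end;
        final String label;

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    // Serialize: entry count, then (begin, end, label) for each span.
    static byte[] write(List<Span> spans) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(spans.size());
            for (Span s : spans) {
                out.writeInt(s.begin);
                out.writeInt(s.end);
                out.writeUTF(s.label);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialize: rebuild the object form from the same layout.
    static List<Span> read(byte[] data) {
        List<Span> spans = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int begin = in.readInt();
                int end = in.readInt();
                String label = in.readUTF();
                spans.add(new Span(begin, end, label));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return spans;
    }

    public static void main(String[] args) {
        List<Span> original = new ArrayList<>();
        original.add(new Span(0, 5, "tok"));
        List<Span> copy = read(write(original));
        System.out.println(copy.get(0).begin + "-" + copy.get(0).end + " " + copy.get(0).label);
    }
}
```

The reader and writer are the only code that knows the byte layout, so the runtime object model could change without changing the bytes on the wire, and the other way round.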

It was my impression in the past that UIMA-Core has always valued compatibility very highly, even to the point of adding switches to re-enable buggy/undesired behavior in case somebody depended on it.

I understand this, and I think I managed to keep things functionally identical. I'm not proposing any change in behaviour.

Changing the implementation of the CAS is probably the most radical idea I've seen so far in this project.

It might be radical in terms of the implementation change, but again I would argue it's really just a simplification of the internals. It shouldn't be radical at all for users of UIMA in general.

Are we going to slay the holy cow of compatibility now, and if yes, at which levels?

Even for this change, the source incompatibility only applies to JCas cover classes, and only to those because of their current implementation-specific format.

What does such a change mean to the various sub-projects (DUCC, UIMA-AS, RUTA, uimaFIT)?

As long as these projects don't directly manipulate the CAS arrays, there should be zero impact on them apart from, I would hope, some performance benefits. It would also mean that in future they could exploit the thread-safe nature of the CAS for various purposes.


Regards,
Nick

Quoting Richard Eckart de Castilho <[email protected]>:

On 03.04.2015, at 22:51, Marshall Schor <[email protected]> wrote:
It may be good to open a "Brainstorming" Jira, and attach the code you're
thinking of donating, so that people could study it and have a more concrete
idea about this.
If it eventually gets accepted, we would also need a Software Grant for this, I
think, due to the size.
It was my impression in the past that UIMA-Core has always valued compatibility very highly, even to the point of adding switches to re-enable buggy/undesired behavior in case somebody depended on it. Changing the implementation of the CAS is probably the most radical idea I've seen so far in this project. In principle, I very much like seeing UIMA evolve, but I do wonder how such a radical change is imagined to be undertaken.
I'm aware that there are various levels of compatibility. My impression so far was that source-compatibility was typically not sufficient in the past.
Are we going to slay the holy cow of compatibility now, and if yes, at which levels?
Is there some willingness now to consider setting up a road-map for a UIMA-Core version 3?
What does such a change mean to the various sub-projects (DUCC, UIMA-AS, RUTA, uimaFIT)?
Personally, I'd be curious to see how much of e.g. DKPro Core or WebAnno breaks with such a new implementation. I imagine quite a lot, since I've become quite fond of binary serialization and internal API usage lately (in some cases I might be able to switch to the official low-level CAS API...). Although I'm very much for evolution and adopting newer technologies, I'm afraid testing this (and potentially fixing stuff) will be quite time-intensive. Given that in my context most of the benefits are not very relevant so far, such testing would only make sense to me if it was part of a larger strategic change, and I think that a properly licensed contribution would be pretty much a prerequisite to even look at it in detail.
Marshall, Eddie, and Nick, do you have some vision of a strategic UIMA roadmap that you can share with us?
Cheers,

-- Richard
