Re: Alternate CAS implementation

Nick Hill Tue, 07 Apr 2015 15:20:57 -0700

Of course, I'm in the same boat! :-)

Regards,
Nick


Quoting Marshall Schor <[email protected]>:

OK, thanks! Please be patient - volunteers at work (who have other "day"
jobs :-) )

-Marshall

On 4/7/2015 4:18 PM, Nick Hill wrote:

Per Marshall's suggestion I've created a "brainstorming" jira issue and
attached the current prototype code:

https://issues.apache.org/jira/browse/UIMA-4329

Regards,
Nick

Quoting Peter Klügl <[email protected]>:

Hi Nick,

I am (of course) also interested in the alternate CAS implementation.

I agree with Marshall that the code should be attached to a Jira issue so
that we can take a closer look and investigate its impact on the various
tools and libraries (in my case UIMA Ruta).

Best,

Peter

Am 06.04.2015 um 02:04 schrieb Nick Hill:

First I just want to emphasize that this proposal doesn't necessarily have
full endorsement yet from Marshall, Eddie, et al., so I won't comment on the
strategic roadmap questions. I'm just attempting to make the case for a
direction which to me seems very natural.

Logically UIMA is an object graph with some fixed type system, rooted in one
or more collections with different ordering/uniqueness rules. These are all
things Java provides out of the box, with a highly evolved heap/GC engine
and numerous powerful SDK collection classes. Thus I would argue the
simplest and most "obvious" way to implement the UIMA specification would be
a thin layer on top of these existing constructs, and there would need to be
quite a compelling reason to deviate significantly from this.
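
To make the "thin layer" idea concrete, here is a minimal sketch. It is not
the prototype attached to the Jira issue and it does not use any UIMA API:
the names SketchAnnotation and SketchCas, and the choice of ordering rule,
are assumptions made purely for illustration. It only shows feature
structures as plain Java objects, with one sorted index and one
insertion-ordered set index backed by standard JDK collections.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical feature structure: an ordinary heap object, not a slot in an int array. */
class SketchAnnotation {
    final String type;
    final int begin;
    final int end;

    SketchAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

/** Hypothetical CAS-like container whose indexes are plain JDK collections. */
class SketchCas {
    // Assumed ordering rule: begin ascending, then longer spans first, then type name.
    // Entries that compare as equal collapse into one, since this index is a sorted set.
    private static final Comparator<SketchAnnotation> BY_OFFSETS = (a, b) -> {
        if (a.begin != b.begin) return Integer.compare(a.begin, b.begin);
        if (a.end != b.end) return Integer.compare(b.end, a.end);
        return a.type.compareTo(b.type);
    };

    private final TreeSet<SketchAnnotation> sortedIndex = new TreeSet<>(BY_OFFSETS);
    // Identity-based uniqueness, iteration in insertion order.
    private final Set<SketchAnnotation> setIndex = new LinkedHashSet<>();

    void addToIndexes(SketchAnnotation a) {
        sortedIndex.add(a);
        setIndex.add(a);
    }

    void removeFromIndexes(SketchAnnotation a) {
        sortedIndex.remove(a);
        setIndex.remove(a);
        // Once nothing else references 'a', the JVM garbage collector reclaims it.
    }

    /** Returns all indexed annotations of the given type, in sorted-index order. */
    List<SketchAnnotation> select(String type) {
        List<SketchAnnotation> result = new ArrayList<>();
        for (SketchAnnotation a : sortedIndex) {
            if (a.type.equals(type)) result.add(a);
        }
        return result;
    }

    int size() {
        return setIndex.size();
    }
}
```

In this sketch there is no separate memory management at all: removing an
object from the indexes is enough, and storage layout and collection are left
to the JVM.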

I completely understand that in the past such reasons existed, but it's now
10 years later and JVMs and hardware have moved on considerably. I'm fairly
certain that those reasons are no longer valid today, and yet we are still
paying the cost of significant complexity for something with more
limitations than we would have otherwise. Carving out static chunks of heap
and doing a proprietary form of "memory management" within them prevents the
JVM from optimizing how/where this data is stored and collected. This is
very much working against how the JVM was designed to be used.

The standard UIMA (non-JCas) CAS APIs allow type-system-agnostic CAS
manipulation, so my understanding is that the only reason for performing
low-level or direct heap access is to get better performance out of the
current array-based impl. I was assuming such usage would be rare among
users of UIMA in general, but I could be wrong and it's useful to know
people out there like you are doing it. I'd argue, though, that the fact
that it is even necessary is another reason for changing the approach (given
that a goal of UIMA is to minimize the effort required of NLP developers).
Could you elaborate on your usage of internal APIs?

Regarding serialization formats, each of them is just a well-defined serial
representation of a CAS, so should not be affected by the runtime
implementation.
I do understand that the current binary formats derive from the CAS array
internals, but that doesn't mean this new impl couldn't read/write that same
format. I expect here specifically there may be a relative performance
impact because of the 'reconstruction' of the heaps that would be needed,
however:
- In a way, keeping the CAS in this form in memory could be seen as
  optimizing for speed of this specific binary format at the expense of
  slower and less flexible runtime CAS access
- Alternative binary serialization mechanisms (and formats) could also be
  used, similar to standard Java object serialization, which I expect would
  be just as fast (although not default Java serialization, which is very
  inefficient)
- I'd question in any case whether this alone should dictate the overall
  architecture choice
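
As an illustration of a serial format being independent of the in-memory
form, here is a minimal sketch using only java.io. The Span and SpanCodec
names and the byte layout (a count, then type name, begin and end per entry)
are made up for this example; this is not any of the existing UIMA binary
formats, and it says nothing about their relative speed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical feature structure held as an ordinary Java object. */
class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

/** Hypothetical codec: the wire layout is defined here, not by how Span is stored in memory. */
class SpanCodec {
    /** Writes a count followed by (type, begin, end) for each span. */
    static byte[] write(List<Span> spans) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(spans.size());
            for (Span s : spans) {
                out.writeUTF(s.type);
                out.writeInt(s.begin);
                out.writeInt(s.end);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds plain objects from the layout produced by write(). */
    static List<Span> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            List<Span> spans = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                String type = in.readUTF();
                int begin = in.readInt();
                int end = in.readInt();
                spans.add(new Span(type, begin, end));
            }
            return spans;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same object graph could equally be written by a second codec that emits
a different layout, which is the sense in which the format need not dictate
the runtime representation.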

It was my impression in the past that UIMA-Core has always valued
compatibility very highly, even to the point of adding switches to re-enable
buggy/undesired behavior in case somebody depended on it.


I understand this, and I think I managed to keep things functionally
identical. I'm not proposing any change in behaviour.

Changing the implementation of the CAS is probably the most radical idea
I've seen so far in this project.


It might be radical in terms of the implementation change but again I would
argue it's really just a simplification of the internals. It shouldn't be
radical at all for users of UIMA in general.

Are we going to slay the holy cow of compatibility now and if yes at which
levels?


Even for this change, the source incompatibility only applies to JCas cover
classes, and only to those because of their current implementation-specific
format.

What does such a change mean to the various sub-projects (DUCC, UIMA-AS,
RUTA, uimaFIT)?


As long as these projects don't directly manipulate the CAS arrays, there
should be zero impact on them apart from, I would hope, some performance
benefits. It would also mean that in future they could exploit the
thread-safe nature of the CAS for various purposes.

Regards,
Nick

Quoting Richard Eckart de Castilho <[email protected]>:

On 03.04.2015, at 22:51, Marshall Schor <[email protected]> wrote:

It may be good to open a "Brainstorming" Jira, and attach the code you're
thinking of donating, so that people could study it and have a more concrete
idea about this.

If it eventually gets accepted, we would also need a Software Grant for
this, I think, due to the size.


It was my impression in the past that UIMA-Core has always valued
compatibility very highly, even to the point of adding switches to re-enable
buggy/undesired behavior in case somebody depended on it. Changing the
implementation of the CAS is probably the most radical idea I've seen so
far in this project. In principle, I very much like seeing UIMA evolve,
but I do wonder how such a radical change is imagined to be undertaken.

I'm aware that there are various levels of compatibility. My impression so
far was that source-compatibility was typically not sufficient in the past.

Are we going to slay the holy cow of compatibility now and if yes at which
levels?

Is there some willingness now to consider setting up a road-map for a
UIMA-Core version 3?

What does such a change mean to the various sub-projects (DUCC, UIMA-AS,
RUTA, uimaFIT)?

Personally, I'd be curious to see how much of e.g. DKPro Core or WebAnno
breaks with such a new implementation. I imagine quite a lot since I've
become quite fond of binary serialization and internal API usage lately (in
some cases I might be able to switch to the official low-level CAS API...).
Although I'm very much for evolution and adopting newer technologies, I'm
afraid testing this (and potentially fixing stuff) will be quite
time-intensive. Given that in my context most of the benefits are not very
relevant so far, such testing would only make sense to me if it was part of
a larger strategic change - and I think that a properly licensed
contribution would be pretty much a prerequisite to even look at it in
detail.

Marshall, Eddie, and Nick do you have some vision of a strategic UIMA
roadmap that you can share with us?

Cheers,

-- Richard
