Thanks Marshall, I'm learning a lot too... And I don't mean to come across as critical of the code itself, which I've been impressed by.

I think we are viewing things from different angles. I feel that overall simplicity and alignment with "standard" Java features/practices are valuable enough to treat as a primary goal, from which deviations/additions (in terms of complexity and alternative data structures) are made as and where necessary, based on meeting realistic usage requirements.

I can absolutely understand why custom heaps would have been needed in the past, when object creation was expensive, memory was more constrained, etc. But as I've said I don't think that is the case anymore, and if/where an object-based design falls short, there are simpler adjustments that could close the gap. In particular there are many potential improvements/refinements possible to the initial prototype, including a reduction in the number of bytes per FS. Aligning with the JVM also means much more free benefit from ongoing improvements to its implementation and that of the SDK, which are all aimed at code using objects in a standard way. Java 8 includes many such improvements, including to the impl of classes leveraged in the object version.

I don't think the starting point should be to minimize the number of bytes used on the heap at all costs. Of course by this measure custom "packing" is going to score higher because of per-object allocation overheads, but it seems unlikely the magnitude of the difference would be worth the cost for typical use cases.

Similarly, in my view it would be better to take CPU cache locality into consideration in the context of a more OO-centric design rather than making it a central goal of the design. I have a feeling that even though in a vacuum specific kinds of operation would be much faster because of the cache locality, in most real-life use cases the aggregate performance gain would be more than offset by the additional cost of indirection from not using objects directly (and I think I've observed this).

The above applies primarily to the FS heap structures. With respect to indices, I think modifying the current impls to work directly with object references instead of ints is a great idea. This would make it trivial to "plug and play" different index implementations with obj-based feature structures (for example in my existing prototype). These could be selected at runtime based on the use case (e.g. via a JVM arg)... so that if thread safety was required the SDK collection-based one could be used, or if heap usage minimization was the highest priority then the specialized impl could be chosen. It would be easy to evaluate the perf of a given use case with each impl.
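For illustration, a runtime-selected index backing could be sketched roughly like this (the class name, method, and system property here are hypothetical, not existing UIMA APIs):

```java
import java.util.Comparator;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical sketch: pick the backing structure for a sorted index
// at runtime via a JVM arg, e.g. -Dindex.impl=concurrent
public final class IndexBackingFactory {

    public static <T> NavigableSet<T> createSortedIndex(Comparator<T> comparator) {
        String impl = System.getProperty("index.impl", "default");
        switch (impl) {
            case "concurrent":
                // thread-safe, lock-free sorted set from the SDK
                return new ConcurrentSkipListSet<>(comparator);
            default:
                // plain single-threaded red-black tree
                return new TreeSet<>(comparator);
        }
    }
}
```

Because both impls share the `NavigableSet` interface, consuming code is identical either way, which is what would make per-use-case comparison cheap.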

Re "int" FS forms, do you agree the only real needs for this when using an obj-based impl are the things I previously enumerated (i.e. backwards compatibility with LL APIs, serialization formats, consistent ordering of different FS's which are otherwise "equal", and correlating FS identity across serialization)? If so I think there shouldn't be much perf impact due to this. It might still be nice to allocate a unique int to every FS and store it in a field rather than allocating as-needed in a separate map as I'm doing currently. A map would still need to be maintained for the reverse lookup though (but with soft refs to the FS objs!).
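As a rough sketch of the allocate-on-demand scheme with a soft-ref reverse map (all names hypothetical; a real impl would probably want an identity-based forward map, and could drop the forward map entirely if the id lives in a field on the FS):

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: lazily allocate a unique int id per feature
// structure, keeping a reverse map whose values are soft references so
// the registry doesn't prevent FS objects from being GC'd.
public final class FsIdRegistry {
    private final AtomicInteger nextId = new AtomicInteger(1);
    private final ConcurrentHashMap<Object, Integer> ids = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Integer, SoftReference<Object>> byId =
            new ConcurrentHashMap<>();

    /** Returns the id for this FS, allocating one on first request. */
    public int idOf(Object fs) {
        return ids.computeIfAbsent(fs, f -> {
            int id = nextId.getAndIncrement();
            byId.put(id, new SoftReference<>(f));
            return id;
        });
    }

    /** Reverse lookup; null if no id was ever allocated or the FS was collected. */
    public Object lookup(int id) {
        SoftReference<Object> ref = byId.get(id);
        return ref == null ? null : ref.get();
    }
}
```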

I think this is great discussion, but many of our points (mine too) are theoretically motivated and may be somewhat speculative from a current "real life" perspective?

Would you consider at least keeping the object version in "incubation" so it can be further refined and to see if some of our speculations can be tested in practical situations (which I likely have an incomplete picture of)?

On the field-per-feature JCas idea - I do like this in principle and agree the benefits you mentioned could be significant, but the compatibility thing seems like a major inhibitor.

Regards,
Nick

Quoting Marshall Schor <[email protected]>:

Thanks for the comments, Nick. Ever since you've become involved, I've started
learning a lot more about Java high performance and parallel programming :-).

I've given some responses below.

On 4/18/2015 6:36 PM, Nick Hill wrote:
Thanks Marshall, have added some responses inline below


Quoting Marshall Schor <[email protected]>:

Re: supporting multiple implementations of the CAS.

The original implementation of the CAS picked a set of space/performance
trade-offs. Some of these were motivated by earlier frameworks built in C++, by the state-of-affairs of early Java implementations, etc. The goal was to create something that would be attractive to use by the community of people working in
the field of NLP, who were often concerned with these kinds of things.

Later, JCas was added, to make things easier for people comfortable in Java. And later still, uimaFIT added more convenience things from the world of Java and related technologies (e.g. Spring, dependency injection, running without
XML, etc.).

Still later, the platform has started paying more attention to optimizations around multi-core and L1/L2/L3 memory hierarchies, as those technologies became
much more prevalent.

All along, there was close attention paid to backwards compatibility; a main
reason was to create an "investable" platform - one where developers could
"invest" work in, and expect their work to have a long, useful life, even as the
framework might evolve to keep up with hardware and software changes.

Another part of making UIMA an attractive place to invest work in annotator
development was the possibility of first developing your annotator in an
easy-to-use paradigm, and later "optimizing" it for speed / space. This, for example, recently happened in version 2.7.0 with the release of an "upgrade" to the CasCopier. The first version was completed using normal Java APIs to the CAS, and served for several years. When some applications began noticing this was becoming a bottleneck, an optimization was done which a) greatly speeded it
up, and b) used much less Java heap space in the process, principally by
replacing CAS and index access and manipulations with their so-called
"low-level" equivalents. These low-level equivalents are there just for this reason; they typically create and use no Java objects at all, and can be much
faster.

I'm not questioning historical design choices, I understand various things
were different in the past and that it has been an evolution; rather I'm
suggesting that the design as it stands now doesn't make sense.

I'm proposing the fact that complex low level API usage is needed by
developers to get better performance is primarily an artifact of the custom
heap implementation itself.

In other words, we're providing an implementation which is more complex and
slower than a simple object-based one and then saying if you want comparable
performance you need to rewrite your code using the cumbersome low level APIs.

These are interesting points, and are driving some investigations into seeing
*why* the performance (space and time) of the object-based one is better in some cases. The results of this investigation (which hasn't been done before due to
lack of focus/interest) could result in "fixes" that make the current
implementation faster, perhaps even substantially, than the object one.
Opinions on this vary :-), but some underlying fundamentals lead me to think
this is quite possible. These include well-known aspects of space overhead for
Java objects, and the effect of compact representations on memory caching
aspects in modern computing (leading to favoring array-like data structures
versus representations based on linked lists).

One thing that was totally a surprise to me, and may yield some significant
performance benefits, is to untangle the various interfaces and classes around indexes and iterators, in a particular good (performant) way. I tried one such
untangling - taking the two concepts of "ComparablePointerIterator" (an
IntIterator which can be compared with another IntIterator) and the concept of being able to report ConcurrentModificationExceptions, and split these into two
interfaces (they were previously combined in the comparablePointerIterator
interface). This refactoring, by itself, resulted in the performance on both
our test suite and a test run on a highly parallel large-scale pipeline
dropping by 6 to 16%.  (So I of course will not do that change...).



Would it not be better if the easy-to-use paradigm was already fast enough?

Of course. The problem though, it seems to me, is that fast enough is a moving
target :-).


The CASCopier low-level rewrite example is a very good illustration - where
better performance was achieved *despite* the underlying impl, not because of
it. The obj-based impl still uses the normal Java APIs (basically the
pre-rewrite CASCopier logic), and the "as-is" raw copying speed is very close to the super-optimized low level one(*). In a practical context where copying is done in conjunction with other CAS access, the obj impl appears much faster
on aggregate.

On the backwards compatibility point - as already discussed in the thread what
I'm proposing should have minimal if any impact on existing developer
investment. It's now been tested with various apps/frameworks without code
changes being needed (uimaFIT, Ruta, hopefully DKPro once binary serialization
compatibility is there,..)

(*) This statement applies to copying with a shared typesystem. At first
glance it looks like some of the CASCopier optimization done was aimed
specifically at speeding up aspects of cross-typesystem copying which are
independent of the LL CAS aspects and I assume should be simple to transfer
over to the new impl.


I think Nick has several interesting ideas, all bundled up together in a
particular set of design choices, for an alternative CAS design.  At a high
level, I think these are:

1) having the main data storage be within individual sets of Java Objects
(multiples for each Feature Structure Instance), and letting Java manage the
space allocation and reclamation (via garbage collection) of these.

2) having the indexes index these Java structures (vs in the core: they index
offsets in the heap, represented as "ints").

3) having the index structures themselves do a different trade-off of space, time, and concurrency support. In general, the trend seems to be less concern
for space (as the cost per bit has dropped faster than the cost per
computation), and towards supporting more concurrency (as the ability to run
multiple threads in parallel has grown).

4) Making more use of "standard" (but possibly new/evolving) capabilities in
core Java, instead of doing lots of custom one-of-a-kind Java code.  In
general, I think this is a very good idea :-).  I foresee shifts in this
direction where
possible, but probably incrementally, in the core UIMA implementation.

I think these high level concepts have valuable ideas to consider augmenting
core UIMA with.  And the whole package might be, together, an interesting
design point for some users.

I don't think the ideas are so independent though. If you start with the
assumption that standard objects are a better choice than custom heaps, then
simple index implementations based on standard Java collection impls are also
a natural thing to do.

Given that, I don't really understand how the existing impl would be
"augmented", could you elaborate on this? Which bit of what you summarized
would be considered disadvantageous and excluded? In my view all of these
things are advantageous, so why limit the improvement/simplification?

Here are some early thoughts; I suspect more will surface as the object
implementation details are explored.

The current design supports Feature Structure references which includes a Java
cover class (either a JCas cover object, or a basic Java cover object having
only generalized methods for accessing Features), and a low level reference, an
"int".

This "dual" nature requires conversion between these two forms; going from the
int to the Java class form requires either generating a new Java object each
time this is done, or saving a previously generated object, and "looking it up" and reusing it (this is what's normally done for the JCas cover objects). This is a cause of some amount of "overhead", which could be eliminated by moving to an approach which puts the main impl as Java cover objects, and "materializes" as needed, the int form (for backward compatibility, as you do). One would need
to be aware of performance issues for code that used the int form - this code
would likely run quite a bit slower.

The indexing support currently builds indexes around the int form of reference to the Feature Structures. This could be changed to build them around the Java form of reference. This would have the benefit of eliminating the conversions
from the int form to the Java form.

The actual index implementation is a complex trade off between accessing,
iterating, and updating.  The current implementation stores indexes in a
multitude of optimized data structures, which attempt to balance the trade
offs.  This includes the famous rattling iterators - a design choice made to
have updates to a particular type affect index structures only for that type
(and not for all types), which, in turn, was traded off for array-like data
structures to store things (except for Sets). This design choice led, though, to
the need to have iterators over a type and all its subtypes, for the Sorted
case, to have complex logic that set up iterators for all the subtypes and the
type, sort those, and keep them in sorted order as the iterator moved through
the collection.  In some use cases, we've seen where there are 100's or even
1000's of subtypes, and the maintenance of these lists of iterators becomes a
noticeable overhead. Jira https://issues.apache.org/jira/browse/UIMA-4356 speeds things up noticeably when a significant number of these subtype indexes are empty.



I also think that many existing (and future) users like the idea of a platform
which supports a sliding scale of space / time / optimization tradeoffs as
represented by the current UIMA design, so I don't currently think it's a good idea to drop the current UIMA internal design in favor of this new design point.

Could you expand on the space / time / optimization tradeoffs you have in
mind? I think it's a big mis-assumption that the current custom heaps and
index implementations provide any meaningful benefit in terms of speed/memory
usage over a simpler object impl.

In my experiments so far with some real-life usage, the obj-based impl appears to be better in terms of both speed and space. I'm also highly skeptical that
there exist any real-life use cases where the magnitude of the "space"
difference is meaningful (whether that's a net increase or reduction).

In other words, what benefits would be "given up" by abandoning the custom
heaps/indices? If the object based impl satisfies (or can be easily made to
satisfy) the vast majority of practical use cases better, what's really being traded off, and why would one choose to retain all of the complex custom baggage?

In particular, what is the sliding scale of tradeoffs which the current impl
provides? I actually think the object based approach makes it simpler to
build/plug in different index datastructures that could provide usecase
specific tradeoffs (grouping by type within sorted indices being a prime
example). It's also trivial to swap concurrent collections with their
non-concurrent counterparts where threadsafety isn't required, etc.


There are other design points/choices that could be considered. For example,
with today's technology, I think it is quite feasible to create Feature
Structures as Java objects where the features are "fields" in the Java object.
This is enabled by the ability to compile Java classes as part of

the startup of
application instance. I'm thinking along these lines: the current approach to
UIMA Type merging would be followed by a similar JCas cover class (optional
creation) and merging, followed by compiling the JCas cover classes during
startup. This could be a kind of just-in-time (JIT) running of JCasGen at the start of every run, on the fully merged type system. (I'm sure there's issues I
haven't thought of; this is just the beginnings of an idea :-) ).
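Mechanically, the compile-at-startup step itself seems feasible with the JDK's in-process compiler; here is a rough sketch of just that step (illustrative names only, not actual UIMA code, and a real version would need generation of the source from the merged type system plus classloading of the output):

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.nio.file.Path;

// Hypothetical sketch of the "JIT JCasGen" compile step: compile a
// generated cover-class source file at application startup using the
// JDK's built-in compiler (requires running on a JDK, not a bare JRE).
public final class JitJCasGen {

    public static void compileGenerated(Path sourceFile, Path outputDir) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler available (JRE only?)");
        }
        // run() returns 0 on success; writes .class files under outputDir
        int result = compiler.run(null, null, null,
                "-d", outputDir.toString(), sourceFile.toString());
        if (result != 0) {
            throw new IllegalStateException("Compilation of generated class failed");
        }
        // The compiled classes would then be loaded via a URLClassLoader
        // over outputDir before pipeline initialization proceeds.
    }
}
```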

This is an interesting idea, but introducing on-the-fly code generation and
compilation as a standard runtime step sounds a bit precarious (but maybe I'm
wrong). Would user code also require recompilation using the dynamically
generated classes?

It shouldn't.


I also can't envisage how it would retain backwards compatibility with
existing code (which I've learnt is a pretty big deal!), particularly since as you know there are some contexts where custom modifications to generated JCas
classes are used extensively.

The generation of JCas cover classes has always allowed for custom modifications in the source. These are (and would continue to be) "merged" in as part of JCasGen.

Furthermore it's also not clear how much benefit it would give beyond the
obj-based impl already done. From a CAS access point of view there would be
slightly less indirection by removing the arrays, but in generic APIs
reflection would be required. So I think we'd want to be confident this net
speed advantage was non-negligible before paying the above penalties. As
mentioned earlier, I'm fairly sure any additional "space saving" wouldn't be
significant in practical contexts.

I feel that there is a very wide usage of UIMA in many different contexts, since it's been out there for a decade. The space saving may be significant in some
of the existing use cases.
Another reason space saving may be significant involves modern chip memory
management, where there's multiple levels of caching.  IBM's P7, for example,
has an L1 cache size of 32K, and I think uses cache lines of 128 bytes, giving
256 cache slots in L1.  (see
http://www-03.ibm.com/systems/resources/systems_power_software_i_perfmgmt_underthehood.pdf
). Having fewer Java objects, with less pointer-dereferencing among them, can result in much faster execution. I'm reminded of a problem noticed in our form
6 deserialization speed when IBM Java was used - it was ~ 3x slower than the
Oracle Java. This was eventually traced to a buffer spec for the built-in Java
unzipping code, see Jira https://issues.apache.org/jira/browse/UIMA-4283 .
Changing a buffer from a max of 32K to 1K (shrinking it) made the IBM Java run
the same as the Oracle one.  I think the unzipping code was "flushing" the L1
cache, and causing this effect (which was very big, for this part of the system).




-Marshall

On 4/2/2015 3:55 PM, Nick Hill wrote:
Thanks Richard, more replies below...

Quoting Richard Eckart de Castilho <[email protected]>:

Hi Nick,

On 02.04.2015, at 01:37, Nick Hill <[email protected]> wrote:

From my point of view, it would be nice if it was possible to configure the
UIMA framework to produce either this new kind of CAS or the old one
without having to exchange a JAR - doing so statically at initialization
time or even dynamically at runtime. E.g. to allow easily running test
cases against both implementations.

When you say "produce", there shouldn't be any visible difference in
anything output or persisted, the impl is just how the CAS is stored
internally in memory while processing is happening.

It won't be possible to switch the impl being used at runtime. There are
classes for example with the same names but different impls (e.g. CASImpl).
I know this isn't ideal for tests/comparisons between the two impls but
quite a lot of things are currently tightly-coupled to the heap internals and so switching a jar doesn't seem too big a price to pay given no other
code changes are needed.

What do you plan to be the ultimate goal of this experiment? Is it to support
different CAS implementations or is it to replace the existing CAS
implementation with a totally different one?

Most things in UIMA are created through factories (not the CAS so far). So
theoretically, one could replace most classes by custom classes by
reconfiguring the framework to use different factory classes or having the factories produce different implementations. Can you imagine that as well for
the CAS?

For users the implementation shouldn't matter. They shouldn't observe any
functional difference and therefore shouldn't really care if the impl changes underneath. All consuming code should work as-is, with the exception of code
which accesses 'internals' directly - but I'd see this as analogous to
accessing private fields in some java SDK class, which breaks when those
fields change in a newer SDK version.

As such I don't think it would make sense (or be very practical from a
maintenance pov) to support two implementations concurrently or to have a
factory.

Does it mean that the UIMA-C++ implementation is going to be discontinued
officially?

No, just to clarify no agreements or plans have been made. I just wanted to
initiate a discussion around this as a possible idea.
If we were to pursue this alternate implementation, I don't know of any reason why the C++ impl would be discontinued. I had just listed C++ AEs as one of
the things which don't yet work with my current prototype.

Having to recompile the JCas classes is a bit of a blocker to me - but I
remember that Marshall was contemplating about a way to generate JCas
classes at runtime, so this might just be a temporary blocker.

When I say recompile, I don't mean regenerate using JCasGen, just recompile
.class files from the existing jcas .java files. I would expect that you
would typically only be using one version (other than for comparison
purposes - to validate functional equivalence and/or compare performance),
and so this isn't something that would need to be done often.

Compiled JCas classes tend to be shipped as part of frameworks. This means that it will not be possible to switch to a new CAS impl just by replacing a JAR. It will also mean that components from different UIMA-based frameworks cannot be mixed and matched anymore unless some broker like UIMA-AS is used.

The current JCas cover class format is quite old and tightly-coupled to the
heap-based CAS internals. Saying that all new versions of UIMA must be
binary-compatible with these therefore imposes a (somewhat crippling)
restriction on possible internal improvements. You might say that the current
JCas classes break standard abstraction/encapsulation principles if the
expectation is they will be forever forwards binary-compatible.

It would not be hard on the UIMA side to move to a simpler and more abstract
JCas cover class format that should avoid this problem in future, but the
actual move to such a format would be even more disruptive than requiring a recompilation (would require a re-JCasGen), and would have the same issues you
mention above.

I managed to make this object-based impl at least source-compatible with
existing jcas cover classes, by 'converting' the impl of methods called that
were intended to make CAS heap changes to actually be manipulating the FS
objects directly.

In one context, we also rely heavily on CAS addresses serving as unique
identifiers of feature structures in the CAS. Does your implementation
provide any stable feature structure IDs, preferably ones that are part of
the system and not actually declared as features?

Yes, there are various cases where an 'equivalent' of an FS address is
required (for example if the LL API is being used). In this case the id gets allocated on the fly and will subsequently be unique to that FS within the
CAS. In many cases an FS might never have such an ID allocated (it's not
really part of the non-LL "public" APIs), but you can always 'request' one.

I imagine that IDs would be necessary to implement stuff like delta-CAS later
on too.

Are any of the changes so far in any way related to potentially allowing
additions to the type system at runtime?

Not directly related; my goal was just to make the implementation functionally
equivalent but threadsafe (and simpler, faster).
But it's possible (not certain) this new impl may impose fewer barriers to
enabling such capability.

What would be the incentive/benefit for the developer of a UIMA-based
framework/applications or for the users of such frameworks/applications to
switch to the new implementation?

That was the "summary of advantages" I had in the original email, I've
included it again below. The primary "external" benefits I think are the CAS
being thread-safe and faster to manipulate. I understand that many
users/developers might not care about these things, just as they likely
wouldn't care about the code footprint or complexity of the internals, but it also shouldn't adversely impact them to "upgrade" to a new UIMA version based
on this implementation.

I feel that not being able to have more than one thread work on a CAS at the
same time is a major limitation, especially given modern systems typically
have many CPU cores.

- Drastic simplification of code - most proprietary data structure impls removed, many other classes removed, index/index repo impls are about 25% of the size of the heap versions (good for future enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work on the same CAS concurrently - reading, writing and iterating over feature structures. Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed-size CAS backing arrays, no large up-front memory cost for CASes - pooling them should no longer be necessary
- Unlike the current heap impl, when an FS is removed from CAS indices its space is actually freed (can be GC'd)
- Unification of CAS and JCas - the cover class instance (if it exists) "is" the feature structure
- Significantly better performance (speed) for many use cases, especially where there is heavy access of CAS data
- Usage of standard Java data structure classes means it can benefit more "for free" from ongoing improvements in the Java SDK and from hardware optimizations targeted at these classes


Cheers,

-- Richard





