Thanks Richard, have put some responses inline below.
Quoting Richard Eckart de Castilho <[email protected]>:
Hi Nick,
sounds like a very interesting direction you are moving in here.
From my point of view, it would be nice if it was possible to
configure the UIMA framework to produce either this new kind of CAS
or the old one without having to exchange a JAR - doing so
statically at initialization time or even dynamically at runtime.
E.g. to allow easily running test cases against both implementations.
When you say "produce", there shouldn't be any visible difference in
anything output or persisted, the impl is just how the CAS is stored
internally in memory while processing is happening.
It won't be possible to switch the impl being used at runtime. There
are classes for example with the same names but different impls (e.g.
CASImpl). I know this isn't ideal for tests/comparisons between the
two impls but quite a lot of things are currently tightly-coupled to
the heap internals and so switching a jar doesn't seem too big a price
to pay given no other code changes are needed.
A branch doesn't sound as attractive to me as I think it increases
the risk of making changes specific to this new kind of CAS that are
incompatible with the old one.
It's really not a new kind of CAS. Again "externally" there shouldn't
be visible differences and so the only kind of changes that would
apply here would be to public interfaces, for example some new method
in the CAS interface - it would need to be implemented by the new impl
too while both existed.
My branch suggestion was more just to ask what might be a good way to
share the code for others to look at, try out, and contribute to.
Having to recompile the JCas classes is a bit of a blocker for me -
but I remember that Marshall was contemplating a way to
generate JCas classes at runtime, so this might just be a temporary
blocker.
When I say recompile, I don't mean regenerate using JCasGen, just
recompile .class files from the existing jcas .java files. I would
expect that you would typically only be using one version (other than
for comparison purposes - to validate functional equivalence and/or
compare performance), and so this isn't something that would need to
be done often.
Also, in my context, we tend to rely quite heavily on binary
serialization - all kinds thereof, starting with the
CasCompleteSerializer up to the recent binary forms (specifically 6).
I didn't mean to suggest that it wouldn't be possible to support this,
just that the work hasn't been done yet. As mentioned this started
really just as an experiment.
I'd expect that a simpler binary format would be possible that wasn't
tied to the specific heap-based impl and maybe closer in impl to how
java's native serialization works, but of course we would want to
support the existing formats too.
In one context, we also rely heavily on CAS addresses serving as
unique identifiers of feature structures in the CAS. Does your
implementation provide any stable feature structure IDs, preferably
ones that are part of the system and not actually declared as
features?
Yes, there are various cases where an 'equivalent' of an FS address is
required (for example if the LL API is being used). In this case the
id gets allocated on the fly and will subsequently be unique to that
FS within the CAS. In many cases an FS might never have such an ID
allocated (it's not really part of the non-LL "public" APIs), but you
can always 'request' one.
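To make the on-demand ID allocation described above concrete, here is a minimal sketch. It is not the prototype's code: the class names LazyFsIdSketch, Cas and FeatureStructure, and the methods getId, hasId, register and lookup, are all hypothetical and only illustrate the idea that an FS has no ID until one is requested, after which the ID stays fixed and unique within its CAS.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class LazyFsIdSketch {

    /** Hypothetical stand-in for a feature structure held as a plain Java object. */
    static final class FeatureStructure {
        private final Cas owner;
        private volatile int id; // 0 means "no ID allocated yet"

        FeatureStructure(Cas owner) {
            this.owner = owner;
        }

        /** Returns this FS's ID, allocating one on the first request only. */
        int getId() {
            int current = id;
            if (current == 0) {
                synchronized (this) {
                    if (id == 0) {
                        id = owner.register(this);
                    }
                    current = id;
                }
            }
            return current;
        }

        boolean hasId() {
            return id != 0;
        }
    }

    /** Hypothetical stand-in for the CAS: hands out IDs and resolves them back to objects. */
    static final class Cas {
        private final AtomicInteger nextId = new AtomicInteger(1);
        private final Map<Integer, FeatureStructure> byId = new ConcurrentHashMap<>();

        FeatureStructure createFs() {
            return new FeatureStructure(this);
        }

        int register(FeatureStructure fs) {
            int newId = nextId.getAndIncrement();
            byId.put(newId, fs);
            return newId;
        }

        FeatureStructure lookup(int id) {
            return byId.get(id);
        }
    }

    public static void main(String[] args) {
        Cas cas = new Cas();
        FeatureStructure a = cas.createFs();
        FeatureStructure b = cas.createFs();
        int idB = b.getId();
        System.out.println(a.hasId());            // prints false: never requested
        System.out.println(idB == b.getId());     // prints true: stable once allocated
        System.out.println(cas.lookup(idB) == b); // prints true
    }
}
```

One consequence of this particular sketch is that the lookup map holds a strong reference to every FS that has been given an ID, so such an FS could not be garbage collected while the CAS is alive. A real implementation would have to decide how to handle that.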
Cheers,
-- Richard
On 01.04.2015, at 08:03, Nick Hill <[email protected]> wrote:
Hi all, I work with Marshall and Eddie and have been using UIMA for
some time but am new to the mailing list.
As an experiment, I re-implemented the (java) CAS internals such
that each feature structure corresponds to a single java object
instead of using the custom "heaps" (monolithic arrays), and
indices are built from standard java SDK (concurrent) collection
classes.
The original motivation was to make the CAS threadsafe but I think
there are other benefits, the biggest of which may be
reduction/simplification of the codebase.
This new impl should be fully compatible with all of the existing
CAS APIs, with a few exceptions (see below). i.e. in most cases it
can be a drop-in replacement for uima-core.jar. Existing JCas cover
classes can be used but must be recompiled. I also included a
"compatibility layer" for the low level CAS API so that existing
usage of it should still work, but removing the heaps of course
obviates the need for it.
Summary of advantages:
- Drastic simplification of code - most proprietary data structure
impls removed, many other classes removed, index/index repo impls
are about 25% of the size of the heap versions (good for future
enhancements/maintainability)
- Thread safety - multiple logically independent annotators can
work on the same CAS concurrently - reading, writing and iterating
over feature structures. Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS
backing arrays, no large up-front memory cost for CASes - pooling
them should no longer be necessary
- Unlike the current heap impl, when a FS is removed from CAS
indices its space is actually freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists)
"is" the feature structure
- Significantly better performance (speed) for many use-cases,
especially where there is heavy access of CAS data
- Usage of standard Java data structure classes means it can
benefit more "for free" from ongoing improvements in the java SDK
and from hardware optimizations targeted at these classes
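As a rough illustration of the object-per-FS and concurrent-index points in the list above, the sketch below keeps annotation-like objects in a java.util.concurrent.ConcurrentSkipListSet and iterates over it while another thread is adding entries. The Annotation class, the ORDER comparator and runDemo are hypothetical names made up for this example, and the choice of ConcurrentSkipListSet is an assumption: the mail only says that indices are built from standard (concurrent) collection classes, not which ones.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrentIndexSketch {

    /** Hypothetical annotation-like feature structure: one plain Java object per FS. */
    static final class Annotation {
        private static final AtomicLong SEQ = new AtomicLong();
        final int begin;
        final int end;
        final long seq = SEQ.incrementAndGet(); // tie-breaker so equal spans can coexist

        Annotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Sort order for the index: by begin, then end, then creation sequence. */
    static final Comparator<Annotation> ORDER =
            Comparator.comparingInt((Annotation a) -> a.begin)
                    .thenComparingInt(a -> a.end)
                    .thenComparingLong(a -> a.seq);

    /** Adds 1000 entries from a writer thread while the calling thread iterates. */
    static int runDemo() {
        ConcurrentSkipListSet<Annotation> index = new ConcurrentSkipListSet<>(ORDER);
        index.add(new Annotation(0, 5));

        Thread writer = new Thread(() -> {
            for (int i = 1; i <= 1000; i++) {
                index.add(new Annotation(i, i + 5));
            }
        });
        writer.start();

        // The set's iterators are weakly consistent: they never throw
        // ConcurrentModificationException and always traverse in ascending order.
        do {
            int previousBegin = -1;
            for (Annotation a : index) {
                if (a.begin < previousBegin) {
                    throw new IllegalStateException("index order violated");
                }
                previousBegin = a.begin;
            }
        } while (writer.isAlive());

        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints 1001
    }
}
```

An entry removed from such a set is no longer referenced by the index, so it becomes eligible for garbage collection once nothing else refers to it. That matches the point above about space being freed.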
Functionality not yet supported:
- Binary serialization/deserialization
- C/C++ framework (requires binary serialization)
- "Delta" CAS related functionality including CAS markers
- Index "auto protection" (recent 2.7 feature)
- Snapshot iterators currently return regular iterators (but all
iterators are safe to use concurrently with modification)
- Multiple classloaders haven't been tested
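For readers unfamiliar with the distinction in the snapshot iterator item above, the sketch below contrasts a copy-based snapshot iterator with a regular iterator over a concurrent collection. It is only an illustration of the general technique and says nothing about how the prototype would eventually implement snapshots. The helper names snapshotIterator and count are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

public class SnapshotIteratorSketch {

    /** Copies the current contents, so later changes to the index are not seen. */
    static <T> Iterator<T> snapshotIterator(Iterable<T> index) {
        List<T> copy = new ArrayList<>();
        for (T item : index) {
            copy.add(item);
        }
        return copy.iterator();
    }

    /** Counts the elements remaining in an iterator. */
    static <T> int count(Iterator<T> it) {
        int n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> index = new ConcurrentSkipListSet<>();
        index.add(1);
        index.add(2);
        index.add(3);

        Iterator<Integer> snapshot = snapshotIterator(index);
        index.add(4); // modification after the snapshot was taken

        System.out.println(count(snapshot));         // prints 3: snapshot is frozen
        System.out.println(count(index.iterator())); // prints 4: fresh iterator sees the update
    }
}
```

The copy costs time and memory proportional to the index size. A regular iterator over a concurrent collection avoids that cost, but it may or may not reflect changes made while the iteration is in progress.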
There are also various other small loose ends and cleanup to do.
I was hoping to see if there's interest from the community in
taking this further, maybe even as a replacement for the current
impl in a future version of uima-core.
I'm not sure of the best way to share the code, but it would be
great to have a branch in the shared SCM repo where the current
prototype could be reviewed and collaboratively evolved to fill the
remaining gaps.
Would welcome any comments or questions!
Thanks,
Nick