[
https://issues.apache.org/jira/browse/UIMA-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486883#comment-14486883
]
Richard Eckart de Castilho commented on UIMA-4329:
--------------------------------------------------
I'm fooling around a bit with the alternative uimaj-core.
For my purposes, I changed the pom.xml in your file so that the artifactId is
"uimaj-core" and the version is "2.7.1-nick-snapshot". That makes it easier for
me to update just the core JAR via Maven dependency management mechanisms for
multi-module projects.
It appears there may be changes in the file that do not pertain to the
alternative CAS. One test does not compile because the method
"AnalysisEngineManagementImpl.getRootName("bar")" is missing.
In WebAnno [1], we heavily rely on LowLevelCas, CAS addresses, and binary
serialization in all forms. That shows when dropping in the alternative CAS
impl:
* the method getAddress() is undefined for <JCas class> - we need a stable
identifier for feature structures that also can remain stable across
serialization (for this we currently rely on CASCompleteSerializer, because
binary form 6 doesn't keep the addresses stable)
* the method getLowLevelCas() is undefined for type JCas - we use this as an
alternative way to access/resolve CAS addresses
* import org.apache.uima.cas.impl.CASCompleteSerializer cannot be resolved -
used as a fast serialization that maintains CAS addresses
* import org.apache.uima.cas.impl.Serialization cannot be resolved - used as a
fast serialization that maintains CAS addresses
We decided to use the CAS addresses because they offer a fast and convenient
way of random access to feature structures in the CAS and because we didn't
have to meddle with the type system. In this way, the type system can be kept
free of WebAnno-specific information.
If we wanted to switch WebAnno to the new CAS implementation, we'd need
* some way of uniquely identifying FSes even across serialization. A short ID
like an integer would be convenient (i.e. no UUID)
* a fast (de-)serialization of the CAS
I understand that you are considering adding both of these features anyway.
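To make the two requirements above concrete, here is a minimal sketch of
integer IDs that stay stable across a serialization round trip. It is only an
illustration under stated assumptions: "Fs" and "FsStore" are hypothetical
names and not UIMA classes, and plain Java object serialization stands in for
whatever fast binary format the new CAS might eventually provide.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class StableIdSketch {

    /** Hypothetical stand-in for a feature structure; not a UIMA class. */
    static class Fs implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;      // short, stable identifier (assumption: an int is enough)
        final String type;
        final int begin;
        final int end;

        Fs(int id, String type, int begin, int end) {
            this.id = id;
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical container that hands out IDs and serializes them with the data. */
    static class FsStore implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Map<Integer, Fs> byId = new HashMap<>();
        private int nextId = 1; // serialized too, so IDs are never reused after a reload

        Fs create(String type, int begin, int end) {
            Fs fs = new Fs(nextId++, type, begin, end);
            byId.put(fs.id, fs);
            return fs;
        }

        Fs get(int id) {
            return byId.get(id); // random access by ID, no type system changes needed
        }

        void remove(int id) {
            byId.remove(id);     // the removed object becomes eligible for GC
        }
    }

    static byte[] serialize(FsStore store) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(store);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static FsStore deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (FsStore) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        FsStore store = new FsStore();
        Fs token = store.create("Token", 0, 5);
        Fs sentence = store.create("Sentence", 0, 42);
        store.remove(token.id);

        FsStore copy = deserialize(serialize(store));

        // The ID obtained before serialization still resolves afterwards.
        Fs restored = copy.get(sentence.id);
        System.out.println(restored.type + " " + restored.begin + "-" + restored.end);

        // New IDs continue after the highest one ever assigned.
        System.out.println(copy.create("Token", 6, 10).id);
    }
}
```

The point of the sketch is that the ID is part of the serialized state rather
than a side effect of memory layout, so an external reference (for example one
kept by an annotation editor) remains valid after a reload.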
As an upgrade path for our users, we could provide a command line tool to
convert all data to XMI and then a second tool to convert the XMI to a new
fast binary serialization format. It would be more convenient of course if both
CAS implementations could co-exist in the same JVM because then we wouldn't
need two tools for conversion (ok, we could do classloader magic to work around
this and actually load two instances of the framework but that's also not the
most trivial approach...).
Next I'll look at DKPro Core.
[1] http://webanno.googlecode.com
> Object-based CAS implementation proposal/prototype
> --------------------------------------------------
>
> Key: UIMA-4329
> URL: https://issues.apache.org/jira/browse/UIMA-4329
> Project: UIMA
> Issue Type: Brainstorming
> Components: Core Java Framework
> Reporter: Nick Hill
> Priority: Minor
> Attachments: uima-core_obj-0.2.jar, uimaj-core_obj-0.2.tar.gz
>
>
> I have been experimenting with a simplified CAS implementation where each
> feature structure is an object and the indices are based on standard Java SDK
> concurrent collection classes. This replaces the complex custom array-based
> heaps and index implementations.
> The primary motivation was to make the CAS threadsafe so that multiple
> annotators could process one concurrently, but I think there are a number of
> other benefits.
> Summary of advantages:
> - Drastic simplification of code - most proprietary data structure impls
> removed, many other classes removed, index/index repo impls are about 25% of
> the size of the heap versions (good for future enhancements/maintainability)
> - Thread safety - multiple logically independent annotators can work on the
> same CAS concurrently - reading, writing and iterating over feature
> structures. Opens up a lot of parallelism possibilities
> - No need for heap resizing or wasted space in fixed size CAS backing arrays,
> no large up-front memory cost for CASes - pooling them should no longer be
> necessary
> - Unlike the current heap impl, when a FS is removed from CAS indices its
> space is actually freed (can be GC'd)
> - Unification of CAS and JCas - cover class instance (if it exists) "is" the
> feature structure
> - Significantly better performance (speed) for many use-cases, especially
> where there is heavy access of CAS data
> - Usage of standard Java data structure classes means it can benefit more
> "for free" from ongoing improvements in the Java SDK and from hardware
> optimizations targeted at these classes
> I was hoping to see if there's interest from the community in taking this
> further, maybe even as a replacement for the current impl in a future version
> of uima-core. There has already been some discussion on the mailing list
> under the subject "Alternate CAS implementation".
> I'm attaching the current prototype, which should support most existing UIMA
> functionality with the exception of:
> - Binary serialization/deserialization
> - C/C++ framework (requires binary serialization)
> - "Delta" CAS related function including CAS markers
> - Index "auto protection" (recent 2.7 feature)
> Note I don't mean to imply these things can't be supported, just that they
> aren't yet.
> Where these things aren't used it should be possible to try out the attached
> uima-core.jar as a drop-in replacement with existing apps/frameworks. An
> important caveat though is that any existing JCas cover classes will need
> recompiling with the new jar (but not re-JCasGenning).
> I'll also attach the code. I started by basically ripping out the CAS heaps,
> so there's a lot of code which is just commented out (e.g. in CASImpl.java).
> Lots of cleanup/tidyup is still needed, and there are various places which still
> need fixing for threadsafety (e.g. synchronization around some existing
> create-on-first-access logic.. this is separate to the indices though). But
> those things shouldn't affect existing usage. A convention I followed was not
> to rename modified classes (e.g. CASImpl), but where an equivalent impl was
> created from scratch I did give it a new name starting with "CC" (e.g.
> FeatureStructureImpl is now CCFeatureStructure). The cc stood for "concurrent
> CAS". I have kept it in sync with the latest compatible changes in the
> uima-core stream, apart from those related to the non-impl'd functions
> mentioned above.
> Most of the "valid" unit tests work. Some are tied to the internals and no
> longer apply, many don't compile because they use binary serialization and/or
> delta CAS related classes which I removed for the time being. Some others I
> had to generalize a bit because for example they assumed a specific order in
> places where the order should be arbitrary, and maybe some other similar
> reasons.
> md5 checksums:
> {{94499c8f18f832fd1ded9106c64e8c76 *uima-core_obj-0.2.jar}}
> {{0cac18e89c616a8270e810f34b6468ad *uimaj-core_obj-0.2.tar.gz}}