[ 
https://issues.apache.org/jira/browse/UIMA-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hill updated UIMA-4329:
----------------------------
    Description: 
I have been experimenting with a simplified CAS implementation where each 
feature structure is an object and the indices are based on standard Java SDK 
concurrent collection classes. This replaces the complex custom array-based 
heaps and index implementations.
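
As a rough illustration of the idea (not the prototype's actual code), the sketch below models a feature structure as a plain Java object and an index as a `ConcurrentSkipListSet` ordered by a comparator. The names `SketchAnnotation` and `SketchIndex`, and the begin-ascending/end-descending ordering, are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a feature structure: just a Java object with fields. */
final class SketchAnnotation {
    private static final AtomicInteger NEXT_ID = new AtomicInteger();

    final int id = NEXT_ID.getAndIncrement(); // tie-breaker so distinct objects never compare as equal
    final int begin;
    final int end;

    SketchAnnotation(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

/** Hypothetical sorted index backed by a standard concurrent collection. */
final class SketchIndex {
    // Assumed ordering: begin ascending, then end descending, then creation id.
    private static final Comparator<SketchAnnotation> ORDER = (a, b) -> {
        int c = Integer.compare(a.begin, b.begin);
        if (c != 0) return c;
        c = Integer.compare(b.end, a.end);
        if (c != 0) return c;
        return Integer.compare(a.id, b.id);
    };

    private final ConcurrentSkipListSet<SketchAnnotation> entries = new ConcurrentSkipListSet<>(ORDER);

    void add(SketchAnnotation fs) {
        entries.add(fs);
    }

    /** Once removed here and otherwise unreferenced, the object is eligible for garbage collection. */
    void remove(SketchAnnotation fs) {
        entries.remove(fs);
    }

    /** Returns the current contents in index order. */
    List<SketchAnnotation> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

No backing array has to be sized up front in this sketch; the set grows and shrinks with its contents.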

The primary motivation was to make the CAS threadsafe so that multiple 
annotators could process one concurrently, but I think there are a number of 
other benefits.

Summary of advantages:
- Drastic simplification of code - most proprietary data structure impls 
removed, many other classes removed, index/index repo impls are about 25% of 
the size of the heap versions (good for future enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work on the 
same CAS concurrently - reading, writing and iterating over feature structures. 
Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS backing arrays, 
no large up-front memory cost for CASes - pooling them should no longer be 
necessary
- Unlike the current heap impl, when an FS is removed from CAS indices its 
space is actually freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists) "is" the 
feature structure
- Significantly better performance (speed) for many use-cases, especially where 
there is heavy access of CAS data
- Usage of standard Java data structure classes means it can benefit more "for 
free" from ongoing improvements in the Java SDK and from hardware optimizations 
targeted at these classes
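
The thread-safety point can be sketched with the standard library alone. In the hedged example below, several writer threads add to a shared `ConcurrentSkipListSet` while the calling thread iterates over it. The iterators of that class are weakly consistent, so iteration does not throw `ConcurrentModificationException`. The class and method names are hypothetical, and integers stand in for feature structures.

```java
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.CountDownLatch;

final class ConcurrentIndexDemo {

    /**
     * Starts `writers` threads that each add `perWriter` distinct values,
     * iterates the set while they run, and returns the final size.
     */
    static int fillConcurrently(int writers, int perWriter) {
        ConcurrentSkipListSet<Integer> index = new ConcurrentSkipListSet<>();
        CountDownLatch done = new CountDownLatch(writers);

        for (int w = 0; w < writers; w++) {
            final int base = w * perWriter; // disjoint value range per writer
            new Thread(() -> {
                try {
                    for (int i = 0; i < perWriter; i++) {
                        index.add(base + i);
                    }
                } finally {
                    done.countDown();
                }
            }).start();
        }

        long seen = 0;
        try {
            // Read concurrently with the writers; each pass sees some consistent-enough view.
            while (done.getCount() > 0) {
                for (Integer ignored : index) {
                    seen++;
                }
            }
            done.await(); // establishes visibility of all writes before the final size check
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return index.size();
    }
}
```

This only shows that the collection itself tolerates concurrent use; whether annotators are logically independent is a separate question.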

I was hoping to see if there's interest from the community in taking this 
further, maybe even as a replacement for the current impl in a future version 
of uima-core. There has already been some discussion on the mailing list under 
the subject "Alternate CAS implementation".

I'm attaching the current prototype, which should support most existing UIMA 
functionality with the exception of:
- Binary serialization/deserialization
- C/C++ framework (requires binary serialization)
- "Delta" CAS related function including CAS markers
- Index "auto protection" (recent 2.7 feature)

Note I don't mean to imply these things can't be supported, just that they 
aren't yet.

Where these things aren't used it should be possible to try out the attached 
uima-core.jar as a drop-in replacement with existing apps/frameworks. An 
important caveat though is that any existing JCas cover classes will need 
recompiling with the new jar (but not re-JCasGenning).

I'll also attach the code. I started by basically ripping out the CAS heaps, so 
there's a lot of code which is just commented out (e.g. in CASImpl.java). Lots 
of cleanup/tidyup is still needed, and there are various places which still 
need fixing for thread safety (e.g. synchronization around some existing 
create-on-first-access logic; this is separate from the indices, though). But 
those things shouldn't affect existing usage. A convention I followed was not 
to rename modified classes (e.g. CASImpl), but where an equivalent impl was 
created from scratch I did give it a new name starting with "CC" (e.g. 
FeatureStructureImpl is now CCFeatureStructure). The "CC" stands for 
"concurrent CAS". I have kept it in sync with the latest compatible changes in the 
uima-core stream, apart from those related to the non-impl'd functions 
mentioned above.
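
One common way to make create-on-first-access logic safe is `ConcurrentHashMap.computeIfAbsent`, which applies the factory at most once per key. The sketch below is a generic illustration under that assumption (Java 8 or later); `LazyRegistry` is a hypothetical name and is not a class in the prototype.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical thread-safe "create on first access" cache. */
final class LazyRegistry<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> factory;

    LazyRegistry(Function<K, V> factory) {
        this.factory = factory;
    }

    /** Returns the cached value for the key, creating it atomically on first access. */
    V get(K key) {
        return cache.computeIfAbsent(key, factory);
    }
}
```

The factory should be short and must not modify the same map, since `computeIfAbsent` may block other updates to that map while it runs.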

Most of the "valid" unit tests work. Some are tied to the internals and no 
longer apply; many don't compile because they use binary serialization and/or 
delta CAS related classes, which I removed for the time being. I had to 
generalize some others, for example because they assumed a specific order in 
places where the order should be arbitrary.
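
A test that should not depend on iteration order can compare the expected and actual results as multisets rather than as lists. The helper below is a minimal sketch of that idea; `OrderInsensitive.sameElements` is a hypothetical name, and it assumes the elements implement `equals` and `hashCode` consistently.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

final class OrderInsensitive {

    /** True if both collections hold the same elements with the same multiplicities, in any order. */
    static <T> boolean sameElements(Collection<? extends T> expected, Collection<? extends T> actual) {
        return counts(expected).equals(counts(actual));
    }

    // Counts how often each element occurs.
    private static <T> Map<T, Integer> counts(Collection<? extends T> items) {
        Map<T, Integer> result = new HashMap<>();
        for (T item : items) {
            result.merge(item, 1, Integer::sum);
        }
        return result;
    }
}
```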

md5 checksums:
{{69f8e01eda8576960a3e6324a0d03d77 *uima-core_obj-0.5.jar}}
{{3b90ebc78035c68c8c86b31abc8b3b68 *uima-core_obj-0.5.tar.gz}}
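
A downloaded attachment can be checked against these values with any MD5 tool. For completeness, here is a small sketch using `java.security.MessageDigest`; the class name `Md5Check` is made up for this example, and it reads the whole file into memory, which is fine for files of this size.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Check {

    /** Returns the MD5 digest of the data as lowercase hex. */
    static String md5Hex(byte[] data) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform implementation.
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True if the file's MD5 matches the expected hex string (case-insensitive). */
    static boolean matches(Path file, String expectedHex) throws IOException {
        return md5Hex(Files.readAllBytes(file)).equalsIgnoreCase(expectedHex);
    }
}
```

For example, `Md5Check.matches(Paths.get("uima-core_obj-0.5.jar"), "69f8e01eda8576960a3e6324a0d03d77")` should return true for an intact download (with `java.nio.file.Paths` imported).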


  was:
I have been experimenting with a simplified CAS implementation where each 
feature structure is an object and the indices are based on standard Java SDK 
concurrent collection classes. This replaces the complex custom array-based 
heaps and index implementations.

The primary motivation was to make the CAS threadsafe so that multiple 
annotators could process one concurrently, but I think there are a number of 
other benefits.

Summary of advantages:
- Drastic simplification of code - most proprietary data structure impls 
removed, many other classes removed, index/index repo impls are about 25% of 
the size of the heap versions (good for future enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work on the 
same CAS concurrently - reading, writing and iterating over feature structures. 
Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS backing arrays, 
no large up-front memory cost for CASes - pooling them should no longer be 
necessary
- Unlike the current heap impl, when an FS is removed from CAS indices its 
space is actually freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists) "is" the 
feature structure
- Significantly better performance (speed) for many use-cases, especially where 
there is heavy access of CAS data
- Usage of standard Java data structure classes means it can benefit more "for 
free" from ongoing improvements in the Java SDK and from hardware optimizations 
targeted at these classes

I was hoping to see if there's interest from the community in taking this 
further, maybe even as a replacement for the current impl in a future version 
of uima-core. There has already been some discussion on the mailing list under 
the subject "Alternate CAS implementation".

I'm attaching the current prototype, which should support most existing UIMA 
functionality with the exception of:
- Binary serialization/deserialization
- C/C++ framework (requires binary serialization)
- "Delta" CAS related function including CAS markers
- Index "auto protection" (recent 2.7 feature)

Note I don't mean to imply these things can't be supported, just that they 
aren't yet.

Where these things aren't used it should be possible to try out the attached 
uima-core.jar as a drop-in replacement with existing apps/frameworks. An 
important caveat though is that any existing JCas cover classes will need 
recompiling with the new jar (but not re-JCasGenning).

I'll also attach the code. I started by basically ripping out the CAS heaps, so 
there's a lot of code which is just commented out (e.g. in CASImpl.java). Lots 
of cleanup/tidyup is still needed, and there are various places which still 
need fixing for thread safety (e.g. synchronization around some existing 
create-on-first-access logic; this is separate from the indices, though). But 
those things shouldn't affect existing usage. A convention I followed was not 
to rename modified classes (e.g. CASImpl), but where an equivalent impl was 
created from scratch I did give it a new name starting with "CC" (e.g. 
FeatureStructureImpl is now CCFeatureStructure). The "CC" stands for 
"concurrent CAS". I have kept it in sync with the latest compatible changes in the 
uima-core stream, apart from those related to the non-impl'd functions 
mentioned above.

Most of the "valid" unit tests work. Some are tied to the internals and no 
longer apply; many don't compile because they use binary serialization and/or 
delta CAS related classes, which I removed for the time being. I had to 
generalize some others, for example because they assumed a specific order in 
places where the order should be arbitrary.

md5 checksums:
{{9058c655c0d7e6cdc08ade343588bb6b *uima-core_obj-0.4.jar}}
{{6ff2d7a86de1a906921ab60bc63499ed *uimaj-core_obj-0.4.tar.gz}}



> Object-based CAS implementation proposal/prototype
> --------------------------------------------------
>
>                 Key: UIMA-4329
>                 URL: https://issues.apache.org/jira/browse/UIMA-4329
>             Project: UIMA
>          Issue Type: Brainstorming
>          Components: Core Java Framework
>            Reporter: Nick Hill
>            Priority: Minor
>         Attachments: uima-core_obj-0.5.jar, uima-core_obj-0.5.tar.gz
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
