[ 
https://issues.apache.org/jira/browse/UIMA-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486825#comment-14486825
 ] 

Nick Hill commented on UIMA-4329:
---------------------------------

I tried it out with uimaFIT - in particular ran the unit tests in uimafit-core 
and uimafit-examples against it. There were a few errors:
- One compilation error where {{FeatureStructureImpl}} is referenced. I've 
added an empty 'dummy' class with the same name for now to prevent this error 
(the actual impl isn't needed)
- Some test failures related to xcas deserialization. This was one small bug 
where a check needed to be moved, now fixed
- Some tests failed which relied on the {{FeatureStructure}} toString format 
that I hadn't impl'd. I copied the toString logic from the current impl and 
these tests now pass
- There were 3 remaining failed tests, all for the same reason, which looks 
"legitimate". The tests assume the {{select()}} methods always return FSs in 
the same order for different supertypes, but these use 
{{FSCollectionFactory.create()}} which contains logic to use either 
{{cas.getAnnotationIndex(type)}} or 
{{cas.getIndexRepository().getAllIndexedFS(type)}}. The former of these will be 
in a deterministic order but the latter may be ordered arbitrarily
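
One way tests of this kind could avoid depending on iteration order is to compare the results as multisets rather than as ordered lists. The following is a minimal sketch in plain Java with no UIMA dependencies. {{UnorderedCompare}} and {{sameElements}} are hypothetical names introduced for illustration only; they are not part of uimaFIT or uima-core, and strings stand in for feature structures.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical test helper: compares two collections while ignoring element order. */
final class UnorderedCompare {
    private UnorderedCompare() {}

    /** Counts how many times each element occurs in the collection. */
    static Map<Object, Integer> counts(Collection<?> items) {
        Map<Object, Integer> result = new HashMap<>();
        for (Object item : items) {
            result.merge(item, 1, Integer::sum);
        }
        return result;
    }

    /** True if both collections hold the same elements with the same multiplicity, in any order. */
    static boolean sameElements(Collection<?> a, Collection<?> b) {
        return a.size() == b.size() && counts(a).equals(counts(b));
    }

    public static void main(String[] args) {
        // Strings stand in for feature structures returned by two differently ordered lookups.
        List<String> viaSortedIndex = Arrays.asList("tokenA", "tokenB", "sentence1");
        List<String> viaAllIndexed = Arrays.asList("sentence1", "tokenB", "tokenA");
        System.out.println(UnorderedCompare.sameElements(viaSortedIndex, viaAllIndexed));
    }
}
```

A test would then assert {{sameElements(expected, actual)}} instead of list equality wherever the order is not part of the contract being tested.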

I've updated the attached files with "v0.2" versions which contain these fixes 
plus some other minor cleanup/refactoring.

h5. Note on test times
All the tests seemed to run at a similar speed or faster, with the exception of 
{{JCasUtilTest.testSelectCoverRandom()}} which was slower. This appeared to be 
due to the different sorted index impl approach, which I think is faster for 
some use cases but slower for others. It should not be hard to modify it to be 
segmented by type similar to the bag index impl / existing sorted impl. I did 
also test a new {{AnnotationIndex.subiterator(start,end)}} method which 
'directly' returns an iterator over all spanned annotations, i.e. avoiding 
the acrobatics currently required - this actually made the test in question faster 
than the existing impl, but I didn't include this in the attached update. It 
would require a change to consuming code which wouldn't be compatible with the 
current impl (although that might be a nice method to add to 
{{AnnotationIndex}} in any case!)
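
To illustrate the general idea of a range lookup over a sorted concurrent index, here is a rough sketch in plain Java. It is not the prototype's code and not the tested {{subiterator(start,end)}} method: {{Span}}, {{SpanIndex}} and {{covered}} are hypothetical names, and the ordering (begin ascending, then end descending, then an id tie-breaker) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

/** Hypothetical stand-in for an annotation: a begin/end span plus an id to keep equal spans distinct. */
final class Span {
    final int begin;
    final int end;
    final int id;

    Span(int begin, int end, int id) {
        this.begin = begin;
        this.end = end;
        this.id = id;
    }
}

/** Hypothetical sorted index backed by a standard concurrent collection. */
final class SpanIndex {
    // Assumed ordering: begin ascending, then end descending (longer spans first), then id.
    private static final Comparator<Span> ORDER = (a, b) -> {
        int c = Integer.compare(a.begin, b.begin);
        if (c != 0) return c;
        c = Integer.compare(b.end, a.end);
        if (c != 0) return c;
        return Integer.compare(a.id, b.id);
    };

    private final NavigableSet<Span> spans = new ConcurrentSkipListSet<>(ORDER);

    void add(Span s) {
        spans.add(s);
    }

    /** Returns all spans lying entirely within [start, end], in index order. */
    List<Span> covered(int start, int end) {
        // Probe elements that sort before/after every real span with the given begin.
        Span from = new Span(start, Integer.MAX_VALUE, Integer.MIN_VALUE);
        Span to = new Span(end, Integer.MIN_VALUE, Integer.MAX_VALUE);
        List<Span> out = new ArrayList<>();
        // The sub-view skips directly to the first candidate instead of scanning from the start.
        for (Span s : spans.subSet(from, true, to, true)) {
            if (s.end <= end) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SpanIndex index = new SpanIndex();
        index.add(new Span(0, 10, 1));
        index.add(new Span(2, 5, 2));
        index.add(new Span(6, 8, 3));
        for (Span s : index.covered(2, 8)) {
            System.out.println(s.begin + "-" + s.end);
        }
    }
}
```

Iterators over a {{ConcurrentSkipListSet}} are weakly consistent, so a lookup like this can run while other threads add to the index without throwing {{ConcurrentModificationException}}, although it may or may not see those concurrent additions.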

> Object-based CAS implementation proposal/prototype
> --------------------------------------------------
>
>                 Key: UIMA-4329
>                 URL: https://issues.apache.org/jira/browse/UIMA-4329
>             Project: UIMA
>          Issue Type: Brainstorming
>          Components: Core Java Framework
>            Reporter: Nick Hill
>            Priority: Minor
>         Attachments: uima-core_obj-0.2.jar, uimaj-core_obj-0.2.tar.gz
>
>
> I have been experimenting with a simplified CAS implementation where each 
> feature structure is an object and the indices are based on standard Java SDK 
> concurrent collection classes. This replaces the complex custom array-based 
> heaps and index implementations.
> The primary motivation was to make the CAS threadsafe so that multiple 
> annotators could process one concurrently, but I think there are a number of 
> other benefits.
> Summary of advantages:
> - Drastic simplification of code - most proprietary data structure impls 
> removed, many other classes removed, index/index repo impls are about 25% of 
> the size of the heap versions (good for future enhancements/maintainability)
> - Thread safety - multiple logically independent annotators can work on the 
> same CAS concurrently - reading, writing and iterating over feature 
> structures. Opens up a lot of parallelism possibilities
> - No need for heap resizing or wasted space in fixed size CAS backing arrays, 
> no large up-front memory cost for CASes - pooling them should no longer be 
> necessary
> - Unlike the current heap impl, when a FS is removed from CAS indices its 
> space is actually freed (can be GC'd)
> - Unification of CAS and JCas - cover class instance (if it exists) "is" the 
> feature structure
> - Significantly better performance (speed) for many use-cases, especially 
> where there is heavy access of CAS data
> - Usage of standard Java data structure classes means it can benefit more 
> "for free" from ongoing improvements in the java SDK and from hardware 
> optimizations targeted at these classes
> I was hoping to see if there's interest from the community in taking this 
> further, maybe even as a replacement for the current impl in a future version 
> of uima-core. There has already been some discussion on the mailing list 
> under the subject "Alternate CAS implementation".
> I'm attaching the current prototype, which should support most existing UIMA 
> functionality with the exception of:
> - Binary serialization/deserialization
> - C/C++ framework (requires binary serialization)
> - "Delta" CAS related function including CAS markers
> - Index "auto protection" (recent 2.7 feature)
> Note I don't mean to imply these things can't be supported, just that they 
> aren't yet.
> Where these things aren't used it should be possible to try out the attached 
> uima-core.jar as a drop-in replacement with existing apps/frameworks. An 
> important caveat though is that any existing JCas cover classes will need 
> recompiling with the new jar (but not re-JCasGenning).
> I'll also attach the code. I started by basically ripping out the CAS heaps, 
> so there's a lot of code which is just commented out (e.g. in CASImpl.java). 
> Lots of cleanup/tidyup is still needed, and there are various places which still 
> need fixing for threadsafety (e.g. synchronization around some existing 
> create-on-first-access logic; this is separate to the indices though). But 
> those things shouldn't affect existing usage. A convention I followed was not 
> to rename modified classes (e.g. CASImpl), but where an equivalent impl was 
> created from scratch I did give it a new name starting with "CC" (e.g. 
> FeatureStructureImpl is now CCFeatureStructure). The cc stood for "concurrent 
> CAS". I have kept it in sync with the latest compatible changes in the 
> uima-core stream, apart from those related to the non-impl'd functions 
> mentioned above.
> Most of the "valid" unit tests work. Some are tied to the internals and no 
> longer apply, many don't compile because they use binary serialization and/or 
> delta CAS related classes which I removed for the time being. Some others I 
> had to generalize a bit because for example they assumed a specific order in 
> places where the order should be arbitrary, and maybe some other similar 
> reasons.
> md5 checksums:
> {{4fd19b5f804fe8d505f697240c8e0366 *uima-core.jar}}
> {{51826aa44111b7f6e1fa307393eda8f4 *uimaj-core_obj.tar.gz}}



