It may be good to open a "Brainstorming" Jira, and attach the code you're thinking of donating, so that people could study it and have a more concrete idea about this.
If it eventually gets accepted, we would also need a Software Grant for this, I think, due to the size.

-Marshall

On 4/2/2015 3:55 PM, Nick Hill wrote:
> Thanks Richard, more replies below...
>
> Quoting Richard Eckart de Castilho <[email protected]>:
>
>> Hi Nick,
>>
>> On 02.04.2015, at 01:37, Nick Hill <[email protected]> wrote:
>>
>>>> From my point of view, it would be nice if it was possible to configure the
>>>> UIMA framework to produce either this new kind of CAS or the old one
>>>> without having to exchange a JAR - doing so statically at initialization
>>>> time or even dynamically at runtime. E.g. to allow easily running test
>>>> cases against both implementations.
>>>
>>> When you say "produce", there shouldn't be any visible difference in
>>> anything output or persisted, the impl is just how the CAS is stored
>>> internally in memory while processing is happening.
>>>
>>> It won't be possible to switch the impl being used at runtime. There are
>>> classes for example with the same names but different impls (e.g. CASImpl).
>>> I know this isn't ideal for tests/comparisons between the two impls but
>>> quite a lot of things are currently tightly-coupled to the heap internals
>>> and so switching a jar doesn't seem too big a price to pay given no other
>>> code changes are needed.
>>
>> What do you plan to be the ultimate goal of this experiment? Is it to support
>> different CAS implementations or is it to replace the existing CAS
>> implementation with a totally different one?
>>
>> Most things in UIMA are created through factories (not the CAS so far). So
>> theoretically, one could replace most classes by custom classes by
>> reconfiguring the framework to use different factory classes or having the
>> factories produce different implementations. Can you imagine that as well for
>> the CAS?
>
> For users the implementation shouldn't matter.
> They shouldn't observe any
> functional difference and therefore shouldn't really care if the impl changes
> underneath. All consuming code should work as-is, with the exception of code
> which accesses 'internals' directly - but I'd see this as analogous to
> accessing private fields in some java SDK class, which breaks when those
> fields change in a newer SDK version.
>
> As such I don't think it would make sense (or be very practical from a
> maintenance pov) to support two implementations concurrently or to have a
> factory.
>
>> Does it mean that the UIMA-C++ implementation is going to be discontinued
>> officially?
>
> No, just to clarify no agreements or plans have been made. I just wanted to
> initiate a discussion around this as a possible idea.
> If we were to pursue this alternate implementation, I don't know of any reason
> why the C++ impl would be discontinued. I had just listed C++ AEs as one of
> the things which don't yet work with my current prototype.
>
>>>> Having to recompile the JCas classes is a bit of a blocker to me - but I
>>>> remember that Marshall was contemplating about a way to generate JCas
>>>> classes at runtime, so this might just be a temporary blocker.
>>>
>>> When I say recompile, I don't mean regenerate using JCasGen, just recompile
>>> .class files from the existing jcas .java files. I would expect that you
>>> would typically only be using one version (other than for comparison
>>> purposes - to validate functional equivalence and/or compare performance),
>>> and so this isn't something that would need to be done often.
>>
>> Compiled JCas classes tend to be shipped as part of frameworks. This means
>> that it will not be possible to switch to a new CAS impl just by replacing a
>> JAR. It will also mean that components from different UIMA-based frameworks
>> cannot be mixed and matched anymore unless some broker like UIMA-AS is used.
>
> The current JCas cover class format is quite old and tightly-coupled to the
> heap-based CAS internals. Saying that all new versions of UIMA must be
> binary-compatible with these therefore imposes a (somewhat crippling)
> restriction on possible internal improvements. You might say that the current
> JCas classes break standard abstraction/encapsulation principles if the
> expectation is they will be forever forwards binary-compatible.
>
> It would not be hard on the UIMA side to move to a simpler and more abstract
> JCas cover class format that should avoid this problem in future, but the
> actual move to such a format would be even more disruptive than requiring a
> recompilation (would require a re-JCasGen), and would have the same issues you
> mention above.
>
> I managed to make this object-based impl at least source-compatible with
> existing jcas cover classes, by 'converting' the impl of methods called that
> were intended to make CAS heap changes to actually be manipulating the FS
> objects directly.
>
>>>> In one context, we also rely heavily on CAS addresses serving as unique
>>>> identifiers of feature structures in the CAS. Does your implementation
>>>> provide any stable feature structure IDs, preferably ones that are part of
>>>> the system and not actually declared as features?
>>>
>>> Yes, there are various cases where an 'equivalent' of an FS address is
>>> required (for example if the LL API is being used). In this case the id gets
>>> allocated on the fly and will subsequently be unique to that FS within the
>>> CAS. In many cases an FS might never have such an ID allocated (it's not
>>> really part of the non-LL "public" APIs), but you can always 'request' one.
>>
>> I imagine that IDs would be necessary to implement stuff like delta-CAS later
>> on too.
>>
>> Are any of the changes so far in any way related to potentially allowing
>> additions to the type system at runtime?
>
> Not directly related; my goal was just to make the implementation functionally
> equivalent but threadsafe (and simpler, faster).
> But it's possible (not certain) this new impl may impose fewer barriers to
> enabling such capability.
>
>> What would be the incentive/benefit for the developer of a UIMA-based
>> framework/applications or for the users of such frameworks/applications to
>> switch to the new implementation?
>
> That was the "summary of advantages" I had in the original email, I've
> included it again below. The primary "external" benefits I think are the CAS
> being thread-safe and faster to manipulate. I understand that many
> users/developers might not care about these things, just as they likely
> wouldn't care about the code footprint or complexity of the internals, but it
> also shouldn't adversely impact them to "upgrade" to a new UIMA version based
> on this implementation.
>
> I feel that not being able to have more than one thread work on a CAS at the
> same time is a major limitation, especially given modern systems typically
> have many CPU cores.
>
> - Drastic simplification of code - most proprietary data structure impls
>   removed, many other classes removed, index/index repo impls are about 25% of
>   the size of the heap versions (good for future enhancements/maintainability)
> - Thread safety - multiple logically independent annotators can work on the
>   same CAS concurrently - reading, writing and iterating over feature
>   structures.
>   Opens up a lot of parallelism possibilities
> - No need for heap resizing or wasted space in fixed size CAS backing arrays,
>   no large up-front memory cost for CASes - pooling them should no longer be
>   necessary
> - Unlike the current heap impl, when a FS is removed from CAS indices its
>   space is actually freed (can be GC'd)
> - Unification of CAS and JCas - cover class instance (if it exists) "is" the
>   feature structure
> - Significantly better performance (speed) for many use-cases, especially
>   where there is heavy access of CAS data
> - Usage of standard Java data structure classes means it can benefit more "for
>   free" from ongoing improvements in the java SDK and from hardware
>   optimizations targeted at these classes
>
>>
>> Cheers,
>>
>> -- Richard
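
To make the on-the-fly ID allocation discussed in the thread a bit more concrete, here is a minimal Java sketch. It is not taken from the prototype: FsObject and FsIdAllocator are hypothetical names invented for illustration, not UIMA classes, and the real implementation may well differ. The sketch only shows the idea that a feature structure object has no ID until one is requested, and that the ID then stays stable and unique within one CAS.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an object-based feature structure (not a UIMA class). */
final class FsObject {
    final String typeName;
    // 0 means "no ID requested yet"; guarded by synchronized (this).
    int id = 0;

    FsObject(String typeName) {
        this.typeName = typeName;
    }

    /** True once an ID has been allocated for this feature structure. */
    synchronized boolean hasId() {
        return id != 0;
    }
}

/** Hypothetical per-CAS allocator that hands out IDs lazily, on first request. */
final class FsIdAllocator {
    private final AtomicInteger nextId = new AtomicInteger(1);
    // Assumption: a strong map is used for simplicity. It keeps every FS that
    // has an ID reachable, so a real implementation would need another scheme
    // if such FSs should still be garbage-collectable.
    private final Map<Integer, FsObject> byId = new ConcurrentHashMap<>();

    /** Returns the existing ID of fs, allocating one first if necessary. */
    int idOf(FsObject fs) {
        synchronized (fs) {
            if (fs.id == 0) {
                fs.id = nextId.getAndIncrement();
                byId.put(fs.id, fs);
            }
            return fs.id;
        }
    }

    /** Resolves an ID back to its feature structure, or null if unknown. */
    FsObject lookup(int id) {
        return byId.get(id);
    }
}

final class LazyIdDemo {
    public static void main(String[] args) {
        FsIdAllocator allocator = new FsIdAllocator();
        FsObject token = new FsObject("Token");
        System.out.println("has id before request: " + token.hasId());
        int id = allocator.idOf(token);
        System.out.println("id stable: " + (id == allocator.idOf(token)));
        System.out.println("lookup works: " + (allocator.lookup(id) == token));
    }
}
```

Synchronizing on the individual FsObject keeps allocation safe when several threads request an ID for the same feature structure, while requests for different feature structures do not block each other.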
