OK, thanks! Please be patient - volunteers at work (who have other "day" jobs :-) )
-Marshall

On 4/7/2015 4:18 PM, Nick Hill wrote:
> Per Marshall's suggestion I've created a "brainstorming" Jira issue and
> attached the current prototype code:
>
> https://issues.apache.org/jira/browse/UIMA-4329
>
> Regards,
> Nick
>
> Quoting Peter Klügl <[email protected]>:
>
>> Hi Nick,
>>
>> I am (of course) also interested in the alternate CAS implementation.
>>
>> I agree with Marshall that the code should be attached to a Jira issue so
>> that we can take a closer look and investigate its impact on the various
>> tools and libraries (in my case UIMA Ruta).
>>
>> Best,
>>
>> Peter
>>
>> On 06.04.2015 at 02:04, Nick Hill wrote:
>>> First I just want to emphasize that this proposal doesn't necessarily have
>>> full endorsement yet from Marshall, Eddie, et al. so I won't comment on the
>>> strategic roadmap questions. I'm just attempting to make the case for a
>>> direction which to me seems very natural.
>>>
>>> Logically UIMA is an object graph with some fixed type system, rooted in one
>>> or more collections with different ordering/uniqueness rules. These are all
>>> things Java provides out of the box, with a highly evolved heap/GC engine
>>> and numerous powerful SDK collection classes. Thus I would argue the
>>> simplest and most "obvious" way to implement the UIMA specification would be
>>> a thin layer on top of these existing constructs, and there would need to be
>>> quite a compelling reason to deviate significantly from this.
>>>
>>> I completely understand that in the past such reasons existed, but it's now
>>> 10 years later and JVMs and hardware have moved on considerably. I'm fairly
>>> certain that those reasons are no longer valid today, and yet we are still
>>> paying the cost of significant complexity for something with more
>>> limitations than we would have otherwise.
>>> Carving out static chunks of heap and doing a proprietary form of "memory
>>> management" within them prevents the JVM from optimizing how/where this
>>> data is stored and collected. This is very much working against how the JVM
>>> was designed to be used.
>>>
>>> The standard UIMA (non-JCas) CAS APIs allow type-system-agnostic CAS
>>> manipulation, so my understanding is that the only reason for performing
>>> low-level or direct heap access is to get better performance out of the
>>> current array-based impl. I was assuming such usage would be rare among
>>> users of UIMA in general, but I could be wrong and it's useful to know
>>> people out there like you are doing it. I'd argue though that the fact that
>>> it is even necessary is another reason for changing the approach (given that
>>> a goal of UIMA is to minimize effort required by NLP developers). Could you
>>> elaborate on your usage of internal APIs?
>>>
>>> Regarding serialization formats, each of them is just a well-defined serial
>>> representation of a CAS, so should not be affected by the runtime
>>> implementation. I do understand that the current binary formats derive from
>>> the CAS array internals, but it doesn't mean that this new impl couldn't
>>> read/write that same format.
>>> I expect here specifically there may be a relative performance
>>> impact because of the 'reconstruction' of the heaps that would be needed,
>>> however:
>>> - In a way, keeping the CAS in this form in memory could be seen as
>>>   optimizing for speed of this specific binary format at the expense of
>>>   slower and less flexible runtime CAS access
>>> - Alternative binary serialization mechanisms (and formats) could also be
>>>   used, similar to standard Java object serialization, which I expect would
>>>   be just as fast (although not default Java serialization, which is very
>>>   inefficient)
>>> - I'd question in any case whether this alone should dictate the overall
>>>   architecture choice
>>>
>>>> It was my impression in the past that UIMA-Core has always valued
>>>> compatibility very highly, even to the point of adding switches to
>>>> re-enable buggy/undesired behavior in case somebody depended on it.
>>>
>>> I understand this, and I think I managed to keep things functionally
>>> identical. I'm not proposing any change in behaviour.
>>>
>>>> Changing the implementation of the CAS is probably the most radical idea
>>>> I've seen so far in this project.
>>>
>>> It might be radical in terms of the implementation change but again I would
>>> argue it's really just a simplification of the internals. It shouldn't be
>>> radical at all for users of UIMA in general.
>>>
>>>> Are we going to slay the holy cow of compatibility now and if yes at which
>>>> levels?
>>>
>>> Even for this change, the source incompatibility only applies to JCas cover
>>> classes, and only to those because of their current implementation-specific
>>> format.
>>>
>>>> What does such a change mean to the various sub-projects (DUCC, UIMA-AS,
>>>> RUTA, uimaFIT)?
>>>
>>> As long as these projects don't directly manipulate the CAS arrays, there
>>> should be zero impact on them apart from, I would hope, some performance
>>> benefits.
>>> It would also mean that in future they could exploit the thread-safe
>>> nature of the CAS for various purposes.
>>>
>>> Regards,
>>> Nick
>>>
>>> Quoting Richard Eckart de Castilho <[email protected]>:
>>>
>>>> On 03.04.2015, at 22:51, Marshall Schor <[email protected]> wrote:
>>>>
>>>>> It may be good to open a "Brainstorming" Jira, and attach the code you're
>>>>> thinking of donating, so that people could study it and have a more
>>>>> concrete idea about this.
>>>>>
>>>>> If it eventually gets accepted, we would also need a Software Grant for
>>>>> this, I think, due to the size.
>>>>
>>>> It was my impression in the past that UIMA-Core has always valued
>>>> compatibility very highly, even to the point of adding switches to
>>>> re-enable buggy/undesired behavior in case somebody depended on it.
>>>> Changing the implementation of the CAS is probably the most radical idea
>>>> I've seen so far in this project. In principle, I very much like seeing
>>>> UIMA evolve, but I do wonder how such a radical change is imagined to be
>>>> undertaken.
>>>>
>>>> I'm aware that there are various levels of compatibility. My impression so
>>>> far was that source-compatibility was typically not sufficient in the past.
>>>>
>>>> Are we going to slay the holy cow of compatibility now and if yes at which
>>>> levels?
>>>>
>>>> Is there some willingness now to consider setting up a road-map for a
>>>> UIMA-Core version 3?
>>>>
>>>> What does such a change mean to the various sub-projects (DUCC, UIMA-AS,
>>>> RUTA, uimaFIT)?
>>>>
>>>> Personally, I'd be curious to see how much of e.g. DKPro Core or WebAnno
>>>> breaks with such a new implementation. I imagine quite a lot since I've
>>>> become quite fond of binary serialization and internal API usage lately (in
>>>> some cases I might be able to switch to official low-level CAS API...).
>>>> Although I'm very much for evolution and adopting newer technologies, I'm
>>>> afraid testing this (and potentially fixing stuff) will be quite
>>>> time-intensive. Given that in my context, most of the benefits are not very
>>>> relevant so far, such testing would only make sense to me if it was part of
>>>> a larger strategic change - and I think that a properly licensed
>>>> contribution would be pretty much a prerequisite to even look at it in
>>>> detail.
>>>>
>>>> Marshall, Eddie, and Nick, do you have some vision of a strategic UIMA
>>>> roadmap that you can share with us?
>>>>
>>>> Cheers,
>>>>
>>>> -- Richard
