Re: Alternate CAS implementation

Marshall Schor Tue, 07 Apr 2015 14:57:46 -0700

OK, thanks!  Please be patient - volunteers at work (who have other "day" jobs
:-)  )


-Marshall

On 4/7/2015 4:18 PM, Nick Hill wrote:
> Per Marshall's suggestion I've created a "brainstorming" jira issue and
> attached the current prototype code:
>
> https://issues.apache.org/jira/browse/UIMA-4329
>
> Regards,
> Nick
>
> Quoting Peter Klügl <[email protected]>:
>
>> Hi Nick,
>>
>> I am (of course) also interested in the alternate CAS implementation.
>>
>> I agree with Marshall that the code should be attached to a jira issue so
>> that we can take a closer look and investigate its impact for the various
>> tools and libraries (in my case UIMA Ruta).
>>
>> Best,
>>
>> Peter
>>
>> Am 06.04.2015 um 02:04 schrieb Nick Hill:
>>> First I just want to emphasize that this proposal doesn't necessarily have
>>> full endorsement yet from Marshall, Eddie, et al. so I won't comment on the
>>> strategic roadmap questions. I'm just attempting to make the case for a
>>> direction which to me seems very natural.
>>>
>>> Logically UIMA is an object graph with some fixed type system, rooted in one
>>> or more collections with different ordering/uniqueness rules. These are all
>>> things Java provides out of the box, with a highly evolved heap/GC engine
>>> and numerous powerful SDK collection classes. Thus I would argue the most
>>> simple and "obvious" way to implement the UIMA specification would be a thin
>>> layer on top of these existing constructs, and there would need to be quite
>>> a compelling reason to deviate significantly from this.
>>>
>>> I completely understand that in the past such reasons existed, but it's now
>>> 10 years later and JVMs and hardware have moved on considerably. I'm fairly
>>> certain that those reasons are no longer valid today, and yet we are still
>>> paying the cost of significant complexity for something with more
>>> limitations than we would have otherwise. By carving out static chunks of
>>> heap and doing a proprietary form of "memory management" within them it
>>> prevents the JVM from optimizing how/where this data is stored and
>>> collected. This is very much working against how the JVM was designed to be
>>> used.
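
To make the contrast in the paragraph above concrete, here is a minimal sketch. It is not the prototype attached to UIMA-4329 and it uses no UIMA API; the classes ArrayHeapStore, SimpleAnnotation and ObjectStore, and the three-int record layout, are all hypothetical and chosen only for illustration. The first class packs records into a self-managed int array and hands out integer addresses; the second lets the JVM heap and a standard collection do the same job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical: records live in one int[] that the class manages itself.
final class ArrayHeapStore {
    private static final int SLOTS = 3;       // assumed layout: typeCode, begin, end
    private int[] heap = new int[SLOTS * 4];
    private int next = 0;

    // Stores a record and returns its "address" (an index into the array).
    int add(int typeCode, int begin, int end) {
        if (next + SLOTS > heap.length) {
            heap = Arrays.copyOf(heap, heap.length * 2); // manual growth
        }
        int addr = next;
        heap[addr] = typeCode;
        heap[addr + 1] = begin;
        heap[addr + 2] = end;
        next += SLOTS;
        return addr;
    }

    int begin(int addr) { return heap[addr + 1]; }
    int end(int addr)   { return heap[addr + 2]; }
}

// Hypothetical: the same record as an ordinary Java object.
final class SimpleAnnotation {
    final String typeName;
    final int begin;
    final int end;

    SimpleAnnotation(String typeName, int begin, int end) {
        this.typeName = typeName;
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical: storage and reclamation are left to the JVM and its collections.
final class ObjectStore {
    private final List<SimpleAnnotation> annotations = new ArrayList<>();

    SimpleAnnotation add(String typeName, int begin, int end) {
        SimpleAnnotation a = new SimpleAnnotation(typeName, begin, end);
        annotations.add(a);
        return a;
    }

    int size() { return annotations.size(); }
}
```

In the array version a caller holds an int and must know the slot layout to read a field; in the object version a caller holds a reference, and an unreachable record can be collected individually by the garbage collector.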
>>>
>>> The standard UIMA (non-JCas) CAS APIs allow type-system-agnostic CAS
>>> manipulation, so my understanding is that the only reason for performing low
>>> level or direct heap access is to get better performance out of the current
>>> array-based impl. I was assuming such usage would be rare by users of UIMA
>>> in general, but I could be wrong and it's useful to know people out there
>>> like you are doing it. I'd argue though the fact that it is even necessary
>>> is another reason for changing the approach (given that a goal of UIMA is to
>>> minimize effort required by NLP developers). Could you elaborate on your
>>> usage of internal APIs?
>>>
>>> Regarding serialization formats, each of them is just a well-defined serial
>>> representation of a CAS, so should not be affected by the runtime
>>> implementation.
>>> I do understand that the current binary formats derive from the CAS array
>>> internals, but it doesn't mean that this new impl couldn't read/write that
>>> same format. I expect here specifically there may be a relative performance
>>> impact because of the 'reconstruction' of the heaps that would be needed,
>>> however:
>>> - In a way, keeping the CAS in this form in memory could be seen as
>>> optimizing for speed of this specific binary format at the expense of slower
>>> and less flexible runtime CAS access
>>> - Alternative binary serialization mechanisms (and formats) could also be
>>> used similar to standard java object serialization, which I expect would be
>>> just as fast (although not default java serialization which is very
>>> inefficient)
>>> - I'd question in any case whether this alone should dictate the overall
>>> architecture choice
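
The point that a serial format need not mirror the in-memory layout can be sketched as follows. This is an illustration only: Span and FlatSpanCodec are hypothetical names, and the byte layout (a count followed by three ints per record) is an assumption made for the example, not any of UIMA's actual binary formats. The objects live in an ordinary list at runtime and are flattened only when written, then rebuilt when read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical runtime object; not a UIMA class.
final class Span {
    final int typeCode;
    final int begin;
    final int end;

    Span(int typeCode, int begin, int end) {
        this.typeCode = typeCode;
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical codec: assumed layout is [count][typeCode, begin, end]*.
final class FlatSpanCodec {
    // Flattens the object list into the assumed binary layout.
    static byte[] write(List<Span> spans) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(spans.size());
            for (Span s : spans) {
                out.writeInt(s.typeCode);
                out.writeInt(s.begin);
                out.writeInt(s.end);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds plain objects from the same layout.
    static List<Span> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            List<Span> spans = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                int typeCode = in.readInt();
                int begin = in.readInt();
                int end = in.readInt();
                spans.add(new Span(typeCode, begin, end));
            }
            return spans;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The conversion step in write and read corresponds to the 'reconstruction' cost mentioned above: it is paid at serialization time rather than on every runtime access.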
>>>
>>>> It was my impression in the past that UIMA-Core has always valued
>>>> compatibility very highly, even to the point of adding switches to re-enable
>>>> buggy/undesired behavior in case somebody depended on it.
>>>
>>> I understand this, and I think I managed to keep things functionally
>>> identical. I'm not proposing any change in behaviour.
>>>
>>>> Changing the implementation of the CAS is probably the most radical idea
>>>> I've seen so far in this project.
>>>
>>> It might be radical in terms of the implementation change but again I would
>>> argue it's really just a simplification of the internals. It shouldn't be
>>> radical at all for users of UIMA in general.
>>>
>>>> Are we going to slay the holy cow of compatibility now and if yes at which
>>>> levels?
>>>
>>> Even for this change, the source incompatibility only applies to JCas cover
>>> classes, and only to those because of their current implementation-specific
>>> format.
>>>
>>>> What does such a change mean to the various sub-projects (DUCC, UIMA-AS,
>>>> RUTA, uimaFIT)?
>>>
>>> As long as these projects don't directly manipulate the CAS arrays, there
>>> should be zero impact to them apart from I would hope some performance
>>> benefits. It would also mean in future they could exploit the thread-safe
>>> nature of the CAS for various purposes.
>>>
>>> Regards,
>>> Nick
>>>
>>> Quoting Richard Eckart de Castilho <[email protected]>:
>>>
>>>> On 03.04.2015, at 22:51, Marshall Schor <[email protected]> wrote:
>>>>
>>>>> It may be good to open a "Brainstorming" Jira, and attach the code you're
>>>>> thinking of donating, so that people could study it and have a more
>>>>> concrete idea about this.
>>>>>
>>>>> If it eventually gets accepted, we would also need a Software Grant for
>>>>> this, I think, due to the size.
>>>>
>>>> It was my impression in the past that UIMA-Core has always valued
>>>> compatibility very highly, even to the point of adding switches to re-enable
>>>> buggy/undesired behavior in case somebody depended on it. Changing the
>>>> implementation of the CAS is probably the most radical idea I've seen so
>>>> far in this project. In principle, I very much like seeing UIMA evolve,
>>>> but I do wonder how such a radical change is imagined to be undertaken.
>>>>
>>>> I'm aware that there are various levels of compatibility. My impression so
>>>> far was that source-compatibility was typically not sufficient in the past.
>>>>
>>>> Are we going to slay the holy cow of compatibility now and if yes at which
>>>> levels?
>>>>
>>>> Is there some willingness now to consider setting up a road-map for a
>>>> UIMA-Core version 3?
>>>>
>>>> What does such a change mean to the various sub-projects (DUCC, UIMA-AS,
>>>> RUTA, uimaFIT)?
>>>>
>>>> Personally, I'd be curious to see how much of e.g. DKPro Core or WebAnno
>>>> breaks with such a new implementation. I imagine quite a lot since I've
>>>> become quite fond of binary serialization and internal API usage lately (in
>>>> some cases I might be able to switch to official low-level CAS API...).
>>>> Although I'm very much for evolution and adopting newer technologies, I'm
>>>> afraid testing this (and potentially fixing stuff) will be quite time
>>>> intensive. Given that in my context, most of the benefits are not very
>>>> relevant so far, such testing would only make sense to me if it was part of
>>>> a larger strategic change - and I think that a properly licensed
>>>> contribution would be pretty much a pre-requisite to even look at it in
>>>> detail.
>>>>
>>>> Marshall, Eddie, and Nick, do you have some vision of a strategic UIMA
>>>> roadmap that you can share with us?
>>>>
>>>> Cheers,
>>>>
>>>> -- Richard
>>>
>
>
>
