On 06.04.2015, at 02:04, Nick Hill <[email protected]> wrote:

> First I just want to emphasize that this proposal doesn't necessarily have 
> full endorsement yet from Marshall, Eddie, et al. so I won't comment on the 
> strategic roadmap questions. I'm just attempting to make the case for a 
> direction which to me seems very natural.

Maybe Eddie or Marshall could comment on this.

> Logically UIMA is an object graph with some fixed type system, rooted in one 
> or more collections with different ordering/uniqueness rules. These are all 
> things Java provides out-of-the box, with a highly evolved heap/GC engine and 
> numerous powerful SDK collection classes. Thus I would argue the most simple 
> and "obvious" way to implement the UIMA specification would be a thin layer 
> on top of these existing constructs, and there would need to be quite a 
> compelling reason to deviate significantly from this.

I briefly skimmed the new CCFeatureStructure class (what does CC stand for? 
Why not call it FeatureStructureImpl, in line with the naming conventions in 
UIMA?). There still appear to be per-feature-structure "heaps" (the values 
array and the intValues array).
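
To illustrate what I mean by per-feature-structure "heaps", here is a minimal 
sketch. Only the two array names come from what I saw in the class; the class 
name, the slot-based accessors and the feature layout are made up for 
illustration and are not the proposed implementation.

```java
// Hypothetical illustration only; this is NOT the actual CCFeatureStructure code.
final class PerFsHeapSketch {
    // Reference-valued features (strings, other feature structures), one slot each.
    private final Object[] values;
    // Int-valued features, kept in a separate primitive array.
    private final int[] intValues;

    PerFsHeapSketch(int refFeatureCount, int intFeatureCount) {
        this.values = new Object[refFeatureCount];
        this.intValues = new int[intFeatureCount];
    }

    // Slots are assumed to be assigned per feature by the type system.
    void setRef(int slot, Object value) { values[slot] = value; }
    Object getRef(int slot) { return values[slot]; }
    void setInt(int slot, int value) { intValues[slot] = value; }
    int getInt(int slot) { return intValues[slot]; }

    public static void main(String[] args) {
        // An "annotation" with one reference feature and two int features (begin, end).
        PerFsHeapSketch annotation = new PerFsHeapSketch(1, 2);
        annotation.setInt(0, 5);
        annotation.setInt(1, 12);
        annotation.setRef(0, "Token");
        System.out.println(annotation.getRef(0) + " begin=" + annotation.getInt(0)
                + " end=" + annotation.getInt(1));
    }
}
```

Each object still carries its own small arrays addressed by slot number, which 
is why I call them "heaps", just at a much smaller scale than the global CAS 
heap.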

> I completely understand that in the past such reasons existed, but it's now 
> 10 years later and JVMs and hardware have moved on considerably. I'm fairly 
> certain that those reasons are no longer valid today, and yet we are still 
> paying the cost of significant complexity for something with more limitations 
> than we would have otherwise. By carving out static chunks of heap and doing 
> a proprietary form of "memory management" within them it prevents the JVM 
> from optimizing how/where this data is stored and collected. This is very 
> much working against how the JVM was designed to be used.

My understanding was that the heap organization in Java was made to resemble 
that in the UIMA C++ implementation and allowed for fast data exchange between 
Java and C++. That is why I was asking about the fate of UIMA C++. Again, maybe 
Marshall or Eddie can comment here.

> The standard UIMA (non-JCas) CAS APIs allow typesystem-agnostic based CAS 
> manipulation, so my understanding is that the only reason for performing low 
> level or direct heap access is to get better performance out of the current 
> array-based impl. I was assuming such usage would be rare by users of UIMA in 
> general, but I could be wrong and it's useful to know people out there like 
> you are doing it. I'd argue though the fact that it is even necessary is 
> another reason for changing the approach (given that a goal of UIMA is to 
> minimize effort required by NLP developers). Could you elaborate on your 
> usage of internal APIs?

I elaborated in the issue: 
https://issues.apache.org/jira/browse/UIMA-4329?focusedCommentId=14486883&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14486883

> Regarding serialization formats, each of them is just a well-defined serial 
> representation of a CAS, so should not be affected by the runtime 
> implementation.
> I do understand that the current binary formats derive from the CAS array 
> internals, but it doesn't mean that this new impl couldn't read/write that 
> same format. I expect here specifically there may be a relative performance 
> impact because of the 'reconstruction' of the heaps that would be needed, 
> however:
> - In a way, keeping the CAS in this form in memory could be seen as 
> optimizing for speed of this specific binary format at the expense of slower 
> and less flexible runtime CAS access
> - Alternative binary serialization mechanisms (and formats) could also be 
> used similar to standard java object serialization, which I expect would be 
> just as fast (although not default java serialization which is very 
> inefficient)
> - I'd question in any case whether this alone should dictate the overall 
> architecture choice

The CASCompleteSerializer may not be that well-defined. As I understand it, it 
allows the heap structures of the CAS to be serialized as-is using Java object 
serialization. As mentioned in the issue linked above, there are use cases 
where the CASCompleteSerializer has proven quite useful.
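
As a rough sketch of the idea (not the actual CASCompleteSerializer code, and 
not using any UIMA classes): if the CAS data lives in flat arrays, plain Java 
object serialization can write those arrays as-is, and every array index 
("address") is the same after reading them back. The class FlatHeapSketch and 
its fields are hypothetical stand-ins.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a CAS whose data is kept in flat arrays.
final class FlatHeapSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    final int[] heap;          // feature structures laid out back to back
    final String[] stringHeap; // string values referenced by index

    FlatHeapSketch(int[] heap, String[] stringHeap) {
        this.heap = heap;
        this.stringHeap = stringHeap;
    }

    // Writes the arrays unchanged using standard Java object serialization.
    static byte[] write(FlatHeapSketch cas) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(cas);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the arrays back; indices into heap and stringHeap are unchanged.
    static FlatHeapSketch read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (FlatHeapSketch) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With an object-graph implementation there are no such arrays to dump, so 
stable addresses would have to be produced by some other mechanism.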

But I agree that the serialization format should be largely decoupled from the 
internal representation. It should still be fast, though, and it should keep 
other useful properties such as preserving CAS addresses (or other unique IDs 
outside the type system) ;)

Cheers,

-- Richard
