On Wed, 2015-06-24 at 14:31 +0200, Thilo Goetz wrote:
> > Marshall already did some nice work on JSON serialization, so I
> > think there is movement into that direction.
>
> Just to be very clear: that is not good enough. I want a JSON format
> that I can read and write without the help of the framework. From my
> data structures, into my data structures. In some programming language
> that hasn't been invented yet. Simple enough that I don't need to
> absorb and reimplement the whole UIMA philosophy.
>
> > But what I don't understand is how a data format resolves to "less
> > framework". The data format is basically addressing ingestion and
> > export, but not processing or pipelines. Even if you have a simple
> > data format like JSON, there's still the need to run analysis, right?
> > Is the analysis in your scenario just a black box? And in order to
> > apply the analysis, you'll need some kind of API - how do you imagine it?
>
> The analysis is a black box, yes. What else could it be? I don't care
> how the POS tagger does what it does. All I'm interested in is what it
> needs as input, and how it gives me the output. I can parse JSON into
> Java POJOs with Jackson, for example; that's super simple. Writing them
> out is even easier. What APIs do I need other than being able to tell
> some piece of analysis to do its stuff on a bunch of data?

One thing that must have been overlooked when UIMA was built is that
people (like me) have to write code that wants to interact with the CAS
but can't be an AE. In UIMA, the CAS (either in memory or serialized)
is difficult to use without implementing an AE. In those scenarios you
usually have to deal with some kind of serialized CAS anyway. Today it
is really easy to serialize a CAS into XMI, but that format is not
trivial to deal with at all.

And if you would like to interact with it from a different programming
language, the entry barrier is so high that I have never seen it done
anywhere (except in our C++ layer). It is probably easier to build
something similar for that particular use case.

Here, JSON would really help with compatibility across different
environments. Reading, modifying and adding objects to a JSON structure
can be done in most programming languages without much overhead (if the
structure is not too complex).
Sometimes there is even direct support for JSON, e.g. in ElasticSearch
or browsers.
And soon also in Java.
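To make the Jackson point from the quoted mail concrete, here is a minimal sketch of reading such a format into plain objects. The `Token` POJO and the JSON shape are invented for illustration; they are not an actual UIMA serialization format.

```java
// Hypothetical example: binding a JSON annotation array to plain POJOs
// with Jackson. The Token shape is invented, not a real UIMA format.
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonAnnotationSketch {

    // A plain POJO; Jackson binds the JSON fields to the public fields.
    public static class Token {
        public int begin;
        public int end;
        public String pos;
    }

    public static Token[] parseTokens(String json) {
        try {
            return new ObjectMapper().readValue(json, Token[].class);
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String json = "[{\"begin\":0,\"end\":5,\"pos\":\"NNP\"},"
                    + "{\"begin\":6,\"end\":9,\"pos\":\"VBZ\"}]";
        Token[] tokens = parseTokens(json);
        System.out.println(tokens.length + " tokens, first POS: " + tokens[0].pos);
    }
}
```

No type system descriptors, no framework initialization: just a data-binding library and a class that mirrors the data.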

It should be much easier to serialize/deserialize a CAS.
The best practice today is to implement an AE to achieve that, but
again that is not nice when I don't want to deal with AEs.

An AE is great for adding structure to a document. After that is done,
there is often code that works on that structured data. That could be a
MapReduce job that counts the number of tokens in a document collection.

In those cases it would be really nice to just create/deserialize a CAS
and program against the CAS, instead of rebuilding the parts of it that
are needed, e.g. iterating over Person annotations in the order in which
they occur in the text, iterating only over the tokens inside a
sentence, etc.
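The kind of selection logic described above is trivial once annotations are just data. Below is a sketch in plain Java with an invented `Span` type (deliberately not the CAS API): picking the tokens that fall inside a sentence, in text order, which is roughly what a CAS subiterator gives you.

```java
// Sketch with an invented Span type: selecting the tokens inside a
// sentence from plain annotation data, in the order they occur in the
// text.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SpanSketch {

    public static class Span {
        public final int begin;
        public final int end;
        public Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    // All tokens whose offsets lie inside the sentence, sorted by begin offset.
    public static List<Span> tokensIn(Span sentence, List<Span> tokens) {
        List<Span> result = new ArrayList<>();
        for (Span t : tokens) {
            if (t.begin >= sentence.begin && t.end <= sentence.end) {
                result.add(t);
            }
        }
        result.sort(Comparator.comparingInt(t -> t.begin));
        return result;
    }

    public static void main(String[] args) {
        Span sentence = new Span(0, 10);
        List<Span> tokens = new ArrayList<>();
        tokens.add(new Span(4, 10));
        tokens.add(new Span(0, 3));
        tokens.add(new Span(11, 15)); // outside the sentence
        System.out.println(tokensIn(sentence, tokens).size()); // prints 2
    }
}
```

That is the whole point: ten lines of standard collection code instead of reimplementing index and iterator plumbing per project.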

The CAS is also not flexible enough when it comes to the really simple
cases: maybe I just want to process a single FeatureStructure per CAS
with an AE I already built. Some of my AEs only work on higher-level
FSes, like a Person entity. Why is there so much overhead in creating a
CAS with just one FS?

And today it is cumbersome to work with the CAS in Java. The CAS
interface doesn't let me use POJOs, and JCas is too complex (e.g. it
requires code generation).

For UIMA v3 I really hope that we can rebuild the CAS so that it is
something that could have been built today and not 15 years ago.

Jörn
