Hi,

While the work on building an up-to-date UIMACPP is underway, the
discussion of whether it makes sense to have one at all is really key.

On Mon, Dec 12, 2022 at 12:22 PM Richard Eckart de Castilho <r...@apache.org>
wrote:

> On 12. Dec 2022, at 17:20, Eddie Epstein <eaepst...@gmail.com> wrote:
> >
> > The combination of uimacpp + amq-cpp does support non-JNI
> > interoperability with uimaj. Although I am only familiar with using
> > uimacpp as a remote component for uimaj applications, it may not be
> > hard for uimacpp to be an application that uses uimaj remote
> > annotators. It is not clear to me that such CAS interface
> > connectivity would be more useful than standard messaging interfaces
> > from python to a java-based process.
> >
> > What scenarios do you have in mind?
>
> Setting up a message queueing system is extra overhead. It is certainly the
> more scalable solution, but it is not at the level of convenience of
> "pip install" or adding a Maven dependency.
>
> Personally, I had mostly the issue of being able to call Python DL
> frameworks from Java. For my purposes at the moment, wrapping deploying
> the Python stuff in a Docker container and exposing it as a HTTP
> (micro)service using CAS XMI or CAS JSON as the wire protocol does the
> trick. Not very efficient but otherwise a good balance between effort
> and effect.
>
> But I'd be interested in the use-cases that others have. The questions from
> my last mail were meant as potential hooks that others could use to bring
> up their scenarios as well.
>
> That said, I think we lost some people on the way because UIMA didn't
> have a good answer when the Python wave rolled in. When people ask me
> today, I point them to Cassis which addresses those parts that pained me
> most (e.g. I use it in those Docker containers mentioned above). But
> again, my perspective is only one of many and I would be interested in
> hearing other's views.
>

I was not aware of dkpro-cassis and it looks really nice. If the objective
is to wrap annotators, it seems the way to go.

In terms of wrapping annotators as remote server calls, UIMA enables
sending only the annotations needed and receiving back only the annotations
requested. This is a major win that can be easily communicated to users.
Somebody can run a BERT server that returns either the [CLS] embedding or
full per-word embeddings without having to do anything themselves to select
one or the other (or both): the framework handles it based on the
requested feature structures.
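
To make the selective-return idea concrete, here is a minimal plain-Python
sketch. It is not UIMA or uimacpp API: the type names, the `annotate`
function and the toy `toy_vector` "model" are all made up for illustration.
The only point is that the caller names the types it wants and the server
computes and returns nothing else.

```python
from typing import Dict, List, Set

# Hypothetical type names, invented for this sketch (not real UIMA types).
CLS = "demo.ClsEmbedding"
TOKEN = "demo.TokenEmbedding"


def toy_vector(text: str) -> List[float]:
    """Stand-in for a real model: a deterministic two-number 'embedding'."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def annotate(text: str, requested: Set[str]) -> Dict[str, list]:
    """Compute and return only the annotation types listed in `requested`."""
    result: Dict[str, list] = {}
    if CLS in requested:
        # One document-level annotation covering the whole text.
        result[CLS] = [{"begin": 0, "end": len(text), "vector": toy_vector(text)}]
    if TOKEN in requested:
        # One annotation per whitespace-separated word, located by offset.
        tokens = []
        pos = 0
        for word in text.split():
            begin = text.index(word, pos)
            end = begin + len(word)
            tokens.append({"begin": begin, "end": end, "vector": toy_vector(word)})
            pos = end
        result[TOKEN] = tokens
    return result
```

A client that only needs the document-level vector would call
`annotate(text, {CLS})` and never pay for the per-word results.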

Also, by using offsets, a client can send 20 MB of text for chapter
segmentation and get back just the chapter offsets, without having to
receive the 20 MB back. (Echoing the whole input back is quite common with
simple JSON APIs.) This is the type of thing I do for the products of my
company, by the way [1].
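
As a rough illustration of the offsets point (a toy sketch, not what my
product or UIMA actually does): the hypothetical `chapter_offsets` function
below assumes chapters start with a line like "Chapter 3" and returns only
(begin, end) character positions, so the response stays tiny no matter how
large the input is.

```python
import re
from typing import List, Tuple


def chapter_offsets(text: str) -> List[Tuple[int, int]]:
    """Return (begin, end) character offsets per chapter, never the text itself.

    Assumption for this sketch: a chapter starts at a line beginning with
    "Chapter <number>" and runs until the next such line or the end of text.
    """
    starts = [m.start() for m in re.finditer(r"^Chapter \d+", text, flags=re.MULTILINE)]
    ends = starts[1:] + [len(text)]
    return list(zip(starts, ends))
```

The client already holds the text, so it can recover any chapter locally
with `text[begin:end]`.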

I'm interested in a much deeper integration of Python and UIMA. The CAS was
a tough sell before but with the advent of dataframes, it is a very natural
extension of the dataframe concept to unstructured information. And using
managed memory in a language with very poor garbage collectors like Python
is also a big win (although I don't know how many people using Python will
see that).

Here is some dream concept code:
https://gist.github.com/DrDub/9413410626b5a77d8f1f576f6447d64e  (getting
the syntax and approach right will take a lot of iterations and
consultations of course)

It is predicated on the possibility of some projects embracing UIMA and
shipping their own wrappers. I know NLTK considered it at some moment [2].
For some projects we might wrap them ourselves (at least at the beginning).

The example includes a remote AE doing BERT embeddings but such AEs are all
remote these days given their hardware requirements and long boot times.

Of course doing a useful pip-installable UIMA package won't be easy as it
will have to ship precompiled binaries for many architectures [3]. There,
GitHub Actions might come in very handy [4].

So my goal is about better NLP in Python using the existing tools the
Python ecosystem has. Integration with uimaj is unclear if the message
queue infrastructure is not available in UIMA3. Using an embedded JVM has
worked very well for me in the past and it doesn't need to fiddle with all
this C++ complexity [5]. I'd really love to call UIMA Ruta scripts in
Python. I feel spaCy rule-based matching wants to do that but the lack of
an abstraction like UIMA stops it in its tracks [6].

P

[1] https://epub-highlighter.com
[2] https://groups.google.com/g/nltk-issues/c/7_OAdglKi8Y/m/sJSHQbJm7tMJ
[3] https://python-packaging-tutorial.readthedocs.io/en/latest/binaries_dependencies.html
[4] https://pythonprogramming.org/automatically-building-python-package-using-github-actions/
[5] http://duboue.net/blog7.html
[6] https://spacy.io/usage/rule-based-matching
