Hi,

While the work on an up-to-date UIMACPP is underway, this discussion of whether it makes sense to have one at all is really key.
On Mon, Dec 12, 2022 at 12:22 PM Richard Eckart de Castilho <r...@apache.org> wrote:

> On 12. Dec 2022, at 17:20, Eddie Epstein <eaepst...@gmail.com> wrote:
> >
> > The combination of uimacpp + amq-cpp does support non-JNI interoperability
> > with uimaj. Although I am only familiar with using uimacpp as a remote
> > component for uimaj applications, it may not be hard for uimacpp to be an
> > application that uses uimaj remote annotators. It is not clear to me that
> > such CAS interface connectivity would be more useful than standard
> > messaging interfaces from python to a java-based process.
> >
> > What scenarios do you have in mind?
>
> Setting up a message queueing system is extra overhead. It is certainly the
> more scalable solution, but it is not at the level of convenience of
> "pip install" or adding a Maven dependency.
>
> Personally, I had mostly the issue of being able to call Python DL frameworks
> from Java. For my purposes at the moment, wrapping the Python stuff
> in a Docker container and exposing it as an HTTP (micro)service using CAS XMI
> or CAS JSON as the wire protocol does the trick. Not very efficient but
> otherwise a good balance between effort and effect.
>
> But I'd be interested in the use-cases that others have. The questions from
> my last mail were meant as potential hooks that others could use to bring
> up their scenarios as well.
>
> That said, I think we lost some people on the way because UIMA didn't have a
> good answer when the Python wave rolled in. When people ask me today, I
> point them to Cassis which addresses those parts that pained me most (e.g.
> I use it in those Docker containers mentioned above). But again, my
> perspective is only one of many and I would be interested in hearing
> others' views.

I was not aware of dkpro-cassis and it looks really nice. If the objective is to wrap annotators, it seems the way to go.
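To make that Docker/HTTP wrapping pattern concrete, here is a minimal, hypothetical sketch of a Python component exposed as an HTTP annotator service. Plain JSON stands in for the real CAS XMI/JSON serialization that dkpro-cassis would provide; the payload shape, endpoint, and the toy "named entity" logic are illustrative assumptions, not an actual UIMA protocol.

```python
# Hypothetical sketch of wrapping a Python annotator as an HTTP service.
# Plain JSON stands in for the CAS XMI/JSON wire formats (dkpro-cassis
# would handle real CAS serialization); the payload shape is made up.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def annotate(doc):
    """Toy 'annotator': mark every capitalized word as a NamedEntity,
    returning only offsets rather than echoing the text back."""
    annotations = []
    offset = 0
    for token in doc["text"].split():
        begin = doc["text"].index(token, offset)
        end = begin + len(token)
        offset = end
        if token[0].isupper():
            annotations.append({"type": "NamedEntity",
                                "begin": begin, "end": end})
    return {"annotations": annotations}

class AnnotatorHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        doc = json.loads(self.rfile.read(length))
        body = json.dumps(annotate(doc)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8080), AnnotatorHandler).serve_forever()
```

Inside a Docker container this is the whole deployment story; the Java side only needs an HTTP client and the agreed wire format.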
In terms of wrapping annotators as remote server calls, UIMA enables sending only the annotations needed and receiving back only the annotations requested. This is a major win that can be easily communicated to users. Somebody can run a BERT server that returns either the [CLS] embedding or full per-word embeddings without having to do anything themselves to select one or the other (or both); it is handled by the framework based on the requested feature structures. Also, by using offsets, a client can send 20 MB of text for chapter segmentation and get back just the chapter offsets, without receiving the 20 MB back. (Echoing the full payload back is quite common with simple JSON APIs.) This is the type of thing I do for the products of my company, by the way [1].

I'm interested in a much deeper integration of Python and UIMA. The CAS was a tough sell before, but with the advent of dataframes it is a very natural extension of the dataframe concept to unstructured information. And using managed memory in a language with a very poor garbage collector like Python is also a big win (although I don't know how many Python users will see it that way).

Here is some dream concept code: https://gist.github.com/DrDub/9413410626b5a77d8f1f576f6447d64e (getting the syntax and approach right will take a lot of iterations and consultations, of course). It is predicated on the possibility of some projects embracing UIMA and shipping their own wrappers. I know NLTK considered it at some point [2]. For some projects we might wrap them ourselves (at least at the beginning). The example includes a remote AE doing BERT embeddings, but such AEs are all remote these days given their hardware requirements and long boot times.

Of course, doing a useful pip-installable UIMA package won't be easy, as it will have to ship precompiled binaries for many architectures [3]. There, GitHub Actions might come in very handy [4].

So my goal is better NLP in Python using the existing tools the Python ecosystem has.
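The offsets point above can be sketched in a few lines. Both sides are toy stand-ins (the regex "segmenter" is not a real chapter-segmentation annotator), but they show the shape of the contract: the service returns only (begin, end) spans over the text it received, and the client applies them to its local copy, so the 20 MB never travels back.

```python
# Illustrative sketch of offset-based results: the server returns only
# (begin, end) spans, never the text itself. The regex segmenter is a
# toy stand-in for a real chapter-segmentation annotator.
import re

def segment_chapters(text):
    """Server side: return chapter spans as offsets, not text."""
    starts = [m.start() for m in re.finditer(r"(?m)^Chapter \d+", text)]
    ends = starts[1:] + [len(text)]
    return [{"begin": b, "end": e} for b, e in zip(starts, ends)]

def materialize(text, spans):
    """Client side: apply the returned offsets to the local copy."""
    return [text[s["begin"]:s["end"]] for s in spans]
```

A framework-level version of this is exactly what the CAS gives you for free: feature structures carry offsets, and only the requested ones cross the wire.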
Integration with uimaj is unclear if the message queue infrastructure is not available in UIMA 3. Using an embedded JVM has worked very well for me in the past, and it avoids fiddling with all this C++ complexity [5].

I'd really love to call UIMA Ruta scripts from Python. I feel spaCy rule-based matching wants to do that, but the lack of an abstraction like UIMA stops it in its tracks [6].

P

[1] https://epub-highlighter.com
[2] https://groups.google.com/g/nltk-issues/c/7_OAdglKi8Y/m/sJSHQbJm7tMJ
[3] https://python-packaging-tutorial.readthedocs.io/en/latest/binaries_dependencies.html
[4] https://pythonprogramming.org/automatically-building-python-package-using-github-actions/
[5] http://duboue.net/blog7.html
[6] https://spacy.io/usage/rule-based-matching
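P.S. The embedded-JVM route for calling Ruta from Python could look roughly like the sketch below, using JPype. This is an untested illustration: the jar layout is an assumption, and the uimaFIT/Ruta class names mirror their Java counterparts but have not been exercised from JPype here.

```python
# Hypothetical sketch: running a UIMA Ruta script from Python through an
# embedded JVM via JPype (pip install jpype1). The uimaFIT/Ruta class
# names mirror their Java counterparts but are untested assumptions.
import os

def uima_classpath(lib_dir):
    """Collect every jar under lib_dir for the embedded JVM's classpath."""
    return sorted(
        os.path.join(lib_dir, name)
        for name in os.listdir(lib_dir)
        if name.endswith(".jar")
    )

def run_ruta(rules, text, lib_dir="lib"):
    """Untested illustration of driving RutaEngine from Python."""
    import jpype  # deferred so the sketch is importable without a JVM
    jpype.startJVM(classpath=uima_classpath(lib_dir))
    Factory = jpype.JClass("org.apache.uima.fit.factory.AnalysisEngineFactory")
    JCasFactory = jpype.JClass("org.apache.uima.fit.factory.JCasFactory")
    RutaEngine = jpype.JClass("org.apache.uima.ruta.engine.RutaEngine")
    engine = Factory.createEngine(RutaEngine.class_,
                                  RutaEngine.PARAM_RULES, rules)
    jcas = JCasFactory.createJCas()
    jcas.setDocumentText(text)
    engine.process(jcas)
    return jcas
```

Compared to wrapping C++, the only native dependency here is JPype itself, which already ships wheels for the common platforms.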