On Thu, Dec 29, 2022 at 7:38 AM Richard Eckart de Castilho <r...@apache.org> wrote:
>
> > On 29. Dec 2022, at 13:01, Pablo Duboue <pablo.dub...@gmail.com> wrote:
> >
> > Here is some dream concept code:
> > https://gist.github.com/DrDub/9413410626b5a77d8f1f576f6447d64e (getting
> > the syntax and approach right will take a lot of iterations and
> > consultations of course)
>
> Thanks for the example code :) It has some interesting ideas. I'll
> consider them based on my background with Cassis and UIMA-J.
>
>
> == Type system
>
> I can see that you imagine defining types in a natural pythonic way here.
> For Cassis, we chose a different approach that is based on a type system
> definition (either programmatically [1] created or loaded from XML [2]) and
> then uses factory methods to generate type classes (comparable to JCas
> classes).
>
> We needed the type classes to have special properties and we wanted to be
> able to handle UIMA features like type system merging - so we couldn't go
> with simple Python classes.
>

I tried to make it similar to ORM frameworks in Python, which address a
similar concern. Python is a very dynamic language; it should be possible to
do all the type system merging, etc. over Python classes.

> == Access to CAS contents
>
> Your python code seems inspired by the UIMAv2 CAS index API.
>

Well, that's what UIMA CPP supports.

> UIMAv3 introduces a new "select" API for retrieving FSes from the CAS [3].
> This was inspired by the popular "select" methods of uimaFIT. In cassis, a
> simple version of select has been implemented [4] which feels more like the
> uimaFIT methods than like the V3 select API.
>

Yes, I'm familiar with uimaFIT select.

> Note that Cassis does not support indices or type priorities. To be
> honest, those always seemed to be more in the way than helpful anyway. The
> UIMAv3 select API also by default ignores type priorities (can be turned on
> though for a given select call).
>

Type priorities were indeed a rare bird. But type indices are mighty useful.
So UIMAv3 has no indices at all?
Getting an iterator over annotations that fall inside another annotation is
a very common task (sentences within paragraphs, tokens within sentences,
etc.). It is one of the few constructs that other NLP frameworks provide.

> == Component concept
>
> The Python annotation with component metadata on the analysis engine class
> looks interesting. I wonder if you need the indexes though. Can you not
> work simply with the built-in annotation index?
>

Wouldn't that be slow? Iterating over thousands of annotations for only a
few paragraph annotations? At any rate, UIMA CPP has the indices, so it'll
go very fast.

> == Data mapping
>
> The `wrap` code in there looks very interesting, e.g.
>
> -----
> SetFeature({MyNER.Source: "spaCy"}).wrap(
>     TypeMapper(output={spacy.Sentence: MySentence, spacy.NER: MyNER}).wrap(
>         SpacyAnnotator({"load": "en"})
>     )
> )
> -----
>

Thanks :-) The need for type mapping code arose at a customer site [1] and I
always found it a missing piece in the framework.

P

[1] https://www.javatips.net/api/type-mapper-for-uima-master/src/main/java/com/radialpoint/uima/typemapper/TypeMapper.java
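
P.S. To make the indexing point concrete, here is a minimal sketch in plain
Python of why a sorted index helps when looking up the annotations inside
another annotation. It is not tied to UIMA CPP, Cassis, or any real API:
`Span`, `build_index`, and `covered` are hypothetical names made up for this
illustration, and a real index would also have to deal with type hierarchies
and updates.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation: character offsets plus a label."""
    begin: int
    end: int
    label: str


def build_index(spans):
    """Sort spans by begin offset (longer spans first on ties) and keep the
    begin offsets in a parallel list so they can be binary-searched."""
    ordered = sorted(spans, key=lambda s: (s.begin, -s.end))
    begins = [s.begin for s in ordered]
    return ordered, begins


def covered(index, cover):
    """Return the indexed spans lying entirely inside `cover`.

    A binary search jumps to the first span starting at or after
    cover.begin; the scan stops at the first span starting at or after
    cover.end, so spans elsewhere in the document are never visited."""
    ordered, begins = index
    i = bisect_left(begins, cover.begin)
    result = []
    while i < len(ordered) and begins[i] < cover.end:
        if ordered[i].end <= cover.end:
            result.append(ordered[i])
        i += 1
    return result


tokens = [Span(0, 5, "tok"), Span(6, 11, "tok"),
          Span(13, 17, "tok"), Span(18, 22, "tok")]
token_index = build_index(tokens)
second_sentence = Span(13, 23, "sent")
inside = covered(token_index, second_sentence)
```

With the index built once, each lookup costs a binary search plus a scan over
the spans that start inside the covering annotation, instead of a pass over
every annotation in the document.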