On 6/24/2015 3:12 AM, Richard Eckart de Castilho wrote:
> On 23.06.2015, at 17:11, Marshall Schor <[email protected]> wrote:
>
>> I added a wiki page to develop the ideas here.
>>
>> This is what I got from reading this:
>>
>> One idea is having an annotator not have a type system specification, but rather
>> have it dynamically create types / features according to some configuration info
>> or some dynamically-obtained information (perhaps the results of some previous
>> analysis).
> I think that right now, an annotator doesn't need to have a type system
> specification.

This seems true, especially for "generic" kinds of annotators that aren't tied to particular types/features. Those annotators have the ability to query the TypeSystem, and we have APIs that use indirection in specifying both types and features.

That is, instead of saying instance_of_annotation.setBegin(123), where instance_of_annotation is statically typed to be a (subtype of) annotation and the begin feature is hard-coded directly into the code via "setBegin", you could write someFS.setIntFeature(someIndirectionToAfeature, 123), where someFS could be typed as a generic FeatureStructure, and someIndirectionToAfeature could be of the generic type Feature, set to the particular feature elsewhere.
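
To make the contrast concrete, here is a minimal, self-contained sketch of the two styles. It deliberately does not use the UIMA classes: ToyFeature, ToyFeatureStructure, ToyAnnotation and IndirectionSketch are hypothetical stand-ins invented for this illustration, and only the method names setBegin / setIntFeature mirror the calls mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-ins for illustration only; these are NOT the UIMA classes.
final class ToyFeature {
    final String name;
    ToyFeature(String name) { this.name = name; }
}

class ToyFeatureStructure {
    private final Map<String, Integer> intValues = new HashMap<>();

    // Generic, indirect access: the feature to set is a runtime value.
    void setIntFeature(ToyFeature feature, int value) {
        intValues.put(feature.name, value);
    }

    int getIntFeature(ToyFeature feature) {
        Integer v = intValues.get(feature.name);
        if (v == null) {
            throw new IllegalStateException("feature not set: " + feature.name);
        }
        return v;
    }
}

class ToyAnnotation extends ToyFeatureStructure {
    static final ToyFeature BEGIN = new ToyFeature("begin");

    // Direct access: the feature is hard-coded into the method name.
    void setBegin(int value) { setIntFeature(BEGIN, value); }
    int getBegin() { return getIntFeature(BEGIN); }
}

class IndirectionSketch {
    // A "generic annotator" step: it does not know which feature it sets.
    static void setOffset(ToyFeatureStructure fs, ToyFeature feature, int value) {
        fs.setIntFeature(feature, value);
    }

    public static void main(String[] args) {
        // Style 1: statically typed, feature fixed in the code.
        ToyAnnotation direct = new ToyAnnotation();
        direct.setBegin(123);

        // Style 2: generic FS plus a feature chosen elsewhere (e.g. from configuration).
        ToyFeatureStructure someFS = new ToyAnnotation();
        ToyFeature someIndirectionToAfeature = ToyAnnotation.BEGIN;
        setOffset(someFS, someIndirectionToAfeature, 123);

        // Both routes end up storing the same value.
        System.out.println(direct.getBegin() == someFS.getIntFeature(ToyAnnotation.BEGIN));
    }
}
```

The generic setOffset method never names "begin"; that is what lets such code be reused across types, at the price of one extra lookup per access.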
This indirection has pros and cons. Pros: it enables generic annotators where the types and features are not explicit in the code. Cons: when the types and features are known, annotator code written without the indirection can be easier to read. Indirection can also bring performance / space issues.

> The specification is necessary to create a CAS, but not to create an annotator.
> With uimaFIT, it is common to first create a CAS (based on types automatically
> detected in the classpath), then fill that CAS with some initial information
> (avoiding a reader for easy embedding into an application), and then to pass
> that CAS through an aggregate. While uimaFIT also adds the automatically detected
> types to every analysis engine description in the aggregate, I believe this
> is not really necessary because the CAS has already been initialized.
>
> Independent from that is the problem that the type system is locked after the
> CAS has been created. Engines such as Ruta would profit if the type system would
> at least allow compatible changes such as adding new types or adding new features
> to existing types. The types may not be known at the time the CAS is initialized,
> but only when the CAS is actually being processed.

Some languages (Ruby, JavaScript) allow dynamic modification of classes, so new types can be defined and new features can be added to classes. In fact, I found this web article, which has a very long list of languages (Java not among them) where fields can be added to a class at runtime:
http://rosettacode.org/wiki/Add_a_variable_to_a_class_instance_at_runtime

In Java, you can add classes at runtime, but modifying existing classes (to add additional fields) is not supported.

UIMA's current design (where Java is optional) might be able to be extended to support new types and additional fields, at some cost in performance/space. The recently proposed cas-object design could also partially support this, I think.
(It couldn't support this sequence: 1) create an FS whose type has 3 features, 2) add feature #4, 3) set feature #4 in the already created FS.) More dynamic data structures of course do support this idea of dynamically extensible Types.

Other alternative JCas approaches, which generate a full JCas cover class automatically from the merged type systems, would also have problems with adding features to existing Types, but could define dynamic new types.

Finally, we could modify the Java cover class design to support a hybrid: those things known ahead could be statically typed, and those things added dynamically could be handled with more flexible augmentations embedded into the generated class; maybe this allows the best of both worlds. The usual pros/cons apply.

>> Another idea is having an annotator be able to read Feature Structure data from
>> a wide variety of sources, and have the data include the type/feature metadata
>> (either externally - as we do now in UIMA with a type system external XML
>> specification, or embedded - like JSON would naturally do). Such an annotator
>> would have some notion of the type / feature information it was interested in
>> processing, but could ignore the rest.
> Let's see...
>
> a) easier ingestion of data into feature structures, optimally by automatically
> creating FSes based on a (typed) external data description. E.g. a JSON object
> like
>
> { "fs1": {"feature1": "value1", "feature2": 10 } }
>
> would be converted to a FS with a string feature1 and a numeric feature2.
> However, the type of the FS would basically be underspecified in the type
> system as the next feature structure read could have the same features
> using different value ranges and in fact the type of the FS itself is
> unknown. Sounds as if heading towards some kind of duck-typing e.g. for
> annotations (if it has a begin/end, then it is an annotation).
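
As a rough sketch of what that duck-typing could look like, the following self-contained Java treats an already-parsed JSON object as a plain Map, infers a value range per feature, and applies the "has begin/end, so it is an annotation" test from the quoted text. Everything here is an assumption made for illustration: DuckTypingSketch, inferRange, inferFeatures and looksLikeAnnotation are hypothetical names, the range names are arbitrary labels, and no UIMA or JSON library API is involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of UIMA or any JSON library.
class DuckTypingSketch {

    // Infer a value-range label for a single feature value.
    static String inferRange(Object value) {
        if (value instanceof String) return "String";
        if (value instanceof Integer || value instanceof Long) return "Integer";
        if (value instanceof Double || value instanceof Float) return "Double";
        if (value instanceof Boolean) return "Boolean";
        return "Unknown";
    }

    // Derive a feature-name -> range description from one parsed object.
    static Map<String, String> inferFeatures(Map<String, Object> fs) {
        Map<String, String> ranges = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : fs.entrySet()) {
            ranges.put(e.getKey(), inferRange(e.getValue()));
        }
        return ranges;
    }

    // Duck-typing test: integer begin and end features make it an annotation.
    static boolean looksLikeAnnotation(Map<String, Object> fs) {
        return fs.get("begin") instanceof Integer && fs.get("end") instanceof Integer;
    }

    public static void main(String[] args) {
        // Stands in for the parsed value of "fs1" in the JSON example above.
        Map<String, Object> fs1 = new LinkedHashMap<>();
        fs1.put("feature1", "value1");
        fs1.put("feature2", 10);

        System.out.println(inferFeatures(fs1));       // {feature1=String, feature2=Integer}
        System.out.println(looksLikeAnnotation(fs1)); // false
    }
}
```

Note that the inferred description holds only for the one object inspected; a later object with the same feature names but different value ranges would yield a different description, which is exactly the underspecification raised in the quoted text.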
An interesting thing to observe is that in this direction of "simplicity", the ideas of Views and Sofas and Indexes might be optional? A thought experiment: is there a decomposition for UIMA facilities that can omit these kinds of things if not "needed", yet gradually include this functionality for more complex implementations?

>
> b) the part about the type/feature information that an annotator is interested
> in but being able to ignore the rest I didn't get.

This is the concept (already present in the way UIMA deserializers operate for remote annotators) that when reading an external representation, you don't have to be able to handle all the types and features. You can "ignore" those you don't recognize, and just work with those you're interested in.

>
>> Finally, a third idea is to have the componentization be such that no "UIMA
>> Framework" was needed, or if present, it's hidden. I'm thinking that this
>> means, for simpler analytics, the idea of a pipe line, and combining things,
>> would not be present; it would be more like just a single annotator. For more
>> complex things, the idea of a pipeline would be encapsulated (like UIMA's
>> Aggregates), and the whole thing would look like something that could be
>> embedded, in any of the other "big data" frameworks as an analysis piece. The
>> implication is that this would enable using other frameworks' scaleout
>> mechanisms.
> uimaFIT goes a long way in "hiding" the bulk of the UIMA framework and
> providing a rather sane Java API for pipelines. It makes the creation of a POJO
> wrapper around them a breeze. People do use this to embed UIMA in alternative
> scale-out frameworks such as Hadoop.
>
> Just for the sake of knowing where this is going, assuming the UIMA core API
> as a baseline and the uimaFIT API as an improvement, how would this further
> improvement look like?

It might look like some kind of layering, stripping out complexity (until needed). (See thought experiment, above).
>
> Or would the issue be solvable by integrating uimaFIT into the core (e.g. to
> avoid re-approval of libraries by company legal departments)?

I don't think this integration solves this issue, but integrating uimaFIT into the core seems like a good thing to work on (it's an item in the v3 wiki page).

-Marshall
