Parallelizing annotators to speed up processing, as you describe, sounds attractive, except for all the ways the annotators can conflict with each other and the difficulty of detecting and debugging those conflicts. The current UIMA approach to parallelism is for the flow controller to send a CAS in parallel to multiple annotators and have the results merged back into a single CAS after all are done. All index updates would be isolated from each other and, as is done now, the merging process would detect incompatible changes.
Of course parallel CAS processing is currently only supported for remote annotators, but that could be fixed by making additional in-memory CAS copies [with the turbo-charged CasCopier recently introduced] and creating a new in-memory CAS merger that works like the merger in the CasDeserializer. Given these changes in the core, it would be easy to add support for this in UIMA-AS or any other higher-level threading framework.

Eddie

On Tue, Apr 14, 2015 at 2:24 AM, Nick Hill wrote:

> Quoting Richard Eckart de Castilho:
>
>> For the record, what multi-threading scenarios are we talking about?
>> Concurrent reading or concurrent modifications? If concurrent
>> modifications, what kind of concurrent modifications should be
>> supported (e.g. adding, deletion, changing existing feature values)?
>
> Prior to 2.6 there were issues even with read-only CAS access from
> multiple threads but, as Eddie mentioned, those have now been resolved.
> I'm talking about a more general case with multiple threads reading
> and/or writing from the same CAS. In practical terms this is a statement
> about the indices - i.e. adding/removing/iterating. Application/framework
> level logic would still be required to determine what makes sense to
> parallelize, and I expect this would be based on logical independence.
>
> As an example: say you had an NLP pipeline comprising multiple
> annotators, each reading certain types of feature structures and writing
> other types. Some may depend on others having run before (whose output
> they consume) but many might not (e.g. annotating based on just the doc
> text), or might depend on only one or two having run at the beginning,
> etc. You can see how this could be parallelized according to the
> dependency graph. With a threadsafe CAS this would involve just calling
> process on the same CAS object in multiple threads. If the annotators in
> question were expensive in terms of running time, this could provide a
> big improvement in latency.
>
> I don't think it makes sense for the core thread-safety to extend to
> feature values themselves, since the synchronization semantics would be
> application-specific. An example of this might be a Feature Structure
> with an integer-valued "counter" feature whose value is incremented by
> different annotators. If those annotators were otherwise logically
> independent, custom synchronization would still need to be added around
> access to this feature before they could be run in parallel. This could
> be done in the annotators themselves (e.g. using the FS object in
> question as the monitor) or, if a JCas cover class was always used, an
> increment method could be added to it so that the sync was encapsulated
> in there and the annotators modified to use that method. Another example
> is a CAS linked list (FSList, StringList, etc.) which is appended to by
> multiple annotators. To parallelize these, synchronization would also
> need to be added around the list modification operations (possibly
> encapsulated in higher-level "add to list" functions).
>
> Note that none of these things would affect the ability to continue
> running the same pipeline/annotators in the original single-threaded
> manner (e.g. with a non-threadsafe impl).
>
> It would also mean annotators could be written with the parallelism
> specifically in mind, for example iterating over all annots of a
> particular type and putting them into a queue which is consumed by a
> threadpool of workers which process them, writing other annotations
> directly back to the CAS. Or a fork/join pool to divide and conquer
> annotation of a large document.
>
> Regards,
> Nick
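
As a rough illustration of the "counter" example in the quoted message, here is a minimal Java sketch of encapsulating the synchronization in an increment method. It uses only the JDK: CounterFs is a hypothetical plain-Java stand-in for a JCas cover class (not a UIMA type), runIncrements is a hypothetical helper, and the thread pool tasks merely play the role of logically independent annotators.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CounterFeatureSketch {

    // Hypothetical stand-in for a JCas cover class with an integer "counter"
    // feature. This is NOT a UIMA type; it only models the locking pattern.
    static final class CounterFs {
        private int counter; // plays the role of the feature value

        // The read-modify-write is guarded by the object's own monitor,
        // so callers do not need any locking of their own.
        synchronized void increment() {
            counter = counter + 1;
        }

        synchronized int getCounter() {
            return counter;
        }
    }

    // Runs `annotators` parallel tasks (assumed > 0), each incrementing the
    // shared counter `perAnnotator` times, and returns the final value.
    public static int runIncrements(int annotators, int perAnnotator) {
        CounterFs fs = new CounterFs();
        ExecutorService pool = Executors.newFixedThreadPool(annotators);
        for (int a = 0; a < annotators; a++) {
            pool.execute(() -> {
                for (int i = 0; i < perAnnotator; i++) {
                    fs.increment();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return fs.getCounter();
    }

    public static void main(String[] args) {
        // 4 tasks x 10,000 increments each
        System.out.println(runIncrements(4, 10_000));
    }
}
```

Without the synchronized keyword on increment, concurrent read-modify-write cycles could overwrite each other and the final count could come out lower than the number of increments performed. Keeping the lock inside the cover-class method means the annotators themselves stay unchanged apart from calling increment.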
