Quoting Richard Eckart de Castilho <[email protected]>:
For the record, what multi-threading scenarios are we talking about? Concurrent reading or concurrent modifications? If concurrent modifications, what kind of concurrent modifications should be supported (e.g. adding, deletion, changing existing feature values)?
Prior to 2.6 there were issues even with read-only CAS access from multiple threads, but as Eddie mentioned, those have now been resolved. I'm talking about the more general case of multiple threads reading from and/or writing to the same CAS. In practical terms this is a statement about the indices, i.e. adding, removing and iterating. Application/framework-level logic would still be required to determine what makes sense to parallelize, and I expect this would be based on logical independence.
As an example: say you had an NLP pipeline comprising multiple annotators, each reading certain types of feature structures and writing other types. Some may depend on others having run before them (because they consume their output), but many might not (e.g. those annotating based on just the doc text), or might depend only on one or two having run at the beginning, etc. You can see how this could be parallelized according to the dependency graph. With a thread-safe CAS this would involve just calling process on the same CAS object in multiple threads. If the annotators in question were expensive in terms of running time, this could provide a big improvement in latency.
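To make the dependency-graph idea a bit more concrete, here is a rough sketch in plain Java. It does not use any UIMA API: SharedCas, Annotator and the four annotator names are hypothetical stand-ins I made up for illustration, and the "CAS" is just a concurrent list of type names. It only shows how CompletableFuture could schedule independent annotators in parallel while still honouring the "must run after" edges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class DependencyGraphPipeline {

    /** Hypothetical stand-in for a thread-safe CAS: doc text plus a concurrent "index". */
    static final class SharedCas {
        final String text;
        final List<String> annotations = new CopyOnWriteArrayList<>();
        SharedCas(String text) { this.text = text; }
    }

    /** Hypothetical annotator interface (not the UIMA one). */
    interface Annotator { void process(SharedCas cas); }

    /** Fails fast if an annotator runs before the input type it consumes exists. */
    static void require(SharedCas cas, String type) {
        if (!cas.annotations.contains(type)) {
            throw new IllegalStateException("missing input: " + type);
        }
    }

    static List<String> run(String text) {
        SharedCas cas = new SharedCas(text);

        // Two annotators that need only the doc text, so they are independent.
        Annotator tokenizer = c -> c.annotations.add("Token");
        Annotator languageDetector = c -> c.annotations.add("Language");
        // Consumes Token.
        Annotator posTagger = c -> { require(c, "Token"); c.annotations.add("PosTag"); };
        // Consumes PosTag and Language.
        Annotator entityTagger = c -> {
            require(c, "PosTag");
            require(c, "Language");
            c.annotations.add("Entity");
        };

        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Every stage calls process on the SAME cas object, possibly concurrently.
            CompletableFuture<Void> tokens =
                CompletableFuture.runAsync(() -> tokenizer.process(cas), pool);
            CompletableFuture<Void> language =
                CompletableFuture.runAsync(() -> languageDetector.process(cas), pool);
            CompletableFuture<Void> pos =
                tokens.thenRunAsync(() -> posTagger.process(cas), pool);
            CompletableFuture<Void> entities =
                pos.runAfterBothAsync(language, () -> entityTagger.process(cas), pool);

            entities.join(); // wait for the last node of the graph
            return new ArrayList<>(cas.annotations);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling DependencyGraphPipeline.run("some text") returns the four type names in the order they were added; Token always precedes PosTag and Entity is always last, while the position of Language relative to Token and PosTag can vary between runs.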
I don't think it makes sense for the core thread-safety to extend to feature values themselves, since the synchronization semantics would be application-specific. One example of this might be a Feature Structure with an integer-valued "counter" feature whose value is incremented by different annotators. If those annotators were otherwise logically independent, custom synchronization would still need to be added around access to this feature before they could be run in parallel. This could be done in the annotators themselves (e.g. using the FS object in question as the monitor) or, if a JCas cover class was always used, an increment method could be added to it so that the sync was encapsulated in there, and the annotators modified to use that method. Another example is a CAS linked list (FSList, StringList, etc.) which is appended to by multiple annotators. To parallelize these, synchronization would also need to be added around the list modification operations (possibly encapsulated in higher-level "add to list" functions).
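A minimal sketch of the two options for the "counter" case, again in plain Java with no UIMA classes involved. CounterFs is a hypothetical stand-in for a JCas cover class with an integer feature; its getter, setter and increment method are names I invented. Half of the worker threads lock on the FS object themselves, the other half call the encapsulated increment method. Both end up using the same monitor, so they exclude each other correctly.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class CounterFeatureSync {

    /** Hypothetical stand-in for a JCas cover class with an integer "counter" feature. */
    static final class CounterFs {
        private int counter;

        int getCounter() { return counter; }
        void setCounter(int value) { counter = value; }

        /** Option 2: the read-modify-write is encapsulated here, locking on this FS. */
        synchronized void increment() { counter++; }
    }

    static int run(int threads, int incrementsPerThread) {
        CounterFs fs = new CounterFs();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final boolean useCoverMethod = (t % 2 == 0);
            pool.execute(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    if (useCoverMethod) {
                        fs.increment();
                    } else {
                        // Option 1: the "annotator" uses the FS object as the monitor.
                        synchronized (fs) {
                            fs.setCounter(fs.getCounter() + 1);
                        }
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        synchronized (fs) {
            return fs.getCounter();
        }
    }
}
```

Without the synchronized block and the synchronized method, increments could be lost when two threads read the same old value. The same pattern would apply to an "add to list" helper for the linked-list case.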
Note that none of these things would affect the ability to continue running the same pipeline/annotators in the original single-threaded manner (e.g. with a non-thread-safe implementation).
It would also mean annotators could be written with parallelism specifically in mind. For example, an annotator could iterate over all annotations of a particular type and put them into a queue consumed by a thread pool of workers, which process them and write other annotations directly back to the CAS. Or it could use a fork/join pool to divide and conquer the annotation of a large document.
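The worker-pool variant might look roughly like the following. This is a sketch in plain Java under stated assumptions: Span and the "Sentence"/"Token" type names are hypothetical, the concurrent queue stands in for a thread-safe CAS index, and the executor's internal work queue plays the role of the queue mentioned above. Each sentence becomes one task, and the workers write token spans straight back to the shared index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class WorkerPoolAnnotator {

    /** Hypothetical stand-in for an annotation: a type name plus begin/end offsets. */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    static List<Span> annotate(String text, List<Span> sentences, int workers) {
        // Stand-in for a thread-safe CAS index that workers add to concurrently.
        Queue<Span> index = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> pending = new ArrayList<>();
            // Iterate over the existing annotations and queue one task per sentence.
            for (Span s : sentences) {
                pending.add(pool.submit(() -> {
                    int start = -1;
                    for (int i = s.begin; i <= s.end; i++) {
                        boolean inWord = i < s.end && Character.isLetterOrDigit(text.charAt(i));
                        if (inWord && start < 0) {
                            start = i;                              // a word begins
                        } else if (!inWord && start >= 0) {
                            index.add(new Span("Token", start, i)); // write back
                            start = -1;
                        }
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any worker failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        // Workers finish in arbitrary order, so sort by offset for a stable result.
        List<Span> result = new ArrayList<>(index);
        result.sort(Comparator.comparingInt((Span sp) -> sp.begin));
        return result;
    }
}
```

The fork/join alternative would have the same shape, with a recursive task splitting the document range in half until the pieces are small enough to annotate directly.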
Regards, Nick
