Re: Alternate CAS implementation

Nick Hill Mon, 13 Apr 2015 23:17:28 -0700


Quoting Richard Eckart de Castilho <[email protected]>:

For the record, what multi-threading scenarios are we talking about? Concurrent reading or concurrent modifications? If concurrent modifications, what kind of concurrent modifications should be supported (e.g. adding, deletion, changing existing feature values)?

Prior to 2.6 there were issues even with read-only CAS access from multiple threads, but as Eddie mentioned, those have now been resolved. I'm talking about a more general case with multiple threads reading and/or writing from the same CAS. In practical terms this is a statement about the indices, i.e. adding/removing/iterating. Application/framework level logic would still be required to determine what makes sense to parallelize, and I expect this would be based on logical independence.

As an example: say you had an NLP pipeline comprising multiple annotators, each reading certain types of feature structures and writing other types. Some may depend on others having run before (whose output they consume) but many might not (e.g. annotating based on just the doc text), or might depend on only one or two having run at the beginning, etc. You can see how this could be parallelized according to the dependency graph. With a threadsafe CAS this would involve just calling process on the same CAS object in multiple threads. If the annotators in question were expensive in terms of running time, this could provide a big improvement in latency.
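To make the dependency-graph idea concrete, here is a rough sketch in plain JDK terms. It does not use the UIMA API: SharedCasSketch, the Annotator interface and the three annotators are hypothetical stand-ins, and a CopyOnWriteArrayList plays the role of a threadsafe CAS. Two independent annotators run concurrently, and a third starts only once both have finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a threadsafe CAS: doc text plus a concurrently writable store.
class SharedCasSketch {
    final String docText;
    final List<String> annotations = new CopyOnWriteArrayList<>();

    SharedCasSketch(String docText) {
        this.docText = docText;
    }
}

class DependencyGraphPipeline {
    // Hypothetical annotator interface (assumption; not the real UIMA one).
    interface Annotator {
        void process(SharedCasSketch cas);
    }

    static List<String> run(String text) {
        SharedCasSketch cas = new SharedCasSketch(text);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // These two depend only on the doc text, so they can run in parallel.
            Annotator tokenizer = c -> c.annotations.add("tokens:" + c.docText.split("\\s+").length);
            Annotator langId = c -> c.annotations.add("lang:en");
            // This one consumes the output of both, so it must run after them.
            Annotator summary = c -> c.annotations.add("summary-of:" + c.annotations.size());

            CompletableFuture<Void> t = CompletableFuture.runAsync(() -> tokenizer.process(cas), pool);
            CompletableFuture<Void> l = CompletableFuture.runAsync(() -> langId.process(cas), pool);
            CompletableFuture<Void> s =
                    CompletableFuture.allOf(t, l).thenRunAsync(() -> summary.process(cas), pool);
            s.join(); // wait for the whole graph to complete
            return new ArrayList<>(cas.annotations);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run("a threadsafe CAS example"));
    }
}
```

The order of the first two entries is not deterministic, which is the point: only the declared dependency is enforced.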

I don't think it makes sense for the core thread-safety to extend to feature values themselves, since the synchronization semantics would be application-specific. One example is a Feature Structure with an integer-valued "counter" feature whose value is incremented by different annotators. If those annotators were otherwise logically independent, custom synchronization would still need to be added around access to this feature before they could be run in parallel. This could be done in the annotators themselves (e.g. using the FS object in question as the monitor) or, if a JCas cover class was always used, an increment method could be added to it so that the sync was encapsulated in there and the annotators modified to use that method. Another example is a CAS linked list (FSList, StringList, etc.) which is appended to by multiple annotators. To parallelize these, synchronization would also need to be added around the list modification operations (possibly encapsulated in higher level "add to list" functions).
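A minimal sketch of the two counter options, again in plain Java rather than real UIMA/JCas code. CounterFsSketch is a hypothetical stand-in for a JCas cover class, and incrementCounter() is the assumed helper method. Half of the threads synchronize on the FS object themselves, and the other half call the encapsulated method; both use the same monitor, so they do not lose updates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a JCas cover class with an integer "counter" feature.
class CounterFsSketch {
    private int counter;

    // Plain accessors, unsynchronized like ordinary feature getters/setters.
    int getCounter() {
        return counter;
    }

    void setCounter(int value) {
        counter = value;
    }

    // Option 2: the read-modify-write is encapsulated under this object's monitor.
    synchronized int incrementCounter() {
        return ++counter;
    }
}

class CounterDemo {
    static int run(int threads, int incrementsPerThread) {
        CounterFsSketch fs = new CounterFsSketch();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            final boolean useHelper = (i % 2 == 0);
            Thread t = new Thread(() -> {
                for (int n = 0; n < incrementsPerThread; n++) {
                    if (useHelper) {
                        fs.incrementCounter();
                    } else {
                        // Option 1: the annotator uses the FS object itself as the monitor.
                        synchronized (fs) {
                            fs.setCounter(fs.getCounter() + 1);
                        }
                    }
                }
            });
            workers.add(t);
            t.start();
        }
        try {
            for (Thread t : workers) {
                t.join(); // join also makes the final counter value visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return fs.getCounter();
    }

    public static void main(String[] args) {
        System.out.println(run(8, 10_000));
    }
}
```

The same pattern would apply to the "add to list" case: wrap the list modification in one method and synchronize there.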

Note that none of these things would affect the ability to continue running the same pipeline/annotators in the original single-threaded manner (e.g. with a non-threadsafe impl).

It would also mean annotators could be written with the parallelism specifically in mind, for example iterating over all annots of a particular type and putting them into a queue which is consumed by a threadpool of workers which process them, writing other annotations directly back to the CAS. Or a fork/join pool to divide and conquer annotation of a large document.
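The queue-plus-workers variant might look roughly like this. It is a sketch under assumptions, not UIMA code: the input list stands in for iterating annotations of one type, the executor's internal task queue is the queue, and a ConcurrentLinkedQueue stands in for writing back to a threadsafe CAS. The "len=" annotation is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class WorkerPoolSketch {
    static List<String> annotate(List<String> sentences, int workers) {
        // Stand-in for the shared, threadsafe CAS that workers write back to.
        Queue<String> results = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // "Iterate over all annotations of a type" and enqueue one task per item.
        for (String s : sentences) {
            pool.execute(() -> results.add("len=" + s.length() + ":" + s));
        }

        pool.shutdown(); // no new tasks; queued ones still run
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        List<String> sentences = new ArrayList<>();
        sentences.add("First sentence.");
        sentences.add("Second one.");
        System.out.println(annotate(sentences, 2));
    }
}
```

Result order depends on scheduling, so anything that needs document order would have to sort afterwards (or rely on the indices).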


Regards,
Nick
