Re: Alternate CAS implementation

Eddie Epstein Wed, 15 Apr 2015 12:30:38 -0700

Parallelizing annotators to speed up processing as you describe sounds
attractive, except for all the ways the annotators can conflict with each
other and the difficulty in detecting/debugging. The current UIMA approach
to parallelism is for the flow controller to send a CAS in parallel to
multiple annotators and let the results be merged back into a single CAS
after all are done. All index updates would be isolated from each other,
and, as is done now, the merging process would detect incompatible changes.


Of course, parallel CAS processing is currently only supported for remote
annotators, but that could be fixed by making additional in-memory CAS
copies [with the turbocharged CasCopier recently introduced] and creating
a new in-memory CAS merger that works like the merger in the
CasDeserializer. Given these changes in the core, it would be easy to add
support for this in UIMA-AS or any other higher-level threading framework.
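
As a rough illustration of the copy-and-merge idea above, here is a minimal
Java sketch. It deliberately does not use any UIMA classes: a plain Map
stands in for the CAS, a UnaryOperator stands in for an annotator, and the
names ParallelMergeSketch and runParallel are made up for this example. It
only shows the shape of "copy per annotator, run in parallel, merge and
reject incompatible changes"; it ignores deletions and is not how CasCopier
or the CasDeserializer merger is implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

// Hypothetical stand-in types: Map<String, String> plays the role of a CAS,
// UnaryOperator<Map<String, String>> plays the role of an annotator.
class ParallelMergeSketch {

    static Map<String, String> runParallel(
            Map<String, String> original,
            List<UnaryOperator<Map<String, String>>> annotators) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, annotators.size()));
        try {
            // Each annotator works on its own private copy, so its updates
            // are isolated from the others while it runs.
            List<Future<Map<String, String>>> futures = new ArrayList<>();
            for (UnaryOperator<Map<String, String>> annotator : annotators) {
                Map<String, String> copy = new HashMap<>(original);
                futures.add(pool.submit(() -> annotator.apply(copy)));
            }

            // Merge the results back one at a time, on the calling thread.
            Map<String, String> merged = new HashMap<>(original);
            for (Future<Map<String, String>> future : futures) {
                for (Map.Entry<String, String> e : future.get().entrySet()) {
                    String base = original.get(e.getKey());
                    if (Objects.equals(base, e.getValue())) {
                        continue; // this annotator did not change the entry
                    }
                    String already = merged.get(e.getKey());
                    boolean changedByAnother = !Objects.equals(already, base);
                    if (changedByAnother && !Objects.equals(already, e.getValue())) {
                        throw new IllegalStateException(
                                "Incompatible changes to " + e.getKey());
                    }
                    merged.put(e.getKey(), e.getValue());
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

Two annotators that write different keys merge cleanly; two that write
different values for the same key make the merge step throw, which is the
point at which the conflict becomes visible.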

Eddie



On Tue, Apr 14, 2015 at 2:24 AM, Nick Hill <[email protected]> wrote:

>
> Quoting Richard Eckart de Castilho <[email protected]>:
>
>> For the record, what multi-threading scenarios are we talking about?
>> Concurrent reading or concurrent modifications? If concurrent
>> modifications, what kind of concurrent modifications should be supported
>> (e.g. adding, deletion, changing existing feature values)?
>>
>>
> Prior to 2.6 there were issues even with read-only CAS access from
> multiple threads but, as Eddie mentioned, those have now been resolved. I'm
> talking about a more general case with multiple threads reading from and/or
> writing to the same CAS. In practical terms this is a statement about the
> indices, i.e. adding/removing/iterating. Application/framework-level logic
> would still be required to determine what makes sense to parallelize, and I
> expect this would be based on logical independence.
>
> As an example: say you had an NLP pipeline comprising multiple annotators,
> each reading certain types of feature structures and writing other types.
> Some may depend on others having run before (whose output they consume) but
> many might not (e.g. annotating based on just the doc text), or might
> depend on only one or two having run at the beginning, etc. You can see how
> this could be parallelized according to the dependency graph. With a
> threadsafe CAS this would involve just calling process on the same CAS
> object in multiple threads. If the annotators in question were expensive in
> terms of running time, this could provide a big improvement in latency.
>
> I don't think it makes sense for the core thread-safety to extend to
> feature values themselves since the synchronization semantics would be
> application-specific. Examples of this might be a Feature Structure with an
> integer-valued "counter" feature whose value is incremented by different
> annotators. If those annotators were otherwise logically independent,
> custom synchronization would still need to be added around access to this
> feature before they could be run in parallel. This could be done in the
> annotators themselves (e.g. using the FS object in question as the monitor)
> or if a JCas cover class was always used, an increment method could be
> added to it so that the sync was encapsulated in there and the annotators
> modified to use that method. Another example is a CAS linked list (FSList,
> StringList, etc.) which is appended to by multiple annotators. To
> parallelize these, synchronization would also need to be added around the
> list modification operations (possibly encapsulated in higher-level
> "add to list" functions).
>
> Note that none of these things would affect the ability to continue
> running the same pipeline/annotators in the original single-threaded manner
> (e.g. with a non-threadsafe impl).
>
> It would also mean annotators could be written with the parallelism
> specifically in mind, for example iterating over all annots of a particular
> type and putting them into a queue which is consumed by a threadpool of
> workers which process them writing other annotations directly back to the
> CAS. Or a fork/join pool to divide and conquer annotation of a large
> document.
>
> Regards,
> Nick
>
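
For the "counter" example in the quoted message, the suggestion of putting
the synchronization inside an increment method might look roughly like the
sketch below. CounterFs is a hypothetical plain Java class standing in for
a JCas cover class; it is not generated by JCasGen and uses no UIMA API.
The runIncrements helper is likewise invented here, only to show several
threads sharing one object.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a JCas cover class with an integer "counter"
// feature. The synchronization lives in the class, so callers never
// perform an unsynchronized read-modify-write on the feature.
class CounterFs {
    private int counter;

    // The object itself is the monitor, as suggested in the message.
    synchronized void increment() {
        counter++;
    }

    synchronized int getCounter() {
        return counter;
    }

    // Runs `threads` workers that each call increment() `perThread` times
    // on one shared CounterFs and returns the final counter value.
    static int runIncrements(int threads, int perThread)
            throws InterruptedException {
        CounterFs fs = new CounterFs();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    fs.increment();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return fs.getCounter();
    }
}
```

Without the synchronized keyword on increment(), concurrent updates could
be lost, which is why the feature-level locking has to come from the
application rather than from the CAS itself.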
