On 6/16/2010 10:22 AM, Erwan Moreau wrote:
> Hi,
>
> Thanks for the answer.
>
>> Hi,
>>
>> The CAS is not designed for concurrent access, to my knowledge, but
>> perhaps others can comment more on this.
>>
> I'd like to know more about that, because imho this is a quite strong
> limitation: maybe naively, I used to think that using concurrent access
> only for reading was safe, since most concurrency problems occur when
> threads can also write the shared object?
>
This design, as I recall, was a performance trade-off: we decided to
give fast CAS access higher priority than allowing multi-threaded
access, especially since the known and imagined use-cases had multiple
threads using separate CAS objects.

Another factor here was the design of annotators - these are typically
user-written code, done by algorithm experts, not necessarily software
engineers experienced in the nuances of multi-threaded applications.
So, we run each annotator instance on just one thread; again, scale-out
is done by instantiating multiple instances of annotators. So the
annotators don't have to be "thread-safe" (except for static data,
which is shared among the threads).

>> Most scale-out use-cases are designs which also scale out the CASes. We
>> would be interested in hearing about a use case which motivates
>> multi-threaded access to a single CAS.
>>
> Indeed, my use-case probably does not correspond to what UIMA is
> intended for. I must explain a bit the context: we are actually building
> wrapper annotators for external programs called through a ProcessBuilder
> object (yes, the dirty "exec" call). We are aware of the problems that
> this implies, and ideally we would have re-coded our tools from scratch
> as UIMA annotators or used the C++ framework. Nevertheless we decided that
> was the best choice, because our team owns a few complex NLP tools which
> are the core of our work and would be very costly to migrate; so we want
> to provide quite quickly a way to use them in a UIMA environment so that
> people start using UIMA when creating higher level components (and maybe
> these core components will be migrated later).
>

OK. UIMA has a C++ framework as well, if and when you get around to
migrating your components.

> In this context, we try to provide an "as safe and efficient as
> possible" framework in which these programs are called inside an
> annotator.
> That is why we use threads to provide the input stream and
> read the output stream. In order to avoid wasting time and space, our
> threads use Reader and Writer objects so that data is transmitted on the
> fly to/from the process (inside the process method). Thus concurrent
> access to the CAS is required when the Writer object that provides the
> stdin stream is still reading annotations, while the Reader object has
> already started to re-align the program output with the CAS content. Of
> course no concurrency problem occurs if the input/output are transmitted
> as simple String objects or as files, but that is clearly less efficient
> (and not safer, as far as I know).
>

Seems that an easy work-around would be to have your reader and writer
threads synchronize on their access to the CAS. If we implemented
concurrent access, this is what we would have to do, inside the CAS
itself.

When new data are added to the CAS, indexes are often updated. If these
are concurrently being accessed, *bad things* can happen, which is
probably what's happening in your case.

The CAS is used as a "unit-of-work" in many places in UIMA, as well. If
you used it for this purpose, then a workflow might be: have the Writer
write to the process, so the process gets all its inputs, then have the
Reader read the results from the process. For scale-out, have multiple
CASes. Would this work in your use case?

-Marshall

> I don't know whether there can be more standard use-cases using threads.
> Nevertheless the problem would be the same if the black box was not an
> external program but any piece of code that can not be modified and
> behaves like a pipe.
>
> Erwan
>
>> -Marshall
>>
>> On 6/15/2010 1:35 PM, Erwan Moreau wrote:
>>
>>> Hello,
>>>
>>> I experience problems using several threads which read annotations in
>>> the same (default) CAS index, inside the same call to the process
>>> method.
>>> Since I'm new to UIMA I'm not sure how to interpret that: normal
>>> behaviour due to wrong usage or bug? The exception stack is:
>>>
>>> java.lang.IndexOutOfBoundsException: Index: 0, Size: 3
>>>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>>     at java.util.ArrayList.get(ArrayList.java:322)
>>>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.initPointerIterator(FSIndexRepositoryImpl.java:628)
>>>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.<init>(FSIndexRepositoryImpl.java:636)
>>>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.<init>(FSIndexRepositoryImpl.java:612)
>>>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl.createPointerIterator(FSIndexRepositoryImpl.java:158)
>>>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl$IndexImpl.iterator(FSIndexRepositoryImpl.java:792)
>>>     at org.apache.uima.cas.impl.AnnotationIndexImpl.iterator(AnnotationIndexImpl.java:97)
>>>     at fr.lipn.uima.testing.TestConcurrentCASAccesAE.getFSIterator(TestConcurrentCASAccesAE.java:59)
>>>
>>> I managed to isolate the problem and wrote a simple AE to explain/show
>>> it (attached).
>>>
>>> Thanks for your help (and sorry if I missed something in the doc!)
>>>
>>> Erwan
>>>
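
For reference, the work-around suggested above (reader and writer threads
synchronizing on their access to the CAS) could look roughly like the
sketch below. It is only an illustration under stated assumptions:
FakeCas is a hypothetical stand-in for the CAS and its annotation index,
not a UIMA class, and the two thread bodies are placeholders for the code
that feeds the process's stdin and re-aligns its stdout. In a real
annotator the lock object would presumably be the CAS passed to process().

```java
import java.util.ArrayList;
import java.util.List;

class SynchronizedCasAccessSketch {

    /** Hypothetical stand-in for a CAS and its annotation index (NOT a UIMA class). */
    static final class FakeCas {
        private final List<String> annotations = new ArrayList<>();

        void add(String annotation) { annotations.add(annotation); }

        List<String> annotations() { return annotations; }
    }

    /** Runs one "writer" and one "reader" thread against a shared FakeCas. */
    static int run(int inputCount) {
        final FakeCas cas = new FakeCas();
        for (int i = 0; i < inputCount; i++) {
            cas.add("token-" + i);
        }

        // Plays the role of the thread that reads annotations and feeds stdin.
        Thread writer = new Thread(() -> {
            for (int pass = 0; pass < 50; pass++) {
                synchronized (cas) {          // iterate only while holding the lock
                    int chars = 0;
                    for (String a : cas.annotations()) {
                        chars += a.length();  // placeholder for "send to process"
                    }
                }
            }
        });

        // Plays the role of the thread that reads stdout and adds new annotations.
        Thread reader = new Thread(() -> {
            for (int i = 0; i < inputCount; i++) {
                synchronized (cas) {          // update the index under the same lock
                    cas.add("result-" + i);
                }
            }
        });

        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        synchronized (cas) {
            return cas.annotations().size();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(1000));  // prints 2000
    }
}
```

Both threads lock on the same object, so an iteration never overlaps with
an update. Without the lock, the writer's loop over the ArrayList could
overlap with add() in the reader and fail, much like the index iterator in
the stack trace above. Holding the lock for a whole iteration is coarse,
but it keeps the sketch simple.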
