Re: Concurrent access to CAS index

Erwan Moreau Thu, 24 Jun 2010 12:12:26 -0700

Hi,

On 6/16/2010 19:22, Erwan Moreau wrote:

Seems that an easy work-around would be to have your reader and writer
threads synchronize on their access to the CAS.  If we implemented
concurrent access, this is what we would have to do, inside the CAS
itself.

When new data are added to the CAS, indexes are often updated. If these
are concurrently being accessed, *bad things* can happen, which is
probably what's happening in your case.

Well, not exactly because I do not *write* any data in the CAS: threads
only read the annotations contained in the CAS, and in my real
annotators data is written in the CAS after all threads have terminated.
I'm not expert in thread-safety so I might miss something, but at first
sight I don't understand how concurrent read access can fail? (though I
must admit I did not try to study the source code in the
FSIndexRepositoryImpl class)


I agree, this should be possible.  I'll take a look sometime
when our build has stabilized.

It may have to do with the way our internal iterator cache
works.  What you could try to do is this: create one iterator
of every type you're interested in, in a sequential manner.
You don't need to use them.  Then try your concurrent access
again.  No guarantees though, I didn't even look at the code.

--Thilo

I was curious, so I investigated more deeply the problems which arise
when reading simultaneously in the CAS. I give my conclusions below, in
the hope they can be useful for future implementations. I'm sorry I have
only run tests using sources from release 2.3.0. Please tell me if I
should do something else (more details, my testing environment etc.).


There are actually two places where things can go wrong:

1) Creating iterators simultaneously can either corrupt the data (the
annotations read do not correspond to the real ones), or sometimes cause
the following exception:

java.lang.IndexOutOfBoundsException: Index: 0, Size: 6
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.initPointerIterator(FSIndexRepositoryImpl.java:628)
   at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.<init>(FSIndexRepositoryImpl.java:636)
   at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.<init>(FSIndexRepositoryImpl.java:612)
   at org.apache.uima.cas.impl.FSIndexRepositoryImpl.createPointerIterator(FSIndexRepositoryImpl.java:158)
   at org.apache.uima.cas.impl.FSIndexRepositoryImpl$IndexImpl.iterator(FSIndexRepositoryImpl.java:792)
   at org.apache.uima.cas.impl.AnnotationIndexImpl.iterator(AnnotationIndexImpl.java:97)
   at erwan.TestConcurrentCASAccesAE.readSomeAnnotations(TestConcurrentCASAccesAE.java:205)
   at erwan.TestConcurrentCASAccesAE$CASReaderThread.run(TestConcurrentCASAccesAE.java:290)
   at java.lang.Thread.run(Thread.java:619)

I succeeded in solving this problem in the following different ways:

- (user side) by creating the iterators sequentially before starting the
  threads, or by synchronizing the calls.
- (uima side) in the org.apache.uima.cas.impl.FSIndexRepositoryImpl
  class: the problem is due to the fact that, in the
  createPointerIterator methods, 1) the call to
  iicp.createIndexIteratorCache() creates some data (I don't really know
  what I'm talking about actually!) which is stored in the iicp object,
  2) then the initPointerIterator method (called by
  new [Leaf]PointerIterator(iicp)) reads this data, which may have been
  modified in the "concurrent access" case. Thus I tested passing this
  object explicitly (iicp.createIndexIteratorCache() returning an
  ArrayList<FSLeafIndexImpl> object and other methods receiving it as a
  parameter), and that works fine (this error does not appear anymore,
  tested with my test case over more than 200000 runs).

2) Calling the next() (or hasNext()) method simultaneously (on two
different FSIterator objects, of course) causes exceptions like the
following:

java.lang.ArrayIndexOutOfBoundsException: 1381
   at org.apache.uima.jcas.impl.JCasHashMap.get(JCasHashMap.java:117)
   at org.apache.uima.jcas.impl.JCasImpl.getJfsFromCaddr(JCasImpl.java:1044)
   at org.apache.uima.jcas.impl.JCasImpl$JCasFsGenerator.createFS(JCasImpl.java:830)
   at org.apache.uima.cas.impl.CASImpl.ll_getFSForRef(CASImpl.java:3106)
   at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:1762)
   at org.apache.uima.cas.impl.FSIteratorWrapper.get(FSIteratorWrapper.java:48)
   at org.apache.uima.cas.impl.FSIteratorImplBase.next(FSIteratorImplBase.java:67)
   at org.apache.uima.cas.impl.FSIteratorImplBase.next(FSIteratorImplBase.java:33)
   at erwan.TestConcurrentCASAccesAE.getNextAnnotation(TestConcurrentCASAccesAE.java:130)
   at erwan.TestConcurrentCASAccesAE.readSomeAnnotations(TestConcurrentCASAccesAE.java:213)
   at erwan.TestConcurrentCASAccesAE$CASReaderThread.run(TestConcurrentCASAccesAE.java:290)
   at java.lang.Thread.run(Thread.java:619)

These errors happen even if the iterators had been created before
starting the threads. As you told me, this one is certainly due to the
caching strategy. The comments in the org.apache.uima.jcas.impl.JCasImpl
class are clear about the fact that the implementation is intended to be
single-threaded (although that point is not documented in the API, I
think). Once again I ran some tests:

- on the user side, the problem can indeed be solved by synchronizing
  each call to next().
- I have also tested a simple modification in the
  org.apache.uima.jcas.impl.JCasImpl class: in the
  JCasFsGenerator.createFS method, removing the call to
  jcasView.putJfsFromCaddr(addr, fs) solves the problem (also tested over
  more than 200000 test runs without error); I guess that corresponds
  roughly to disabling the caching strategy, since nothing is written to
  it anymore (?). I don't know what the performance consequences of such
  a modification are, but maybe an option could be proposed to disable
  the cache?

IMHO it could at least be documented that this class is not thread-safe,
because it seems quite unusual to me to have to synchronize the calls to
next().
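For the user-side workaround, a rough sketch of what I mean by synchronizing the calls is given here. It uses plain java.util.Iterator as a stand-in for FSIterator, and the wrapper class LockedIterator is a hypothetical helper, not something UIMA provides. The assumption is that every reader thread wraps its own iterator but passes the same lock object (one per CAS), so that only one thread at a time reaches the shared cache behind next().

```java
import java.util.Iterator;

// Hypothetical helper, NOT part of UIMA: serializes hasNext()/next()
// across all iterators that were given the same lock object.
final class LockedIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final Object lock; // assumed to be shared by all readers of one CAS

    LockedIterator(Iterator<T> delegate, Object lock) {
        this.delegate = delegate;
        this.lock = lock;
    }

    @Override
    public boolean hasNext() {
        synchronized (lock) {
            return delegate.hasNext();
        }
    }

    @Override
    public T next() {
        synchronized (lock) {
            return delegate.next();
        }
    }
}
```

Each thread would then do something like new LockedIterator<>(myIterator, casLock), where casLock is the single lock object shared by all threads. This obviously costs some parallelism, since the threads queue up on every call.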


thanks for your work!
Erwan
