Hi,
On 6/16/2010 19:22, Erwan Moreau wrote:
Seems that an easy work-around would be to have your reader and writer
threads synchronize on their access to the CAS. If we implemented
concurrent access, this is what we would have to do, inside the CAS
itself.
When new data are added to the CAS, indexes are often updated. If these
are concurrently being accessed, *bad things* can happen, which is
probably what's happening in your case.
Well, not exactly, because I do not *write* any data to the CAS: threads
only read the annotations contained in the CAS, and in my real
annotators data is written to the CAS after all threads have terminated.
I'm not an expert in thread safety, so I might be missing something, but
at first sight I don't understand how concurrent read access can fail
(though I must admit I did not try to study the source code of the
FSIndexRepositoryImpl class).
I agree, this should be possible. I'll take a look sometime
when our build has stabilized.
It may have to do with the way our internal iterator cache
works. What you could try to do is this: create one iterator
of every type you're interested in, in a sequential manner.
You don't need to use them. Then try your concurrent access
again. No guarantees though, I didn't even look at the code.
--Thilo
I was curious, so I investigated more deeply the problems that arise
when several threads read the CAS simultaneously. My conclusions are
below, in the hope that they can be useful for future implementations.
Note that I have only run tests using the sources from release 2.3.0.
Please tell me if I should provide anything else (more details, my
testing environment, etc.).
There are actually two places where things can go wrong:
1) Creating iterators simultaneously can either corrupt the data (the
annotations read do not correspond to the real ones) or sometimes cause
the following exception:
java.lang.IndexOutOfBoundsException: Index: 0, Size: 6
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.initPointerIterator(FSIndexRepositoryImpl.java:628)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.<init>(FSIndexRepositoryImpl.java:636)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl$LeafPointerIterator.<init>(FSIndexRepositoryImpl.java:612)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl.createPointerIterator(FSIndexRepositoryImpl.java:158)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl$IndexImpl.iterator(FSIndexRepositoryImpl.java:792)
    at org.apache.uima.cas.impl.AnnotationIndexImpl.iterator(AnnotationIndexImpl.java:97)
    at erwan.TestConcurrentCASAccesAE.readSomeAnnotations(TestConcurrentCASAccesAE.java:205)
    at erwan.TestConcurrentCASAccesAE$CASReaderThread.run(TestConcurrentCASAccesAE.java:290)
    at java.lang.Thread.run(Thread.java:619)
I was able to solve this problem in the following ways:
- (user side) by creating the iterators sequentially before starting the
threads, or by synchronizing the calls that create them.
- (uima side) in the org.apache.uima.cas.impl.FSIndexRepositoryImpl
class: the problem comes from the createPointerIterator methods, where
(a) the call to iicp.createIndexIteratorCache() creates some data (I
admit I don't fully understand what it is) which is stored in the iicp
object, and (b) the initPointerIterator method (called by new
[Leaf]PointerIterator(iicp)) then reads this data, which another thread
may have modified in the meantime. So I tested passing this object
explicitly instead (iicp.createIndexIteratorCache() returns an
ArrayList<FSLeafIndexImpl> object and the other methods receive it as a
parameter), and that works fine: the error no longer appears in my test
case over more than 200000 runs.
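To illustrate the UIMA-side change described in the second bullet, here is a minimal sketch of the general pattern only. It does not reproduce the FSIndexRepositoryImpl code: CacheHolder, rebuildShared, readShared, buildLocal and allListsComplete are hypothetical names, and a list of integers stands in for the ArrayList<FSLeafIndexImpl>. The idea is that a list built into a local variable and returned to the caller cannot be replaced or refilled by another thread, whereas a list parked in a shared field and read back later can.

```java
import java.util.ArrayList;
import java.util.List;

class IteratorCacheSketch {

    /** Hypothetical stand-in for the iicp object; NOT the real UIMA class. */
    static final class CacheHolder {
        // Racy shape: the freshly built list is parked in a shared field...
        private List<Integer> shared = new ArrayList<>();

        void rebuildShared(int n) {
            shared = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                shared.add(i);
            }
        }

        // ...and read back later, possibly after another thread rebuilt it.
        List<Integer> readShared() {
            return shared;
        }

        // Safer shape: build into a local list and return it to the caller,
        // so no other thread ever holds a reference to this particular list.
        List<Integer> buildLocal(int n) {
            List<Integer> local = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                local.add(i);
            }
            return local;
        }
    }

    /** Calls buildLocal from several threads; true if every thread got a complete list. */
    static boolean allListsComplete(int threads, int n) {
        final CacheHolder holder = new CacheHolder();
        final boolean[] ok = new boolean[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int idx = t;
            workers[t] = new Thread(() -> {
                List<Integer> mine = holder.buildLocal(n);
                ok[idx] = mine.size() == n;
            });
            workers[t].start();
        }
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            if (!ok[t]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allListsComplete(8, 1000));
    }
}
```

The racy methods are kept only to show the contrast; the multi-threaded check exercises the local-return variant, which needs no lock because nothing is shared.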
2) Calling the next() (or hasNext()) method simultaneously (on two
different FSIterator objects, of course) causes exceptions like the
following:
java.lang.ArrayIndexOutOfBoundsException: 1381
    at org.apache.uima.jcas.impl.JCasHashMap.get(JCasHashMap.java:117)
    at org.apache.uima.jcas.impl.JCasImpl.getJfsFromCaddr(JCasImpl.java:1044)
    at org.apache.uima.jcas.impl.JCasImpl$JCasFsGenerator.createFS(JCasImpl.java:830)
    at org.apache.uima.cas.impl.CASImpl.ll_getFSForRef(CASImpl.java:3106)
    at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:1762)
    at org.apache.uima.cas.impl.FSIteratorWrapper.get(FSIteratorWrapper.java:48)
    at org.apache.uima.cas.impl.FSIteratorImplBase.next(FSIteratorImplBase.java:67)
    at org.apache.uima.cas.impl.FSIteratorImplBase.next(FSIteratorImplBase.java:33)
    at erwan.TestConcurrentCASAccesAE.getNextAnnotation(TestConcurrentCASAccesAE.java:130)
    at erwan.TestConcurrentCASAccesAE.readSomeAnnotations(TestConcurrentCASAccesAE.java:213)
    at erwan.TestConcurrentCASAccesAE$CASReaderThread.run(TestConcurrentCASAccesAE.java:290)
    at java.lang.Thread.run(Thread.java:619)
These errors happen even if the iterators were created before starting
the threads.
As you told me, this one is certainly due to the caching strategy. The
comments in the org.apache.uima.jcas.impl.JCasImpl class are clear about
the fact that the implementation is intended to be single-threaded
(although I think that point is not documented in the API). Once again I
ran some tests:
- on the user side, the problem can indeed be solved by synchronizing
each call to next().
- I have also tested a simple modification in the
org.apache.uima.jcas.impl.JCasImpl class: in the
JCasFsGenerator.createFS method, removing the call to
jcasView.putJfsFromCaddr(addr, fs) solves the problem (also tested over
more than 200000 runs without error). I guess that roughly corresponds
to disabling the caching strategy, since nothing is written to the cache
anymore (?). I don't know what the performance consequences of such a
modification are, but maybe an option could be offered to disable the
cache? IMHO it should at least be documented that this class is not
thread-safe, because it seems quite unusual to me to have to synchronize
the calls to next().
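For reference, the user-side workaround (synchronizing each call to next() and hasNext()) could be sketched roughly as follows. This is not UIMA code: LockedIterator and countAll are hypothetical names, and a plain java.util.Iterator over a List stands in for an FSIterator, on the assumption that the real iterators could be wrapped the same way. Note that all wrappers share one lock, because the state being protected is the cache shared by every iterator on the same CAS, not the state of any single iterator.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class LockedIteratorSketch {

    /** Hypothetical wrapper: every call on the delegate holds a shared lock. */
    static final class LockedIterator<T> implements Iterator<T> {
        private final Iterator<T> delegate;
        private final Object lock;

        LockedIterator(Iterator<T> delegate, Object lock) {
            this.delegate = delegate;
            this.lock = lock;
        }

        @Override
        public boolean hasNext() {
            synchronized (lock) {
                return delegate.hasNext();
            }
        }

        @Override
        public T next() {
            synchronized (lock) {
                return delegate.next();
            }
        }
    }

    /** Each thread drains its own iterator; returns the total number of elements read. */
    static int countAll(List<Integer> data, int threads) {
        final Object sharedLock = new Object();   // one lock for all readers
        final AtomicInteger total = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            // Iterators are created sequentially, before the threads start.
            final Iterator<Integer> it = new LockedIterator<>(data.iterator(), sharedLock);
            workers[t] = new Thread(() -> {
                while (it.hasNext()) {
                    it.next();
                    total.incrementAndGet();
                }
            });
        }
        for (Thread w : workers) {
            w.start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(countAll(Arrays.asList(1, 2, 3), 4));
    }
}
```

The obvious cost is that the readers are serialized on every step of the iteration, so this only pays off if the work done between two calls to next() dominates.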
Thanks for your work!
Erwan