Thanks for explaining this "use case". I was a bit unclear on the two instances of deserialization time. One (the 70%) was xmi, the other (2%) was S+. From reading the email chain, it seems S+ is the "CasCompleteSerializer", which switches to plain binary mode, so you would avoid the XML parsing overhead.
But I think both deserializations would have the same issue around "allow_dups", if that was where the substantial part of the slowdown was coming from, since both would add all those annotations to the index. Perhaps that was another use case though... Am I mixing these up?

-Marshall

On 1/7/2016 2:38 AM, Peter Klügl wrote:
> Hi,
>
> rule engineering in ruta follows a paradigm where you spam annotations,
> also with same type and offsets. Additionally, and probably the most
> critical case, the explanation of the rule inference creates those
> annotations for debugging. Imagine something like an annotation of the
> type "Debug" for every text span where some rule tried to match (and did
> not even succeed).
>
> In one of my use cases, the time spent in deserialization in the CAS
> Editor was reduced from 70% (xmi) to 2% (binary).
>
> Here's the discussion:
> https://issues.apache.org/jira/browse/UIMA-4685
>
> Best,
>
> Peter
>
>
> On 07.01.2016 at 02:38, Marshall Schor wrote:
>> Thanks for the information!
>>
>> It does look like there's a performance issue if not allowing duplicate
>> adds, for exactly the use case you mentioned: lots of FSs which compare
>> "equal" according to the sorted index keys, but which are not the same FS.
>>
>> This can be fixed I think.
>>
>> There's also a user-centered workaround, for some cases, when it's possible
>> to re-define the type system somewhat.
>>
>> One kind of thing I've seen frequently is that people define types having
>> nothing to do with Annotation (e.g. they don't use begin / end, etc.) as
>> subtypes of Annotation.
>>
>> If you are able to change the type system definitions so that these things
>> are no longer subtypes of Annotation, then the problem might go away.
>>
>> -Marshall
>> On 1/6/2016 6:18 PM, Richard Eckart de Castilho wrote:
>>>>> I am starting to get suspicious of global flags for backwards
>>>>> compatibility. E.g. since ALLOW_DUP_ADD_TO_INDEXES was introduced, we
>>>>> have people complaining about a performance drop.
>>>>> ALLOW_DUP_ADD_TO_INDEXES can only be enabled/disabled globally, but not
>>>>> specifically for individual indexes. Neither can it be temporarily
>>>>> disabled, e.g. during deserialization or other bulk operations.
>>>>> I wonder if local getters/setters or ThreadLocal variables initialized by
>>>>> a global setting wouldn't be a more appropriate option.
>>>> I was unaware of the performance issue; I may have missed some emails...
>>>> Can you say how significant it is? If there were no performance issue,
>>>> would the additional function be needed?
>>>>
>>>> I assume the performance drop is when duplicates are not allowed (the new
>>>> default), and some users are wanting to restore the previous performance by
>>>> turning on ALLOW_DUP .... Is this correct?
>>> I didn't track it in detail, but apparently, some time back Peter noticed a
>>> drop in XMI deserialization performance and more recently also in compressed
>>> binary CAS deserialization. Some time later, I had a person claiming in
>>> private mail that deserialization was O(n^2) with respect to the CAS size.
>>>
>>> At that point, I had a look at the code and it appears that in the worst
>>> case, the duplication check degrades to a linear CAS scan
>>> (cf. FSIndexRepositoryImpl line 98ff and FSIntArrayIndex line 101ff).
>>> That would happen if the CAS contains only items that are equal with
>>> respect to the index criteria, but not actually equal.
>>>
>>> Consider a hypothetical annotation type:
>>>
>>> Metadata extends Annotation {
>>>   String key;
>>>   String value;
>>> }
>>>
>>> where the begin/end are always set to 0..documentLength() and
>>> key/value have arbitrary values. I didn't try it, but if I
>>> understood the code correctly, a CAS containing only such
>>> annotations would suffer heavily during the addToIndexes().
>>>
>>> Cheers,
>>>
>>> -- Richard
>
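
For illustration, the worst case Richard describes can be reproduced with a small stand-alone sketch. This is not the UIMA code: DupCheckSketch, SortedIndex, and countChecks are hypothetical names, the index is a plain sorted list keyed on begin/end only, and the duplicate check is assumed to scan the run of key-equal entries looking for the identical object. Under those assumptions, adding n distinct Metadata items with identical offsets costs about n^2/2 identity checks when duplicate adds are disallowed, and none when they are allowed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DupCheckSketch {
    /** Hypothetical stand-in for the Metadata annotation; key/value are not index keys. */
    static final class Metadata {
        final int begin, end;
        final String key, value;
        Metadata(int begin, int end, String key, String value) {
            this.begin = begin; this.end = end; this.key = key; this.value = value;
        }
    }

    /** Simplified index ordering (assumption): compares begin, then end, nothing else. */
    static final Comparator<Metadata> INDEX_ORDER =
        Comparator.<Metadata>comparingInt(m -> m.begin).thenComparingInt(m -> m.end);

    /** Toy sorted index that optionally rejects a second add of the same object. */
    static final class SortedIndex {
        private final List<Metadata> items = new ArrayList<>();
        long identityChecks = 0;

        void add(Metadata fs, boolean allowDuplicateAdds) {
            int pos = lowerBound(fs);
            if (!allowDuplicateAdds) {
                // Walk every entry that compares equal on the index keys,
                // looking for the very same object. This is the linear scan.
                for (int i = pos; i < items.size()
                        && INDEX_ORDER.compare(items.get(i), fs) == 0; i++) {
                    identityChecks++;
                    if (items.get(i) == fs) {
                        return; // already indexed, skip the duplicate add
                    }
                }
            }
            items.add(pos, fs);
        }

        /** Binary search for the first position whose entry is not less than fs. */
        private int lowerBound(Metadata fs) {
            int lo = 0, hi = items.size();
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (INDEX_ORDER.compare(items.get(mid), fs) < 0) lo = mid + 1; else hi = mid;
            }
            return lo;
        }

        int size() { return items.size(); }
    }

    /** Adds n distinct items that all share begin/end; returns the identity checks performed. */
    static long countChecks(int n, boolean allowDuplicateAdds) {
        SortedIndex index = new SortedIndex();
        for (int i = 0; i < n; i++) {
            index.add(new Metadata(0, 100, "key" + i, "value" + i), allowDuplicateAdds);
        }
        return index.identityChecks;
    }

    public static void main(String[] args) {
        System.out.println("dups disallowed: " + countChecks(1000, false) + " checks");
        System.out.println("dups allowed:    " + countChecks(1000, true) + " checks");
    }
}
```

With 1000 items the disallowed case performs 1000*999/2 = 499500 checks, which is the quadratic growth reported for deserialization. If the items had distinct offsets, the run of key-equal entries would be short and the check would stay cheap, which is why moving such types out from under Annotation can help.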
