[jira] [Created] (CTAKES-16) use uimaFIT's selectCovered() instead of UIMA's subiterator

Pei Chen (JIRA) Thu, 09 Aug 2012 13:39:42 -0700

Pei Chen created CTAKES-16:
------------------------------

             Summary: use uimaFIT's selectCovered() instead of UIMA's 
subiterator
                 Key: CTAKES-16
                 URL: https://issues.apache.org/jira/browse/CTAKES-16
             Project: cTAKES
          Issue Type: Improvement
          Components: ctakes-assertion, ctakes-chunker, 
ctakes-clinical-pipeline, ctakes-context-tokenizer, ctakes-core, 
ctakes-dependency-parser, ctakes-ne-contexts, ctakes-pos-tagger
            Reporter: Pei Chen
            Priority: Minor



We could not get consistent results from .subiterator when using uimaFIT with
the cTAKES GUI (which wires the components together dynamically).

To get all the BaseTokens for a particular sentence with .subiterator, the
types have to be stored in the FS indexes in a certain order; otherwise it may
just return an empty list.  This would require users of the annotators to
understand the ordering of types and have it preconfigured.

FSIterator<Annotation> tokensInSentenceIterator = 
jcas.getAnnotationIndex(BaseToken.type).subiterator(sentence);

uimaFIT already provides a convenience method that seems to do something
similar and always returns the expected tokens.  Does anyone know if this was
part of the motivation?  Is the performance hit (if any) worth the ease of use?
Ex:
List<BaseToken> tokens = org.uimafit.util.JCasUtil.selectCovered(jCas, 
BaseToken.class, sentence);

Another alternative is UIMA's FilteredIterator.
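
To make the expected result concrete, here is a minimal stdlib-only sketch of
the "covered" selection described above: keep every candidate whose offsets lie
inside the covering span, regardless of how types are ordered in any index.
The Span class and this selectCovered helper are hypothetical stand-ins for
illustration only; this is not uimaFIT's implementation and it models nothing
beyond the offset check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stdlib-only model of "covered" selection; NOT uimaFIT's actual code. */
class CoveredSelectionSketch {

    /** Hypothetical stand-in for an annotation: character offsets begin..end. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    /** Returns the candidates lying entirely within the covering span, in input order. */
    static List<Span> selectCovered(List<Span> candidates, Span covering) {
        List<Span> covered = new ArrayList<>();
        for (Span s : candidates) {
            // Containment depends only on offsets, not on any type ordering.
            if (s.begin >= covering.begin && s.end <= covering.end) {
                covered.add(s);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        Span sentence = new Span(10, 25);
        List<Span> tokens = Arrays.asList(
                new Span(0, 4),    // before the sentence
                new Span(10, 14),  // inside
                new Span(15, 25),  // inside, ends at the sentence boundary
                new Span(24, 30)); // overlaps the end, so not covered
        System.out.println(selectCovered(tokens, sentence));
    }
}
```

In this model only the two tokens fully inside the sentence are returned; a
token that merely overlaps the sentence boundary is left out.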

There are a few places that use subiterator in cTAKES and it's tempting to use 
uimaFIT's JCasUtil.selectCovered() instead... What do others think?

Background: This issue surfaced when we used the cTAKES GUI (which uses uimaFIT 
to wire the components together instead of the Aggregate XML descriptor).

--Pei

On Aug 9, 2012, at 9:18 AM, Chen, Pei wrote:
> To get all the BaseTokens for a particular sentence, if we use the
> .subiterator, the types have to be stored in the FS indexes in a certain
> order; otherwise it could just return an empty list.  This would require the
> users of annotators to understand the ordering of types and have it
> preconfigured.
>
> FSIterator<Annotation> tokensInSentenceIterator =
> jcas.getAnnotationIndex(BaseToken.type).subiterator(sentence);
>
> uimaFIT already created a convenience method that seems to do something
> similar which will always return the expected tokens.  Does anyone know if
> this was part of the motivation?

Yes, that was exactly the motivation to avoid using subiterators. Our experience
in uimaFIT was that subiterators never did what you wanted them to do.

> Is the performance hit (if any) worth the ease of use?

I doubt there's a performance hit. Take a look at the source for
JCasUtil.selectCovered vs. org.apache.uima.cas.impl.Subiterator. If anything,
selectCovered is probably doing less.

But of course you could time it and find out for sure.
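
One rough way to do that timing is a small harness along these lines. It is
only a sketch: averageNanos and the two placeholder workloads are hypothetical
names, and the placeholder bodies would need to be replaced with the real
subiterator loop and the real selectCovered call on a loaded JCas. A dedicated
benchmarking tool would give more trustworthy numbers than a hand-rolled loop.

```java
/** Minimal wall-clock timing sketch; names here are illustrative only. */
class TimingSketch {

    /**
     * Runs the task `warmup` times untimed, then `runs` times timed, and
     * returns the average elapsed nanoseconds per timed run.
     */
    static long averageNanos(Runnable task, int warmup, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmup; i++) {
            task.run(); // give the JIT a chance to compile the hot path
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / runs;
    }

    public static void main(String[] args) {
        // Placeholder workloads: swap in the subiterator loop and the
        // selectCovered call being compared.
        Runnable subiteratorVariant = () -> { /* iterate the subiterator here */ };
        Runnable selectCoveredVariant = () -> { /* call selectCovered here */ };

        System.out.println("subiterator:   "
                + averageNanos(subiteratorVariant, 1_000, 10_000) + " ns/run");
        System.out.println("selectCovered: "
                + averageNanos(selectCoveredVariant, 1_000, 10_000) + " ns/run");
    }
}
```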

Steve

The full discussion thread can be found here: 
http://markmail.org/search/+list:org.apache.incubator.ctakes-dev#query:%20list%3Aorg.apache.incubator.ctakes-dev+page:1+mid:hcp3rudjelddo2dy+state:results


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
