Yes, that's what I had in mind, with a future work item to consider doing the same improvement to the other LookupInitializer(s) some day.
-- James > -----Original Message----- > From: ctakes-dev-return-1148-Masanz.James=mayo....@incubator.apache.org > [mailto:ctakes-dev-return-1148-Masanz.James=mayo....@incubator.apache.org] > On Behalf Of Tim Miller > Sent: Tuesday, February 05, 2013 10:05 AM > To: [email protected] > Subject: Re: assistance with dictionary lookup issue > > OK thanks, that clarifies it (including Sean's concern) for me. I think > calling it getSortedLookupTokens is the right idea. Then it makes it > clear to the implementing class that at some point sorting has to be done, > and if its not implicit/free in the implementation (as it is if you base > it on Annotations), then it needs to be explicit. Since that can be a > substantial performance penalty, putting the onus on the implementation > will hopefully lead to better implementations. > > So to summarize the changes to make sure we're on the same page: > > * the interface will add a method: > > public List getSortedLookupTokens(JCas, Annotation); > > * getLookupTokenIterator() will be reverted to its old version. > > * FirstTokenPermLookupInitializerImpl will have its > getLookupTokenIterator method reverted, and my changes > > will go in the implementation of getSortedLookupTokens() modulo the > unnecessary iterator inside. > > * DictionaryLookupAnnotator will be changed to call > getSortedLookupTokens instead of > getLookupTokenIterator/constrainToWindow > * Other LookupInitializer's will be implemented to simply call > existing getLookupTokenIterator and do post-processing to constrain > to the window (?) > > Does this match what you had in mind James? Any objections or things I'm > missing anyone? > > > On 02/05/2013 10:38 AM, Masanz, James J. wrote: > > First, about the loop - I had been looking too quickly at the diff and > didn't notice the logic about punctuation etc > > > > Second, what I remember when I looked at it before, was seeing interface > named LookupInitializer, which being old enough, doesn't have Iterator > parameterized in the definition of the getLookupTokenIterator: > > public Iterator getLookupTokenIterator(JCas jcas) > > throws AnnotatorInitializationException; > > > > and that ends up being effectively an Iterator<LookupToken> and > LookupToken does not inherit from Annotation, and I stopped at that point. > > > > But now looking farther, it looks to me that that's fine because in > FirstTokenPermLookupInitializerImpl, we look through the BaseTokens and > create the list of LookupTokens based on the (sorted) BaseTokens, so the > LookupTokens will be sorted too. > > > > so maybe we should call the new method getSortedLookupTokens to make it > clear they too are sorted > > > > ________________________________________ > > From: ctakes-dev-return-1146-Masanz.James=mayo....@incubator.apache.org > [ctakes-dev-return-1146-Masanz.James=mayo....@incubator.apache.org] on > behalf of Tim Miller [[email protected]] > > Sent: Tuesday, February 05, 2013 9:10 AM > > To: [email protected] > > Subject: Re: assistance with dictionary lookup issue > > > > Yeah, if you mean just change the loop to iterate over the list instead > > of getting an iterator that makes sense. There is still some logic in > > there to leave out punctuation tokens but I think you were implying that > > to be in your mockup diff. > > > > As for sorting, the AnnotationIndex defines a sort order for its > iterators: > > http://uima.apache.org/d/uimaj- > 2.4.0/apidocs/org/apache/uima/cas/text/AnnotationIndex.html > > > > so we are safe assuming that anything extending Annotation will be > > iterated in sorted order. Does that answer the questions we had? Or was > > I missing something subtle in that discussion? > > > > Tim > > > > On 02/05/2013 09:44 AM, Masanz, James J. wrote: > >> Looks good to me, with one question. > >> > >> Instead of getting an iterator and then building a new list, can we > just skip getting the iterator and use the list that selectCovered > returns? > >> > >> I will mock up a diff here of what I mean: > >> - Iterator btaItr = org.uimafit.util.JCasUtil.selectCovered(jcas, > BaseToken.class, covering).iterator(); > >> - while (btaItr.hasNext()) > >> - { > >> - BaseToken bta = (BaseToken) btaItr.next(); > >> - ltList.add(lt); > >> - } > >> - } > >> > >> + ltList = org.uimafit.util.JCasUtil.selectCovered(jcas, > BaseToken.class, covering); > >> > >> return ltList; > >> > >> I know you said it was quick and dirty at the moment - my 2 cents - > unless someone comes up with a better engineered solution, I think we > could add the new method (with a name like getLookupTokens) and leave the > old one so we don't have to deprecate anything. And phase in the change to > the various *LookupInitializerImpl classes if needed. > >> > >> -- James > >> > >> > >>> -----Original Message----- > >>> From: ctakes-dev-return-1138- > [email protected] > >>> [mailto:ctakes-dev-return-1138- > [email protected]] > >>> On Behalf Of Masanz, James J. > >>> Sent: Monday, February 04, 2013 4:01 PM > >>> To: [email protected] > >>> Subject: RE: assistance with dictionary lookup issue > >>> > >>> I'll take a look at the patch. Also be aware of > >>> https://issues.apache.org/jira/browse/CTAKES-31 which talks about a > way of > >>> enhancing performance -- if willing to assume annotations (BaseTokens > >>> currently) are sorted. Currently it's always BaseToken and always > sorted, > >>> just not sure if we want to code to that assumption. > >>> > >>> ________________________________________ > >>> From: ctakes-dev-return-1137- > [email protected] > >>> [ctakes-dev-return-1137-Masanz.James=mayo....@incubator.apache.org] on > >>> behalf of Tim Miller [[email protected]] > >>> Sent: Monday, February 04, 2013 3:43 PM > >>> To: [email protected] > >>> Subject: assistance with dictionary lookup issue > >>> > >>> Pei helped me track down an issue with performance I'd noticed in the > >>> dictionary annotator, and I have filed the issue here: > >>> https://issues.apache.org/jira/browse/CTAKES-143 > >>> > >>> I implemented a quick and dirty proof of concept fix and noticed > dramatic > >>> performance improvement. I attached the patch to the issue, but it > >>> involves changing an interface (currently does not try to fix other > >>> implementing classes so obviously not ready for primetime), so I > wanted to > >>> solicit the list first in case anyone with better knowledge of that > module > >>> has some better engineering ideas than what I came up with. > >>> > >>> Thanks, > >>> > >>> -- > >>> Tim Miller, PhD > >>> Postdoctoral Research Fellow > >>> Children's Hospital Informatics Program > >>> Children's Hospital Boston and Harvard Medical School > >>> 617-919-1223
