Looks good to me. -----Original Message----- From: Masanz, James J. [mailto:[email protected]] Sent: Tuesday, February 05, 2013 11:09 AM To: '[email protected]' Subject: RE: assistance with dictionary lookup issue
Yes, that's what I had in mind, with a future work item to consider doing the same improvement to the other LookupInitializer(s) some day. -- James > -----Original Message----- > From: > ctakes-dev-return-1148-Masanz.James=mayo....@incubator.apache.org > [mailto:[email protected]. > org] > On Behalf Of Tim Miller > Sent: Tuesday, February 05, 2013 10:05 AM > To: [email protected] > Subject: Re: assistance with dictionary lookup issue > > OK thanks, that clarifies it (including Sean's concern) for me. I > think calling it getSortedLookupTokens is the right idea. Then it > makes it clear to the implementing class that at some point sorting > has to be done, and if its not implicit/free in the implementation (as > it is if you base it on Annotations), then it needs to be explicit. > Since that can be a substantial performance penalty, putting the onus > on the implementation will hopefully lead to better implementations. > > So to summarize the changes to make sure we're on the same page: > > * the interface will add a method: > > public List getSortedLookupTokens(JCas, Annotation); > > * getLookupTokenIterator() will be reverted to its old version. > > * FirstTokenPermLookupInitializerImpl will have its > getLookupTokenIterator method reverted, and my changes > > will go in the implementation of getSortedLookupTokens() modulo the > unnecessary iterator inside. > > * DictionaryLookupAnnotator will be changed to call > getSortedLookupTokens instead of > getLookupTokenIterator/constrainToWindow > * Other LookupInitializer's will be implemented to simply call > existing getLookupTokenIterator and do post-processing to constrain > to the window (?) > > Does this match what you had in mind James? Any objections or things > I'm missing anyone? > > > On 02/05/2013 10:38 AM, Masanz, James J. wrote: > > First, about the loop - I had been looking too quickly at the diff > > and > didn't notice the logic about punctuation etc > > > > Second, what I remember when I looked at it before, was seeing > > interface > named LookupInitializer, which being old enough, doesn't have Iterator > parameterized in the definition of the getLookupTokenIterator: > > public Iterator getLookupTokenIterator(JCas jcas) > > throws AnnotatorInitializationException; > > > > and that ends up being effectively an Iterator<LookupToken> and > LookupToken does not inherit from Annotation, and I stopped at that point. > > > > But now looking farther, it looks to me that that's fine because in > FirstTokenPermLookupInitializerImpl, we look through the BaseTokens > and create the list of LookupTokens based on the (sorted) BaseTokens, > so the LookupTokens will be sorted too. > > > > so maybe we should call the new method getSortedLookupTokens to make > > it > clear they too are sorted > > > > ________________________________________ > > From: > > ctakes-dev-return-1146-Masanz.James=mayo....@incubator.apache.org > [ctakes-dev-return-1146-Masanz.James=mayo....@incubator.apache.org] on > behalf of Tim Miller [[email protected]] > > Sent: Tuesday, February 05, 2013 9:10 AM > > To: [email protected] > > Subject: Re: assistance with dictionary lookup issue > > > > Yeah, if you mean just change the loop to iterate over the list > > instead of getting an iterator that makes sense. There is still > > some logic in there to leave out punctuation tokens but I think you > > were implying that to be in your mockup diff. > > > > As for sorting, the AnnotationIndex defines a sort order for its > iterators: > > http://uima.apache.org/d/uimaj- > 2.4.0/apidocs/org/apache/uima/cas/text/AnnotationIndex.html > > > > so we are safe assuming that anything extending Annotation will be > > iterated in sorted order. Does that answer the questions we had? Or > > was I missing something subtle in that discussion? > > > > Tim > > > > On 02/05/2013 09:44 AM, Masanz, James J. wrote: > >> Looks good to me, with one question. > >> > >> Instead of getting an iterator and then building a new list, can we > just skip getting the iterator and use the list that selectCovered > returns? > >> > >> I will mock up a diff here of what I mean: > >> - Iterator btaItr = org.uimafit.util.JCasUtil.selectCovered(jcas, > BaseToken.class, covering).iterator(); > >> - while (btaItr.hasNext()) > >> - { > >> - BaseToken bta = (BaseToken) btaItr.next(); > >> - ltList.add(lt); > >> - } > >> - } > >> > >> + ltList = org.uimafit.util.JCasUtil.selectCovered(jcas, > BaseToken.class, covering); > >> > >> return ltList; > >> > >> I know you said it was quick and dirty at the moment - my 2 cents - > unless someone comes up with a better engineered solution, I think we > could add the new method (with a name like getLookupTokens) and leave > the old one so we don't have to deprecate anything. And phase in the > change to the various *LookupInitializerImpl classes if needed. > >> > >> -- James > >> > >> > >>> -----Original Message----- > >>> From: ctakes-dev-return-1138- > [email protected] > >>> [mailto:ctakes-dev-return-1138- > [email protected]] > >>> On Behalf Of Masanz, James J. > >>> Sent: Monday, February 04, 2013 4:01 PM > >>> To: [email protected] > >>> Subject: RE: assistance with dictionary lookup issue > >>> > >>> I'll take a look at the patch. Also be aware of > >>> https://issues.apache.org/jira/browse/CTAKES-31 which talks about > >>> a > way of > >>> enhancing performance -- if willing to assume annotations > >>> (BaseTokens > >>> currently) are sorted. Currently it's always BaseToken and always > sorted, > >>> just not sure if we want to code to that assumption. > >>> > >>> ________________________________________ > >>> From: ctakes-dev-return-1137- > [email protected] > >>> [ctakes-dev-return-1137-Masanz.James=mayo....@incubator.apache.org > >>> ] on behalf of Tim Miller [[email protected]] > >>> Sent: Monday, February 04, 2013 3:43 PM > >>> To: [email protected] > >>> Subject: assistance with dictionary lookup issue > >>> > >>> Pei helped me track down an issue with performance I'd noticed in > >>> the dictionary annotator, and I have filed the issue here: > >>> https://issues.apache.org/jira/browse/CTAKES-143 > >>> > >>> I implemented a quick and dirty proof of concept fix and noticed > dramatic > >>> performance improvement. I attached the patch to the issue, but > >>> it involves changing an interface (currently does not try to fix > >>> other implementing classes so obviously not ready for primetime), > >>> so I > wanted to > >>> solicit the list first in case anyone with better knowledge of > >>> that > module > >>> has some better engineering ideas than what I came up with. > >>> > >>> Thanks, > >>> > >>> -- > >>> Tim Miller, PhD > >>> Postdoctoral Research Fellow > >>> Children's Hospital Informatics Program Children's Hospital Boston > >>> and Harvard Medical School > >>> 617-919-1223
