Yes, that's what I had in mind, with a future work item to consider doing the 
same improvement to the other LookupInitializer(s) some day.

-- James

> -----Original Message-----
> From: ctakes-dev-return-1148-Masanz.James=mayo....@incubator.apache.org
> [mailto:ctakes-dev-return-1148-Masanz.James=mayo....@incubator.apache.org]
> On Behalf Of Tim Miller
> Sent: Tuesday, February 05, 2013 10:05 AM
> To: [email protected]
> Subject: Re: assistance with dictionary lookup issue
> 
> OK thanks, that clarifies it (including Sean's concern) for me.  I think
> calling it getSortedLookupTokens is the right idea.  Then it makes it
> clear to the implementing class that at some point sorting has to be done,
> and if its not implicit/free in the implementation (as it is if you base
> it on Annotations), then it needs to be explicit.  Since that can be a
> substantial performance penalty, putting the onus on the implementation
> will hopefully lead to better implementations.
> 
> So to summarize the changes to make sure we're on the same page:
> 
>   * the interface will add a method:
> 
>              public List getSortedLookupTokens(JCas, Annotation);
> 
>   * getLookupTokenIterator() will be reverted to its old version.
> 
>   * FirstTokenPermLookupInitializerImpl will have its
>     getLookupTokenIterator method reverted, and my changes
> 
> will go in the implementation of getSortedLookupTokens()   modulo the
> unnecessary iterator inside.
> 
>   * DictionaryLookupAnnotator will be changed to call
>     getSortedLookupTokens instead of
>     getLookupTokenIterator/constrainToWindow
>   * Other LookupInitializer's will be implemented to simply call
>     existing getLookupTokenIterator and do post-processing to constrain
>     to the window (?)
> 
> Does this match what you had in mind James?  Any objections or things I'm
> missing anyone?
> 
> 
> On 02/05/2013 10:38 AM, Masanz, James J. wrote:
> > First, about the loop - I had been looking too quickly at the diff and
> didn't notice the logic about punctuation etc
> >
> > Second, what I remember when I looked at it before, was seeing interface
> named LookupInitializer, which being old enough, doesn't have Iterator
> parameterized in the definition of the getLookupTokenIterator:
> >      public Iterator getLookupTokenIterator(JCas jcas)
> >              throws AnnotatorInitializationException;
> >
> > and that ends up being effectively an Iterator<LookupToken> and
> LookupToken does not inherit from Annotation, and I stopped at that point.
> >
> > But now looking farther, it looks to me that that's fine because in
> FirstTokenPermLookupInitializerImpl, we look through the BaseTokens and
> create the list of LookupTokens based on the (sorted) BaseTokens, so the
> LookupTokens will be sorted too.
> >
> > so maybe we should call the new method getSortedLookupTokens to make it
> clear they too are sorted
> >
> > ________________________________________
> > From: ctakes-dev-return-1146-Masanz.James=mayo....@incubator.apache.org
> [ctakes-dev-return-1146-Masanz.James=mayo....@incubator.apache.org] on
> behalf of Tim Miller [[email protected]]
> > Sent: Tuesday, February 05, 2013 9:10 AM
> > To: [email protected]
> > Subject: Re: assistance with dictionary lookup issue
> >
> > Yeah, if you mean just change the loop to iterate over the list instead
> > of getting an iterator that makes sense.  There is still some logic in
> > there to leave out punctuation tokens but I think you were implying that
> > to be in your mockup diff.
> >
> > As for sorting, the AnnotationIndex defines a sort order for its
> iterators:
> > http://uima.apache.org/d/uimaj-
> 2.4.0/apidocs/org/apache/uima/cas/text/AnnotationIndex.html
> >
> > so we are safe assuming that anything extending Annotation will be
> > iterated in sorted order.  Does that answer the questions we had? Or was
> > I missing something subtle in that discussion?
> >
> > Tim
> >
> > On 02/05/2013 09:44 AM, Masanz, James J. wrote:
> >> Looks good to me, with one question.
> >>
> >> Instead of getting an iterator and then building a new list, can we
> just skip getting the iterator and use the list that selectCovered
> returns?
> >>
> >> I will mock up a diff here of what I mean:
> >> -     Iterator btaItr = org.uimafit.util.JCasUtil.selectCovered(jcas,
> BaseToken.class, covering).iterator();
> >> -     while (btaItr.hasNext())
> >> -             {
> >> -                     BaseToken bta = (BaseToken) btaItr.next();
> >> -                             ltList.add(lt);
> >> -                     }
> >> -             }
> >>
> >> +     ltList = org.uimafit.util.JCasUtil.selectCovered(jcas,
> BaseToken.class, covering);
> >>
> >>        return ltList;
> >>
> >> I know you said it was quick and dirty at the moment - my 2 cents -
> unless someone comes up with a better engineered solution, I think we
> could add the new method (with a name like getLookupTokens) and leave the
> old one so we don't have to deprecate anything. And phase in the change to
> the various *LookupInitializerImpl classes if needed.
> >>
> >> -- James
> >>
> >>
> >>> -----Original Message-----
> >>> From: ctakes-dev-return-1138-
> [email protected]
> >>> [mailto:ctakes-dev-return-1138-
> [email protected]]
> >>> On Behalf Of Masanz, James J.
> >>> Sent: Monday, February 04, 2013 4:01 PM
> >>> To: [email protected]
> >>> Subject: RE: assistance with dictionary lookup issue
> >>>
> >>> I'll take a look at the patch. Also be aware of
> >>> https://issues.apache.org/jira/browse/CTAKES-31 which talks about a
> way of
> >>> enhancing performance  -- if willing to assume annotations (BaseTokens
> >>> currently) are sorted. Currently it's always BaseToken and always
> sorted,
> >>> just not sure if we want to code to that assumption.
> >>>
> >>> ________________________________________
> >>> From: ctakes-dev-return-1137-
> [email protected]
> >>> [ctakes-dev-return-1137-Masanz.James=mayo....@incubator.apache.org] on
> >>> behalf of Tim Miller [[email protected]]
> >>> Sent: Monday, February 04, 2013 3:43 PM
> >>> To: [email protected]
> >>> Subject: assistance with dictionary lookup issue
> >>>
> >>> Pei helped me track down an issue with performance I'd noticed in the
> >>> dictionary annotator, and I have filed the issue here:
> >>> https://issues.apache.org/jira/browse/CTAKES-143
> >>>
> >>> I implemented a quick and dirty proof of concept fix and noticed
> dramatic
> >>> performance improvement.  I attached the patch to the issue, but it
> >>> involves changing an interface (currently does not try to fix other
> >>> implementing classes so obviously not ready for primetime), so I
> wanted to
> >>> solicit the list first in case anyone with better knowledge of that
> module
> >>> has some better engineering ideas than what I came up with.
> >>>
> >>> Thanks,
> >>>
> >>> --
> >>> Tim Miller, PhD
> >>> Postdoctoral Research Fellow
> >>> Children's Hospital Informatics Program
> >>> Children's Hospital Boston and Harvard Medical School
> >>> 617-919-1223

Reply via email to