New patch with these changes attached to ctakes-143.
Tim

On 02/05/2013 11:14 AM, Finan, Sean wrote:
Looks good to me.

-----Original Message-----
From: Masanz, James J. [mailto:[email protected]]
Sent: Tuesday, February 05, 2013 11:09 AM
To: '[email protected]'
Subject: RE: assistance with dictionary lookup issue

Yes, that's what I had in mind, with a future work item to consider doing the 
same improvement to the other LookupInitializer(s) some day.

-- James

-----Original Message-----
From:
ctakes-dev-return-1148-Masanz.James=mayo....@incubator.apache.org
[mailto:[email protected].
org]
On Behalf Of Tim Miller
Sent: Tuesday, February 05, 2013 10:05 AM
To: [email protected]
Subject: Re: assistance with dictionary lookup issue

OK thanks, that clarifies it (including Sean's concern) for me.  I
think calling it getSortedLookupTokens is the right idea.  Then it
makes it clear to the implementing class that at some point sorting
has to be done, and if its not implicit/free in the implementation (as
it is if you base it on Annotations), then it needs to be explicit.
Since that can be a substantial performance penalty, putting the onus
on the implementation will hopefully lead to better implementations.

So to summarize the changes to make sure we're on the same page:

   * the interface will add a method:

              public List getSortedLookupTokens(JCas, Annotation);

   * getLookupTokenIterator() will be reverted to its old version.

   * FirstTokenPermLookupInitializerImpl will have its
     getLookupTokenIterator method reverted, and my changes

will go in the implementation of getSortedLookupTokens()   modulo the
unnecessary iterator inside.

   * DictionaryLookupAnnotator will be changed to call
     getSortedLookupTokens instead of
     getLookupTokenIterator/constrainToWindow
   * Other LookupInitializer's will be implemented to simply call
     existing getLookupTokenIterator and do post-processing to constrain
     to the window (?)

Does this match what you had in mind James?  Any objections or things
I'm missing anyone?


On 02/05/2013 10:38 AM, Masanz, James J. wrote:
First, about the loop - I had been looking too quickly at the diff
and
didn't notice the logic about punctuation etc
Second, what I remember when I looked at it before, was seeing
interface
named LookupInitializer, which being old enough, doesn't have Iterator
parameterized in the definition of the getLookupTokenIterator:
      public Iterator getLookupTokenIterator(JCas jcas)
              throws AnnotatorInitializationException;

and that ends up being effectively an Iterator<LookupToken> and
LookupToken does not inherit from Annotation, and I stopped at that point.
But now looking farther, it looks to me that that's fine because in
FirstTokenPermLookupInitializerImpl, we look through the BaseTokens
and create the list of LookupTokens based on the (sorted) BaseTokens,
so the LookupTokens will be sorted too.
so maybe we should call the new method getSortedLookupTokens to make
it
clear they too are sorted
________________________________________
From:
ctakes-dev-return-1146-Masanz.James=mayo....@incubator.apache.org
[ctakes-dev-return-1146-Masanz.James=mayo....@incubator.apache.org] on
behalf of Tim Miller [[email protected]]
Sent: Tuesday, February 05, 2013 9:10 AM
To: [email protected]
Subject: Re: assistance with dictionary lookup issue

Yeah, if you mean just change the loop to iterate over the list
instead of getting an iterator that makes sense.  There is still
some logic in there to leave out punctuation tokens but I think you
were implying that to be in your mockup diff.

As for sorting, the AnnotationIndex defines a sort order for its
iterators:
http://uima.apache.org/d/uimaj-
2.4.0/apidocs/org/apache/uima/cas/text/AnnotationIndex.html
so we are safe assuming that anything extending Annotation will be
iterated in sorted order.  Does that answer the questions we had? Or
was I missing something subtle in that discussion?

Tim

On 02/05/2013 09:44 AM, Masanz, James J. wrote:
Looks good to me, with one question.

Instead of getting an iterator and then building a new list, can we
just skip getting the iterator and use the list that selectCovered
returns?
I will mock up a diff here of what I mean:
-     Iterator btaItr = org.uimafit.util.JCasUtil.selectCovered(jcas,
BaseToken.class, covering).iterator();
-     while (btaItr.hasNext())
-             {
-                     BaseToken bta = (BaseToken) btaItr.next();
-                             ltList.add(lt);
-                     }
-             }

+     ltList = org.uimafit.util.JCasUtil.selectCovered(jcas,
BaseToken.class, covering);
        return ltList;

I know you said it was quick and dirty at the moment - my 2 cents -
unless someone comes up with a better engineered solution, I think we
could add the new method (with a name like getLookupTokens) and leave
the old one so we don't have to deprecate anything. And phase in the
change to the various *LookupInitializerImpl classes if needed.
-- James


-----Original Message-----
From: ctakes-dev-return-1138-
[email protected]
[mailto:ctakes-dev-return-1138-
[email protected]]
On Behalf Of Masanz, James J.
Sent: Monday, February 04, 2013 4:01 PM
To: [email protected]
Subject: RE: assistance with dictionary lookup issue

I'll take a look at the patch. Also be aware of
https://issues.apache.org/jira/browse/CTAKES-31 which talks about
a
way of
enhancing performance  -- if willing to assume annotations
(BaseTokens
currently) are sorted. Currently it's always BaseToken and always
sorted,
just not sure if we want to code to that assumption.

________________________________________
From: ctakes-dev-return-1137-
[email protected]
[ctakes-dev-return-1137-Masanz.James=mayo....@incubator.apache.org
] on behalf of Tim Miller [[email protected]]
Sent: Monday, February 04, 2013 3:43 PM
To: [email protected]
Subject: assistance with dictionary lookup issue

Pei helped me track down an issue with performance I'd noticed in
the dictionary annotator, and I have filed the issue here:
https://issues.apache.org/jira/browse/CTAKES-143

I implemented a quick and dirty proof of concept fix and noticed
dramatic
performance improvement.  I attached the patch to the issue, but
it involves changing an interface (currently does not try to fix
other implementing classes so obviously not ready for primetime),
so I
wanted to
solicit the list first in case anyone with better knowledge of
that
module
has some better engineering ideas than what I came up with.

Thanks,

--
Tim Miller, PhD
Postdoctoral Research Fellow
Children's Hospital Informatics Program Children's Hospital Boston
and Harvard Medical School
617-919-1223

Reply via email to