I think that Kean is correct.  I usually create an annotator that removes terms 
that I don't want.  It is usually fairly easy.

      final Predicate<IdentifiedAnnotation> is2char
            = a -> a.getCoveredText().length() == 2;

      final String geneTui = SemanticTui.getTui( "Gene or Genome" ).name();

      OntologyConceptUtil.getAnnotationsByTui( jCas, geneTui )
                         .stream()
                         .filter( is2char )
                         .forEach( Annotation::removeFromIndexes );


Or, if you want to grab a few that aren't specifically "Gene" but are in the 
same semantic group (without looking it up in class SemanticGroup) and are in 
the HGNC vocabulary:

      final Class<? extends IdentifiedAnnotation> geneClass
            = SemanticTui.getTui( "Gene or Genome" )
                         .getGroup()
                         .getCtakesClass();

      final Predicate<IdentifiedAnnotation> isHgnc
            = a -> OntologyConceptUtil.getSchemeCodes( a )
                                      .containsKey( "hgnc" );

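      // Copy to a list (java.util.stream.Collectors) before removing, so the
      // index is not modified while it is being iterated.
      JCasUtil.select( jCas, geneClass )
              .stream()
              .filter( is2char )
              .filter( isHgnc )
              .collect( Collectors.toList() )
              .forEach( Annotation::removeFromIndexes );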


The scheme key "hgnc" may need to be "HGNC", and it will only exist if you 
stored the HGNC codes in your dictionary.


Or you can do it by focusing on what you do want:

      final Collection<SemanticGroup> WANTED_GROUP
            = EnumSet.of( SemanticGroup.DRUG, SemanticGroup.LAB );
      
      final Predicate<IdentifiedAnnotation> isTrashGroup
            = a -> SemanticGroup.getGroups( a )
                                .stream()
                                .noneMatch( WANTED_GROUP::contains );
      
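      // Copy to a list (java.util.stream.Collectors) before removing, so the
      // index is not modified while it is being iterated.
      JCasUtil.select( jCas, IdentifiedAnnotation.class )
              .stream()
              .filter( is2char )
              .filter( isTrashGroup )
              .collect( Collectors.toList() )
              .forEach( Annotation::removeFromIndexes );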

Or if you want to cover all combinations that aren't all uppercase:

      final Predicate<IdentifiedAnnotation> notCaps
            = a -> a.getCoveredText()
                    .chars()
                    .anyMatch( Character::isLowerCase );

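      // Copy to a list (java.util.stream.Collectors) before removing, so the
      // index is not modified while it is being iterated.
      JCasUtil.select( jCas, IdentifiedAnnotation.class )
              .stream()
              .filter( is2char )
              .filter( notCaps )
              .collect( Collectors.toList() )
              .forEach( Annotation::removeFromIndexes );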

Or mix and modify.  For instance, ignore character length and remove 
annotations where the TUI is Gene and the text is not all caps.
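
As a rough, self-contained sketch of that last combination (this is not cTAKES 
code: the Term class is a hypothetical stand-in for IdentifiedAnnotation, and 
keep() is an illustrative helper), the predicates can be combined with 
Predicate.and and negate.  It assumes T028 is the TUI for "Gene or Genome":

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class MixAndModify {

   // Hypothetical stand-in for IdentifiedAnnotation: covered text plus one TUI.
   static final class Term {
      final String text;
      final String tui;

      Term( final String text, final String tui ) {
         this.text = text;
         this.tui = tui;
      }
   }

   // Assumption: T028 is the UMLS TUI for "Gene or Genome".
   static final String GENE_TUI = "T028";

   static final Predicate<Term> isGene = t -> GENE_TUI.equals( t.tui );

   // True when the text contains at least one lowercase character.
   static final Predicate<Term> notCaps
         = t -> t.text.chars().anyMatch( Character::isLowerCase );

   // Keep everything except gene terms whose text is not all caps.
   static List<Term> keep( final List<Term> terms ) {
      final Predicate<Term> isTrash = isGene.and( notCaps );
      return terms.stream()
                  .filter( isTrash.negate() )
                  .collect( Collectors.toList() );
   }

   public static void main( final String[] args ) {
      final List<Term> terms = Arrays.asList(
            new Term( "Spring", GENE_TUI ),
            new Term( "BRCA1", GENE_TUI ),
            new Term( "plan", "T061" ) );
      keep( terms ).forEach( t -> System.out.println( t.text ) );
   }
}
```

In a real annotator the same isTrash predicate would filter the annotations 
to remove rather than the ones to keep.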

Sometimes I enjoy mocking up code ...

Sean

________________________________________
From: Kean Kaufmann <k...@recordsone.com>
Sent: Monday, August 24, 2020 9:35 PM
To: dev@ctakes.apache.org
Subject: Re: Question about window size in term lookup [EXTERNAL]


>
> my question is whether there's a place where one can register specific two
> character terms, for example BP or PT which will be found even with a
> window size set to three.


My brute-force approach is pretty brutal: Change the window size to two,
annotate terms, then remove all two-letter annotations except the very few
I'm interested in.
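
A minimal sketch of that keep-list idea, working on plain covered-text strings 
rather than real annotations.  The class and method names are hypothetical, 
and the keep-list holds only the two examples from the question (BP, PT):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class TwoLetterKeepList {

   // Hypothetical keep-list: the only two-letter terms that should survive.
   static final Set<String> KEEP = new HashSet<>( Arrays.asList( "BP", "PT" ) );

   // A term is unwanted when it has exactly two characters and is not on the
   // keep-list.  The comparison is case-sensitive.
   static boolean isUnwanted( final String coveredText ) {
      return coveredText.length() == 2 && !KEEP.contains( coveredText );
   }

   // Return the terms that remain after dropping unwanted two-letter terms.
   static List<String> prune( final List<String> coveredTexts ) {
      return coveredTexts.stream()
                         .filter( t -> !isUnwanted( t ) )
                         .collect( Collectors.toList() );
   }
}
```

Longer terms pass through untouched; only two-character terms are checked 
against the keep-list.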

On Mon, Aug 24, 2020 at 9:07 PM Peter Abramowitsch <pabramowit...@gmail.com>
wrote:

> Hello all
>
> Is there a mechanism, a lookup file, etc which overrides the window size
> set on the term annotator or the chunker.   Changing the window size from
> the default of 3 to 2 opens the floodgate to false acronym annotations.  So
> my question is whether there's a place where one can register specific two
> character terms, for example BP or PT which will be found even with a
> window size set to three.
>
> A similar question about Genes.   On adding the HGNC vocabulary I notice
> that there are many thousands of aliases for genes which overlap other
> common acronyms and english words such as trip, spring, plan, bed, yes,
> rip, prn etc.   I'm not sure if these aliases are ever used.   So I created
> a sed script with 4000 regex expressions to remove the 2 and 3 letter gene
> synonyms from a script file.  I will only suppress the 4 letter synonyms
> manually where they cause trouble.     But does anyone have a  more elegant
> solution?
>
> Peter
>
