BsvRegexSectionizer breaks my pipeline

Mullane, Sean *HS Thu, 01 Mar 2018 12:59:10 -0800

I am finding that the addition of BsvRegexSectionizer to my pipeline (below) 
has slowed it basically to a halt. Without the sectionizer added, I get ~1000 
documents/minute. With that line added, I ran the pipeline for an hour and got 
no documents annotated. Can anyone suggest what's going wrong here and how to 
fix it?


FWIW, I created an .xml descriptor file and tested the sectionizer and .bsv in 
the CVD, and it worked as expected with only a small-to-moderate decrease in 
speed.

Thanks,
Sean

//---------------------------------------------------------------------------------------------------------------
// Description: Commands and parameters to create a default plaintext document
// processing pipeline with UMLS lookup. Used for back-annotation of existing
// documents. This takes the top x documents not already existing in the
// ytex.dbo.document table.
//  Database Reader
//  Read documents from a database.
reader org.apache.ctakes.ytex.uima.DBCollectionReader queryGetDocumentKeys="EXECUTE Ytex.Rptg.uspSrc_cTAKES_get_rad_notes_from_batch_backanno /*@pipelineCount*/ _pipelineCount_ ,/*@pipelineNumber*/ _pipelineNumber_", queryGetDocument="EXEC YTEX.Rptg.uspSrc_cTAKES_single_rad_note /*@note_id*/ :instance_id"
// using stored procedures for flexibility and to work around buggy regex in
// PiperFileReader.java

//  Regex Sectionizer -- added for experiment
//  Annotates Document Sections by detecting Section Headers using Regular
//  Expressions provided in a Bar-Separated-Value (BSV) File.
#   SectionsBsv  path to a BSV file containing a list of regular expressions
#   and corresponding section types.
add org.apache.ctakes.core.ae.BsvRegexSectionizer SectionsBsv=E:\ctakes\apache-ctakes-4.0.0\DefaultSectionRegex.bsv

// Load a simple token processing pipeline from another pipeline file
load DefaultTokenizerPipeline.piper

// Add non-core annotators
add ContextDependentTokenizerAnnotator
addDescription POSTagger

// Add Chunkers
load ChunkerSubPipe.piper

// Default fast dictionary lookup
//add DefaultJCasTermAnnotator
// optional: this may improve recall of low-level concepts
add OverlapJCasTermAnnotator

// Add Cleartk Entity Attribute annotators
load AttributeCleartkSubPipe.piper

// Optional: this may allow ctakes to do better with finding specific forms of
// generic terms without needing to add all permutations to dictionary
//load RelationSubPipe

//  XMI Writer 3
//  Writes XMI files with full representation of input text and all extracted
//  information.
add org.apache.ctakes.ytex.uima.annotators.DBConsumer analysisBatch="Radiology_test_DefaultFastPipeline7" storeDocText=false storeCAS=false typesToIgnore=org.apache.ctakes.typesystem.type.textspan.Sentence,org.apache.ctakes.typesystem.type.syntax.ContractionToken,org.apache.ctakes.typesystem.type.syntax.NewlineToken,org.apache.ctakes.typesystem.type.syntax.NumToken,org.apache.ctakes.typesystem.type.syntax.PunctuationToken,org.apache.ctakes.typesystem.type.syntax.SymbolToken,org.apache.ctakes.typesystem.type.syntax.NP,org.apache.ctakes.typesystem.type.syntax.VP,org.apache.ctakes.typesystem.type.textsem.RomanNumeralAnnotation,org.apache.ctakes.typesystem.type.textsem.PersonTitleAnnotation,org.apache.ctakes.typesystem.type.syntax.WordToken,org.apache.ctakes.typesystem.type.syntax.TreebankNode,org.apache.ctakes.typesystem.type.syntax.TopTreebankNode,org.apache.ctakes.typesystem.type.syntax.TerminalTreebankNode
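
One way to narrow this down might be to time each expression from the .bsv against a sample note outside the pipeline, in case a single pattern is unusually slow on these radiology notes (that is only a guess, nothing above confirms it). The sketch below is standalone Java using only java.util.regex; it does not call any cTAKES code. The class name SectionRegexTimer, the "name||regex" line layout, the comment prefixes it skips, and the sample entries are all assumptions made for illustration and may not match what BsvRegexSectionizer or DefaultSectionRegex.bsv actually use.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of cTAKES: times each regex against one text.
public class SectionRegexTimer {

    // Parses one line of the ASSUMED "name||regex" layout.
    // Returns null for blank lines, comment lines, and lines without "||".
    public static String[] parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("//") || trimmed.startsWith("#")) {
            return null;
        }
        String[] parts = trimmed.split("\\|\\|", -1);
        if (parts.length < 2) {
            return null;
        }
        return new String[] { parts[0].trim(), parts[1].trim() };
    }

    // Counts every match of the pattern in the text, scanning the whole input.
    public static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Returns elapsed nanoseconds per entry name, in input order.
    // Entries that share a name overwrite each other in this simple sketch.
    public static Map<String, Long> timeAll(List<String> lines, String text) {
        Map<String, Long> nanosByName = new LinkedHashMap<>();
        for (String line : lines) {
            String[] entry = parseLine(line);
            if (entry == null) {
                continue;
            }
            // Flags used by the real sectionizer are unknown here; none are set.
            Pattern pattern = Pattern.compile(entry[1]);
            long start = System.nanoTime();
            countMatches(pattern, text);
            nanosByName.put(entry[0], System.nanoTime() - start);
        }
        return nanosByName;
    }

    public static void main(String[] args) {
        // Hypothetical entries and note text, not taken from DefaultSectionRegex.bsv.
        List<String> lines = List.of(
            "// sample entries for illustration only",
            "FINDINGS||(?m)^FINDINGS:",
            "IMPRESSION||(?m)^IMPRESSION:");
        String note = "FINDINGS: No acute abnormality.\nIMPRESSION: Normal study.\n";
        timeAll(lines, note).forEach((name, nanos) ->
            System.out.println(name + " took " + nanos + " ns"));
    }
}
```

Feeding it the real .bsv lines and one of the notes that stalls would show whether any single expression dominates the run time. If every expression finishes quickly, the slowdown more likely comes from somewhere else in the pipeline.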
