I am finding that the addition of BsvRegexSectionizer to my pipeline (below) has slowed it basically to a halt. Without the sectionizer added, I get ~1000 documents/minute. With that line added, I ran the pipeline for an hour and got no documents annotated. Can anyone suggest what's going wrong here and how to fix it?
FWIW I created a .xml descriptor file and tested the sectionizer and .bsv in the CVD and it worked as expected with only a small-moderate decrease in speed.

Thanks,
Sean

//---------------------------------------------------------------------------------------------------------------
// Description: Commands and parameters to create a default plaintext document processing pipeline with UMLS lookup. Used for back-annotation of existing documents. This takes the top x documents not already existing in the ytex.dbo.document table.

// Database Reader
// Read documents from a database.
reader org.apache.ctakes.ytex.uima.DBCollectionReader queryGetDocumentKeys="EXECUTE Ytex.Rptg.uspSrc_cTAKES_get_rad_notes_from_batch_backanno /*@pipelineCount*/ _pipelineCount_ ,/*@pipelineNumber*/ _pipelineNumber_", queryGetDocument="EXEC YTEX.Rptg.uspSrc_cTAKES_single_rad_note /*@note_id*/ :instance_id"
// using stored procedures for flexibility and to work around buggy regex in PiperFileReader.java

// Regex Sectionizer -- added for experiment
// Annotates Document Sections by detecting Section Headers using Regular Expressions provided in a Bar-Separated-Value (BSV) File.
# SectionsBsv path to a BSV file containing a list of regular expressions and corresponding section types.
add org.apache.ctakes.core.ae.BsvRegexSectionizer SectionsBsv=E:\ctakes\apache-ctakes-4.0.0\DefaultSectionRegex.bsv

// Load a simple token processing pipeline from another pipeline file
load DefaultTokenizerPipeline.piper

// Add non-core annotators
add ContextDependentTokenizerAnnotator
addDescription POSTagger

// Add Chunkers
load ChunkerSubPipe.piper

// Default fast dictionary lookup
//add DefaultJCasTermAnnotator
// optional: this may improve recall of low-level concepts
add OverlapJCasTermAnnotator

// Add Cleartk Entity Attribute annotators
load AttributeCleartkSubPipe.piper

// Optional: this may allow ctakes to do better with finding specific forms of generic terms without needing to add all permutations to dictionary
//load RelationSubPipe

// XMI Writer 3
// Writes XMI files with full representation of input text and all extracted information.
add org.apache.ctakes.ytex.uima.annotators.DBConsumer analysisBatch="Radiology_test_DefaultFastPipeline7" storeDocText=false storeCAS=false typesToIgnore=org.apache.ctakes.typesystem.type.textspan.Sentence,org.apache.ctakes.typesystem.type.syntax.ContractionToken,org.apache.ctakes.typesystem.type.syntax.NewlineToken,org.apache.ctakes.typesystem.type.syntax.NumToken,org.apache.ctakes.typesystem.type.syntax.PunctuationToken,org.apache.ctakes.typesystem.type.syntax.SymbolToken,org.apache.ctakes.typesystem.type.syntax.NP,org.apache.ctakes.typesystem.type.syntax.VP,org.apache.ctakes.typesystem.type.textsem.RomanNumeralAnnotation,org.apache.ctakes.typesystem.type.textsem.PersonTitleAnnotation,org.apache.ctakes.typesystem.type.syntax.WordToken,org.apache.ctakes.typesystem.type.syntax.TreebankNode,org.apache.ctakes.typesystem.type.syntax.TopTreebankNode,org.apache.ctakes.typesystem.type.syntax.TerminalTreebankNode