Hi Chris,
I use "bsv" to denote "bar-separated values", also known as "pipe-delimited". I
typically name the files with a ".bsv" extension, and they are just plain old
boring ASCII flat files.
There should be multiple columns in the bsv file separated by the '|'
character. The following are all valid per-line formats:
CUI|text
CUI|TUI|text
CUI|TUI|text|preferredText
It doesn't matter which format you choose; the parser auto-detects the format
for each line. Starting a line with "//" or "#" indicates that it is a comment
and should be ignored.
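
To make the per-line auto-detection concrete, here is a minimal Python sketch of how such a line could be read. It is only an illustration, not the cTAKES parser: the names BsvEntry, parse_bsv_line and parse_bsv are hypothetical, the sample terms are made up, and raising an error on an unexpected column count is my own assumption about how to handle bad lines.

```python
from typing import Iterable, List, NamedTuple, Optional


class BsvEntry(NamedTuple):
    """One dictionary row; columns absent from the line are None."""
    cui: str
    tui: Optional[str]
    text: str
    preferred_text: Optional[str]


def parse_bsv_line(line: str) -> Optional[BsvEntry]:
    """Parse one line; return None for blank lines and comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("//") or stripped.startswith("#"):
        return None
    columns = [column.strip() for column in stripped.split("|")]
    if len(columns) == 2:  # CUI|text
        cui, text = columns
        return BsvEntry(cui, None, text, None)
    if len(columns) == 3:  # CUI|TUI|text
        cui, tui, text = columns
        return BsvEntry(cui, tui, text, None)
    if len(columns) == 4:  # CUI|TUI|text|preferredText
        cui, tui, text, preferred = columns
        return BsvEntry(cui, tui, text, preferred)
    # Assumption: anything else is treated as an error in this sketch.
    raise ValueError(f"Expected 2 to 4 columns, got {len(columns)}: {line!r}")


def parse_bsv(lines: Iterable[str]) -> List[BsvEntry]:
    """Parse many lines, dropping comments and blank lines."""
    parsed = (parse_bsv_line(line) for line in lines)
    return [entry for entry in parsed if entry is not None]


# Made-up sample content mixing all three formats and both comment styles.
sample = [
    "# my custom terms",
    "// another comment",
    "1|first term",
    "C2|T047|second term",
    "3|T047|third term|Third Term",
]
for entry in parse_bsv(sample):
    print(entry)
```

Because the format is decided from the column count of each line, the three formats can be mixed freely within one file.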
To add the bsv dictionary to your pipeline you just need to edit the
resources/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml file and add
three new entries, one in each of the sections below.
Within the <dictionaries> section, add:
<dictionary>
  <name>CustomCuiRareWord</name>
  <implementationName>org.apache.ctakes.dictionary.lookup2.dictionary.BsvRareWordDictionary</implementationName>
  <properties>
    <property key="bsvPath"
              value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
  </properties>
</dictionary>
Within the <conceptFactories> section, add:
<conceptFactory>
  <name>CustomCuiConcept</name>
  <implementationName>org.apache.ctakes.dictionary.lookup2.concept.BsvConceptFactory</implementationName>
  <properties>
    <property key="bsvPath"
              value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
  </properties>
</conceptFactory>
Within the <dictionaryConceptPairs> section, add:
<dictionaryConceptPair>
  <name>CustomPair</name>
  <dictionaryName>CustomCuiRareWord</dictionaryName>
  <conceptFactoryName>CustomCuiConcept</conceptFactoryName>
</dictionaryConceptPair>
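You can change any of the "Custom..." names (CustomCuiRareWord,
CustomCuiConcept, CustomPair), as long as the dictionaryName and
conceptFactoryName in the pair still match the names you declared. You should
obviously also point bsvPath to the actual path of your bsv file.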
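
If you rename things, a quick sanity check of the wiring can save some head-scratching. The Python sketch below is only an illustration and not part of cTAKES: find_unresolved_pairs is a hypothetical helper, the wrapper root element in the sample is an assumption, and the sample is trimmed down to the name elements that the check looks at.

```python
import xml.etree.ElementTree as ET
from typing import List


def _names(root: ET.Element, tag: str) -> set:
    """Collect the <name> text of every element with exactly this tag."""
    return {(element.findtext("name") or "").strip() for element in root.iter(tag)}


def find_unresolved_pairs(xml_text: str) -> List[str]:
    """Report pairs that point at an undeclared dictionary or concept factory."""
    root = ET.fromstring(xml_text)
    dictionary_names = _names(root, "dictionary")
    factory_names = _names(root, "conceptFactory")
    problems = []
    for pair in root.iter("dictionaryConceptPair"):
        pair_name = (pair.findtext("name") or "").strip()
        dictionary_name = (pair.findtext("dictionaryName") or "").strip()
        factory_name = (pair.findtext("conceptFactoryName") or "").strip()
        if dictionary_name not in dictionary_names:
            problems.append(f"{pair_name}: unknown dictionary '{dictionary_name}'")
        if factory_name not in factory_names:
            problems.append(f"{pair_name}: unknown concept factory '{factory_name}'")
    return problems


# Trimmed sample; the root element name is an assumption for illustration.
EXAMPLE_XML = """
<lookupSpecification>
  <dictionaries>
    <dictionary><name>CustomCuiRareWord</name></dictionary>
  </dictionaries>
  <conceptFactories>
    <conceptFactory><name>CustomCuiConcept</name></conceptFactory>
  </conceptFactories>
  <dictionaryConceptPairs>
    <dictionaryConceptPair>
      <name>CustomPair</name>
      <dictionaryName>CustomCuiRareWord</dictionaryName>
      <conceptFactoryName>CustomCuiConcept</conceptFactoryName>
    </dictionaryConceptPair>
  </dictionaryConceptPairs>
</lookupSpecification>
"""

print(find_unresolved_pairs(EXAMPLE_XML))  # prints []
```

An empty list means every pair refers to a declared dictionary and concept factory; it says nothing about whether the bsvPath itself can be found.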
In addition to detecting your column count/style, upon loading the text will be
lower-cased and tokenized, and the terms will be indexed by rare word (for fast
lookup). Also, you do not need to write out the whole "C1234567" or "T123"
CUI and TUI codes. The default prefix characters and padding zeros are
added automatically. CUIs "1", "01", "C1" and "C01" will all be stored as
"C0000001", and TUIs are handled likewise. If you have custom CUIs then it will
honor non-"C" prefixes and still pad zeros automatically based upon the longest
entry. For instance, if your bsv has "CAM1", "CAM12" and "CAM12345" then the
stored custom CUIs should be "CAM00001", "CAM00012" and "CAM12345".
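
The padding rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the cTAKES code: split_code and normalize_codes are hypothetical names, the default widths (7 digits for CUIs, 3 for TUIs) are read off the "C1234567" and "T123" examples, and I am assuming that the width for a custom prefix is the longest digit run seen for that prefix and that prefixes are upper-cased.

```python
import re
from typing import Iterable, List, Tuple

# Optional letter prefix followed by one or more digits.
_CODE_PATTERN = re.compile(r"^([A-Za-z]*)(\d+)$")


def split_code(code: str) -> Tuple[str, str]:
    """Split a code into (upper-cased prefix, digit string)."""
    match = _CODE_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"Unrecognized code: {code!r}")
    return match.group(1).upper(), match.group(2)


def normalize_codes(
    codes: Iterable[str], default_prefix: str = "C", default_digits: int = 7
) -> List[str]:
    """Add the default prefix where missing and left-pad digits with zeros."""
    parts = [split_code(code) for code in codes]
    # Default prefix uses a fixed width; custom prefixes use their longest entry.
    widths = {default_prefix: default_digits}
    for prefix, digits in parts:
        key = prefix or default_prefix
        if key != default_prefix:
            widths[key] = max(widths.get(key, 0), len(digits))
    normalized = []
    for prefix, digits in parts:
        key = prefix or default_prefix
        normalized.append(key + digits.zfill(widths[key]))
    return normalized


print(normalize_codes(["1", "01", "C1", "C01"]))
# ['C0000001', 'C0000001', 'C0000001', 'C0000001']
print(normalize_codes(["CAM1", "CAM12", "CAM12345"]))
# ['CAM00001', 'CAM00012', 'CAM12345']
print(normalize_codes(["47", "T047"], default_prefix="T", default_digits=3))
# ['T047', 'T047']
```

Note that the custom-prefix width can only be known after the whole file has been read, which is why the sketch makes two passes over the codes.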
I think that is about all that there is to it ...
Sean
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:[email protected]]
Sent: Tuesday, October 06, 2015 4:31 PM
To: [email protected]
Subject: Re: How to update cTAKES so that new top level categories come out
based on local dictionary?
Hi Sean,
Thanks so much for your reply. For now I don’t care about the secondary
codes and I for sure have < 1000 terms. Can you tell me how to wire up
the BSV file by editing specific places in cTAKES? What specific commands
should I run or what format should the BSV file look like? I must admit
I have never heard of BSV files and the Internet varies on these between
Bluespec System Verilog and BASIC bsave files.
Then after I make the BSV file, what steps next? Recompile cTAKES? Can
I take the BSV file and simply point to it from a binary installation of
cTAKES? Thank you!
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:
https://urldefense.proofpoint.com/v2/url?u=http-3A__sunset.usc.edu_-7Emattmann_&d=BQIGaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=bLdoNVceobXShsqfGFdPDKSiq2WNSUbGDHdvmrfMj10&s=CXhGiFUuPnSekOe4GnsuxPOgYHbNp-hAnOD8jmB-lgc&e=
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: "Finan, Sean" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, October 6, 2015 at 8:05 AM
To: "[email protected]" <[email protected]>
Subject: RE: How to update cTAKES so that new top level categories come
out based on local dictionary?
>Hi Chris,
>
>There are a few ways to do this:
>1. Create an additional dictionary with the terms of interest and add it
>as a source
>2. Create a new dictionary hsqldb that contains everything, old and new
>3. Add to the existing hsqldb dictionary
>
>The best approach for you would probably depend upon
>1. How many new terms you have
>2. Whether or not you desire additional codes, i.e. rxnorm, snomed
>
>If you don't have many new terms (<1000) and you don't care about
>secondary codes then the easiest thing would be to create a BSV file with
>the new terms and cuis.
>
>If you have a lot of new terms or do care about secondary codes, then a
>less facile solution would be to create a new hsqldb with only the new
>info or a complete replacement with new and old/existing terms. Of the
>two hsql options creating a new all-inclusive database would probably be
>easier unless you want to learn the ins and outs of hsql. If all of the
>terms are in the umls, then the new all-inclusive hsqldb would definitely
>be easiest (I think) as you could use the dictionary tool to create it.
>
>If you let me know your exact situation then I may be able to better
>expound.
>
>Sean
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:[email protected]]
>Sent: Monday, October 05, 2015 7:36 PM
>To: [email protected]
>Subject: How to update cTAKES so that new top level categories come out
>based on local dictionary?
>
>Hi cTAKES team,
>
>Hope you’re well! I had a quick question. I was wondering if someone
>could provide me a step-by-step guide to updating cTAKES to be based
>off a local dictionary, so that in addition to e.g.,
>
>ProceduralMention
> Value1 position etc
> Value2 position etc
>
>MedicationMention
> Value1 position etc
> Value2 position etc
>
>NewTopLevelCategoryFromMyDictionary
> FoundValue1 position etc
> FoundValue2 position etc
>
>I realize this has something to do with updating the annotation
>descriptions etc in XML, so if someone could just tell me what
>to update I’d really appreciate it.
>
>Thank you!
>
>Cheers,
>Chris