RE: HSQLDB out of memory with custom dictionary

Gandhi Rajan Natarajan Fri, 13 Oct 2017 23:18:50 -0700

Hi Kathy,

When you use the dictionary generator GUI to generate the custom dictionary, it 
will create an XML file under the 
<CTAKES_HOME>\resources\org\apache\ctakes\dictionary\lookup\fast folder. For 
example, if you gave the name 'customdictionary' while generating the custom 
dictionary, then you will find an XML file named 'customdictionary.xml' in the 
above-mentioned folder.


Once you have the XML file, you need to change the HSQLDB specifications to MySQL 
specifications and reference that file in Pipeline.java. Please find a sample 
custom dictionary XML file at the following link: 
https://github.com/gandhirajan/cTAKES/tree/master/customDictionaryXML
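
As a rough illustration of that change, the sketch below rewrites the JDBC 
properties of a descriptor held in a string. The property keys (jdbcDriver, 
jdbcUrl, jdbcUser, jdbcPass) are the ones in the generated XML quoted further 
down this thread. Everything else is an assumption to replace with your own 
values: the MySQL driver class com.mysql.jdbc.Driver (Connector/J 5.x naming), 
the URL jdbc:mysql://localhost:3306/CTAKES_DATA, the user and password, and the 
hypothetical class name DescriptorJdbcSwap. It does not escape XML special 
characters, so a value containing '&' would need escaping first. Editing the 
four values by hand in a text editor works just as well.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: swaps JDBC property values in a dictionary descriptor. */
final class DescriptorJdbcSwap {

    // group 1: text up to the opening quote of the value, group 2: the key,
    // group 3: the current value, group 4: the closing quote.
    private static final Pattern PROPERTY = Pattern.compile(
            "(<property\\s+key=\"([^\"]+)\"\\s+value=\")([^\"]*)(\")");

    /** Returns xml with the value of every property named in newValues replaced. */
    static String swap(String xml, Map<String, String> newValues) {
        Matcher m = PROPERTY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(2);
            // Properties that are not listed keep their current value.
            String value = newValues.containsKey(key) ? newValues.get(key) : m.group(3);
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + value + m.group(4)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Assumed MySQL settings; none of these values come from cTAKES itself. */
    static Map<String, String> assumedMySqlValues() {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("jdbcDriver", "com.mysql.jdbc.Driver");
        values.put("jdbcUrl", "jdbc:mysql://localhost:3306/CTAKES_DATA");
        values.put("jdbcUser", "ctakes_user");
        values.put("jdbcPass", "change_me");
        return values;
    }

    public static void main(String[] args) {
        String hsqldb =
                "<property key=\"jdbcDriver\" value=\"org.hsqldb.jdbcDriver\"/>\n"
              + "<property key=\"rareWordTable\" value=\"cui_terms\"/>";
        // The driver value is replaced; rareWordTable is left untouched.
        System.out.println(swap(hsqldb, assumedMySqlValues()));
    }
}
```

The MySQL JDBC driver jar also has to be on the classpath of whatever runs the 
pipeline, otherwise the driver class cannot be loaded.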

Hope it helps. Cheers.

Regards,
Gandhi


-----Original Message-----
From: Kathy Ferro [mailto:[email protected]]
Sent: Saturday, October 14, 2017 12:57 AM
To: [email protected]
Subject: Re: HSQLDB out of memory with custom dictionary

Gandhi,

Thanks again for your response.

I am pretty new to cTAKES myself and my Java knowledge is not up to date.

I am looking at the sample source code from 
https://github.com/healthnlp/examples/tree/master/ctakes-temporal-demo. In 
Pipeline.java, it looks like only the dictionary name changes.

       builder.add( AnalysisEngineFactory.createEngineDescription(
                DefaultJCasTermAnnotator.class,
                AbstractJCasTermAnnotator.PARAM_WINDOW_ANNOT_KEY,
                "org.apache.ctakes.typesystem.type.textspan.Sentence",
                JCasTermAnnotator.DICTIONARY_DESCRIPTOR_KEY,
                "org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml")
       );


1. Do I change to the MySQL driver in (dictionary).xml? The snippet is below.
2. What do I do with the blue highlight?
3. If I leave hsqldb, would that just use the hsqldb script file?
4. If I change it, do you have a sample?

Right now, I run the pipeline using the new dictionary with this option "-l 
org/apache/ctakes/dictionary/lookup/fast/(dictionary name).xml" which loads the 
dictionary into hsqldb memory.


         <property key="jdbcDriver" value="org.hsqldb.jdbcDriver"/>
         <property key="jdbcUrl" value="jdbc:hsqldb:file:src/main/resources/org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab/sno_rx_16ab"/>


I really appreciate your help.
Kathy



On Wed, Oct 11, 2017 at 5:14 PM, Kathy Ferro <[email protected]>
wrote:

> Gandhi and Matthew,
>
> Thank you for the information.
>
> Kathy
>
> On Wed, Oct 11, 2017 at 1:35 AM, Gandhi Rajan Natarajan <
> [email protected]> wrote:
>
>> Hi Matthew,
>>
>> Please check out my response to Kathy. I feel that has the required
>> info to start off. Please let me know if you are looking for any
>> specific additional info.
>>
>> Regards,
>> Gandhi
>>
>>
>> -----Original Message-----
>> From: Matthew Vita [mailto:[email protected]]
>> Sent: Wednesday, October 11, 2017 11:00 AM
>> To: [email protected]
>> Subject: Re: HSQLDB out of memory with custom dictionary
>>
>> Hi Kathy and Gandhi,
>>
>> I started to put together a more formal solution for this here:
>> https://github.com/GoTeamEpsilon/cTAKES-HSQLDB-to-MySQL-Dictionary -
>> It is not perfect but it makes things a bit easier. I was able to
>> load in millions of records into MySQL, which is awesome!
>>
>> *If you have a non-trivial dictionary, chances are you will exhaust
>> HSQLDB's capabilities. By using this solution, you will have a MySQL
>> schema filled up with what would have been the HSQLDB data.*
>>
>> *This solution uses lazy lists and streams to keep memory usage low
>> when the script files are huge.*
>>
>> I have not got it working with the XML jdbc configuration yet so if
>> you (or anyone else) could share an example that would be amazing.
>>
>> Thanks,
>>
>> Matthew Vita
>> www.matthewvita.com
>>
>> On Tue, Oct 10, 2017 at 9:57 PM, Gandhi Rajan Natarajan <
>> [email protected]> wrote:
>>
>> > Hi Kathy,
>> >
>> > Good to hear from you. Please find the response below.
>> >
>> > NOTE: This is based on my experience with cTAKES so far. Please
>> > correct me if anyone finds the answers to be wrong.
>> >
>> > 1. Does it matter what the name of the database is?
>> >
>> > The name of the database really doesn't matter. But the name you have
>> > created should be mapped in the 'jdbcUrl' property of the XML file
>> > generated by the Dictionary GUI.
>> >
>> > 2. What configuration file do I change to switch to use the new
>> > database?
>> >
>> > If you are using the example downloaded from
>> > https://github.com/healthnlp/examples/tree/master/ctakes-temporal-demo ,
>> > then in Pipeline.java you need to reference the XML file generated
>> > using the Dictionary GUI instead of 'sno_rx_16ab.xml'.
>> >
>> > If you want to use the new database for CVD, then you have to change
>> > 'DEFAULT_DICT_DESC_PATH' in JCasTermAnnotator.java to point to the new
>> > XML file, rebuild the ctakes-dictionary-lookup-fast module, and use
>> > the resulting jar file.
>> >
>> > 3) Do you think I can use SQL server instead of MySQL?  My SQL
>> > seems to run faster.
>> >
>> > This choice is user specific and I can't comment on performance
>> > comparison as I have no clue on this.
>> >
>> >
>> >
>> > Regards,
>> > Gandhi
>> >
>> >
>> > -----Original Message-----
>> > From: Kathy Ferro [mailto:[email protected]]
>> > Sent: Tuesday, October 10, 2017 9:26 PM
>> > To: [email protected]
>> > Subject: Re: HSQLDB out of memory with custom dictionary
>> >
>> > Gandhi,
>> >
>> > My name is Kathy Ferro.
>> >
>> > Matthew and I are trying to accomplish the same thing.  I got the
>> > scripts loaded into both SQL Server and MySQL.  I did it in two ways.
>> > 1. Manually modified the scripts for each DB and ran them in the
>> > query analyzer window as you described.  Works fine if the data is
>> > small enough.  For a bigger file, it locks up.
>> > 2. Wrote a C# program to read the scripts and insert the records one
>> > by one when reloading them.
>> >
>> > My questions for you are:
>> >
>> > 2. What configuration file do I change to switch to use the new
>> > database?
>> > 3. Do you think I can use SQL server instead of MySQL?  My SQL
>> > seems to run faster.
>> >
>> > Thanks,
>> > Kathy
>> >
>> >
>> >
>> >
>> > On Tue, Oct 10, 2017 at 2:34 AM, Gandhi Rajan Natarajan <
>> > [email protected]> wrote:
>> >
>> > > Hi Matthew,
>> > >
>> > > The SQL looks fine. The only additional table I'm using apart
>> > > from the tables mentioned below is the MDR table (MedDRA related),
>> > > and I don't use the AIR table.
>> > >
>> > > Do you really think you need a Java program to convert those
>> > > insert statements to work with MySQL? I just opened the script
>> > > file in a text editor like EditPlus, did a find for `[\)]\n`, and
>> > > replaced it with `);\n` using the find-and-replace-all option with
>> > > regex, and that was all the scripts needed.
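
The find-and-replace above can also be done without opening the file in an 
editor, which may help when the script is too large to load comfortably. A 
minimal Java sketch, assuming the script has one statement per line and that 
every line ending in ')' needs a ';' terminator; the class name 
ScriptSemicolons and the two file arguments are hypothetical. It reads line by 
line, so memory use stays flat, and unlike the `[\)]\n` pattern it also 
terminates a final line that has no trailing newline.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: appends ';' to every script line that ends with ')'. */
final class ScriptSemicolons {

    /** Same effect as replacing ")\n" with ");\n", applied to a single line. */
    static String terminate(String line) {
        return line.endsWith(")") ? line + ";" : line;
    }

    /** Streams the input script to the output file one line at a time. */
    static void convert(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(terminate(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage (paths are placeholders): java ScriptSemicolons in.script out.sql
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Statements that do not end in ')' are passed through unchanged, so any 
HSQLDB-specific lines still have to be removed or converted by hand.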
>> > >
>> > > The only other thing is that you can load the data in parallel by
>> > > splitting the script file as mentioned earlier, which saves time,
>> > > and maybe you can write a Java program to split the file. This
>> > > is the easiest approach, I feel.
>> > >
>> > > Regards,
>> > > Gandhi
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Matthew Vita [mailto:[email protected]]
>> > > Sent: Tuesday, October 10, 2017 10:47 AM
>> > > To: [email protected]
>> > > Subject: Re: HSQLDB out of memory with custom dictionary
>> > >
>> > > Gandhi,
>> > >
>> > > I really appreciate this information. I have started working out
>> > > the schema and plan on writing a program that will automatically
>> > > prepare a script to work with MySQL. Work in progress. Can you do
>> > > a quick review of my MySQL schema so far?
>> > >
>> > > CREATE SCHEMA CTAKES_DATA;
>> > >
>> > > use CTAKES_DATA;
>> > >
>> > > CREATE TABLE CUI_TERMS (
>> > >   CUI BIGINT NOT NULL,
>> > >   RINDEX INT(128) NOT NULL,
>> > >   TCOUNT INT(128) NOT NULL,
>> > >   TEXT VARCHAR(255) NOT NULL,
>> > >   RWORD VARCHAR(48) NOT NULL
>> > > );
>> > > CREATE INDEX IDX_CUI_TERMS ON CUI_TERMS (RWORD);
>> > >
>> > > CREATE TABLE TUI (
>> > >   CUI BIGINT NOT NULL,
>> > >   TUI INT(128) NOT NULL
>> > > );
>> > > CREATE INDEX IDX_TUI ON TUI (CUI);
>> > >
>> > > CREATE TABLE PREFTERM (
>> > >   CUI BIGINT NOT NULL,
>> > >   PREFTERM VARCHAR(511) NOT NULL
>> > > );
>> > > CREATE INDEX IDX_PREFTERM ON PREFTERM (CUI);
>> > >
>> > > CREATE TABLE RXNORM (
>> > >   CUI BIGINT NOT NULL,
>> > >   RXNORM BIGINT NOT NULL
>> > > );
>> > > CREATE INDEX IDX_RXNORM ON RXNORM (CUI);
>> > >
>> > > CREATE TABLE SNOMEDCT_US (
>> > >   CUI BIGINT NOT NULL,
>> > >   SNOMEDCT_US BIGINT NOT NULL
>> > > );
>> > > CREATE INDEX IDX_SNOMEDCT_US ON SNOMEDCT_US (CUI);
>> > >
>> > > Quick question: do you use the AIR table?
>> > >
>> > > Thanks,
>> > >
>> > > Matthew Vita
>> > > www.matthewvita.com
>> > >
>> > > On Mon, Oct 9, 2017 at 1:14 AM, Gandhi Rajan Natarajan <
>> > > [email protected]> wrote:
>> > >
>> > > > Hi Matthew,
>> > > >
>> > > > First I would like to tell you that I am also a newbie in cTAKES.
>> > > > Unfortunately I couldn't find any documentation on this. I
>> > > > followed a crude way to accomplish it, as this is a one-time
>> > > > activity. This is what I did:
>> > > >
>> > > > 1) Used the dictionary generator GUI to generate Snomed, RxNorm and
>> > > > MedDRA dictionary data, which resulted in a '.script' file under my
>> > > > <ctakes_home>\resources\org\apache\ctakes\dictionary\lookup\fast\<project_name>
>> > > > folder.
>> > > > 2) The '.script' file has HSQLDB-specific queries. I removed
>> > > > the HSQLDB statements I did not need from the file and
>> > > > converted the rest to MySQL-specific queries manually.
>> > > > 3) I added semicolons at the end of each line in the script
>> > > > using a text editor and split the file into five parts. Then I
>> > > > ran those five script files in five different mysql command
>> > > > lines. It took me approximately 4 hours to pump all the data
>> > > > into the MySQL DB.
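
The split in step 3 can be scripted. A minimal Java sketch, assuming each line 
of the converted script is one complete statement and that the CREATE TABLE 
statements have already been run separately, so the order of the remaining 
INSERT lines does not matter; the class name ScriptSplitter and the part-N.sql 
file names are hypothetical. It deals lines out round-robin, so the parts end 
up about the same size without counting the lines first.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a one-statement-per-line script into N files. */
final class ScriptSplitter {

    /** Writes part-0.sql .. part-(parts-1).sql into outDir and returns their paths. */
    static List<Path> split(Path in, Path outDir, int parts) throws IOException {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be at least 1");
        }
        List<Path> outputs = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        try {
            for (int i = 0; i < parts; i++) {
                Path part = outDir.resolve("part-" + i + ".sql");
                outputs.add(part);
                writers.add(Files.newBufferedWriter(part, StandardCharsets.UTF_8));
            }
            try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
                String line;
                long count = 0;
                while ((line = reader.readLine()) != null) {
                    // Line k goes to file k mod parts (round-robin).
                    BufferedWriter writer = writers.get((int) (count % parts));
                    writer.write(line);
                    writer.newLine();
                    count++;
                }
            }
        } finally {
            for (BufferedWriter writer : writers) {
                writer.close();
            }
        }
        return outputs;
    }

    public static void main(String[] args) throws IOException {
        // Usage (paths are placeholders): java ScriptSplitter inserts.sql outDir 5
        split(Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
    }
}
```

Each part can then be fed to its own mysql command-line session, as described 
in step 3.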
>> > > >
>> > > > As I mentioned earlier, I'm not sure whether this is the right
>> > > > way to proceed. But with no documentation available for
>> > > > MySQL DB with cTAKES, this is the approach that worked for
>> > > > me. Hope it will be helpful.
>> > > >
>> > > > Regards,
>> > > > Gandhi
>> > > >
>> > > >
>> > > > -----Original Message-----
>> > > > From: Matthew Vita [mailto:[email protected]]
>> > > > Sent: Monday, October 09, 2017 10:41 AM
>> > > > To: [email protected]
>> > > > Subject: Re: HSQLDB out of memory with custom dictionary
>> > > >
>> > > > Gandhi,
>> > > >
>> > > > Thank you for the reply. Do you have any documentation on how
>> > > > to accomplish this?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Matthew Vita
>> > > > www.matthewvita.com
>> > > >
>> > > > On Sun, Oct 8, 2017 at 3:14 AM, Gandhi Rajan Natarajan <
>> > > > [email protected]> wrote:
>> > > >
>> > > > > Hi Matthew,
>> > > > >
>> > > > > I feel using a MySQL DB would be a better idea than using
>> > > > > in-memory HSQLDB. In fact, this also comes in handy when you
>> > > > > are planning to deploy cTAKES as a web application, as in our case.
>> > > > >
>> > > > > Regards,
>> > > > > Gandhi
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Matthew Vita [mailto:[email protected]]
>> > > > > Sent: Sunday, October 08, 2017 6:02 AM
>> > > > > To: [email protected]
>> > > > > Subject: HSQLDB out of memory with custom dictionary
>> > > > >
>> > > > > Hi Sean, Tim, cTAKES Community,
>> > > > >
>> > > > > I have put together what I am considering a pretty standard
>> > > > > dictionary with sources from the following:
>> > > > >
>> > > > >
>> > > > >    - MEDLINEPLUS
>> > > > >    - MSH
>> > > > >    - NCI
>> > > > >    - NDFRT
>> > > > >    - CHV
>> > > > >    - CSP
>> > > > >    - ICPC2P
>> > > > >    - MEDCIN
>> > > > >    - SNOMED
>> > > > >    - RXNORM
>> > > > >    - ICD10
>> > > > >
>> > > > >
>> > > > > However, when copied over to cTAKES (handled by the handy
>> > > > > Dictionary Creator GUI) HSQLDB runs out of memory.
>> > > > >
>> > > > > This is my first experience with HSQLDB so you’ll have to
>> > > > > excuse my limited knowledge here. I do understand that it can
>> > > > > run either in memory or on disk, but I’m not sure how to
>> > > > > configure this.
>> > > > >
>> > > > > Here is how I am connecting to it:
>> > > > >
>> > > > >
>> > > > >   <dictionary>
>> > > > >     <name>sno_rx_16abTerms</name>
>> > > > >     <implementationName>org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRareWordDictionary</implementationName>
>> > > > >     <properties>
>> > > > >       <property key="jdbcDriver" value="org.hsqldb.jdbcDriver" />
>> > > > >       <property key="jdbcUrl" value="jdbc:hsqldb:file:resources/org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab/sno_rx_16ab" />
>> > > > >       <property key="jdbcUser" value="sa" />
>> > > > >       <property key="jdbcPass" value="" />
>> > > > >       <property key="rareWordTable" value="cui_terms" />
>> > > > >       <property key="umlsUrl" value="https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser" />
>> > > > >       <property key="umlsVendor" value="NLM-6515182895" />
>> > > > >       <property key="umlsUser" value="CHANGE_ME" />
>> > > > >       <property key="umlsPass" value="CHANGE_ME" />
>> > > > >     </properties>
>> > > > >   </dictionary>
>> > > > >
>> > > > >   <dictionary>
>> > > > >
>> > > > >
>> > > > >
>> > > > > Can I configure HSQLDB to be used on disk? If this is not a
>> > > > > good approach, can I spin up MySQL in its place?
>> > > > >
>> > > > >
>> > > > > Sorry if this has been asked before.
>> > > > >
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Matthew Vita
>> > > > > www.matthewvita.com
>> > > > > This email and any files transmitted with it are confidential
>> > > > > and intended solely for the use of the individual or entity
>> > > > > to whom they are addressed.
>> > > > > If you are not the named addressee you should not
>> > > > > disseminate, distribute or copy this e-mail. Please notify
>> > > > > the sender or system manager by email immediately if you have
>> > > > > received this e-mail by mistake and delete this e-mail from
>> > > > > your system. If you are not the intended recipient you are
>> > > > > notified that disclosing, copying, distributing or taking any
>> > > > > action in reliance on the contents of this information is
>> > > > > strictly prohibited and against the law.
