from:"Finan, Sean"

RE: to involve in your development group

2013-07-22 Thread Finan, Sean

Hi Sandeep,

I just took a peek at the JavaOcr code, and it looks like they perform image 
filtering in the PixelImage class.  This would probably cause a problem with 
dot matrix images as every corner of every dot would be removed as noise, so 
dots that participate in curves on characters such as P would be removed to 
form something more like |'.  In fact, depending upon the spacing between 
matrix dots and the resolution of the scan, the filter could decrease the size 
of each dot, making it very difficult for the ocr to work at all.

Assuming that you have already tried to train the software using your dot 
matrix printings, you could change JavaOcr to use Java Advanced Imaging (JAI).  
You would then use the JAI Raster class instead of the JavaOcr PixelImage class 
for image manipulation.  There are a lot of things that you could do from that 
point forward.
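
To make the dot-size concern concrete, here is a minimal sketch of the opposite
operation: growing (dilating) the dark pixels so that neighbouring matrix dots
merge into strokes before any noise filter gets a chance to erode them.  This is
an illustration under assumptions, not JavaOcr or JAI code: it uses only the
JDK's java.awt.image classes, it expects a grayscale image, and the class name,
radius and threshold are made up for the example.

```java
import java.awt.image.BufferedImage;
import java.awt.image.Raster;
import java.awt.image.WritableRaster;

// Hypothetical helper, not part of JavaOcr or JAI.
final class DotMatrixDilation {

    /**
     * Returns a new black/white image in which every pixel within `radius`
     * of a dark pixel (band-0 sample below `threshold`) is painted black.
     * Expects an image whose band 0 is a gray level, e.g. TYPE_BYTE_GRAY.
     */
    static BufferedImage dilateDark(BufferedImage gray, int radius, int threshold) {
        Raster in = gray.getRaster();
        int w = in.getWidth();
        int h = in.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster dst = out.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                boolean dark = false;
                // Scan the square neighbourhood; stop as soon as one dark pixel is found.
                for (int dy = -radius; dy <= radius && !dark; dy++) {
                    for (int dx = -radius; dx <= radius && !dark; dx++) {
                        int nx = x + dx;
                        int ny = y + dy;
                        if (nx >= 0 && nx < w && ny >= 0 && ny < h
                                && in.getSample(nx, ny, 0) < threshold) {
                            dark = true;
                        }
                    }
                }
                dst.setSample(x, y, 0, dark ? 0 : 255);
            }
        }
        return out;
    }
}
```

A suitable radius would depend on the dot spacing and the scan resolution, which
is the same dependency described above for the filter shrinking the dots.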

Just giving you my initial thought,

Sean

-Original Message-
From: sandeep rg [mailto:sandeep.f...@gmail.com] 
Sent: Monday, July 22, 2013 10:06 AM
To: dev@ctakes.apache.org
Subject: Re: to involve in your development group

Sir,
I have gone through some medical records, such as bills and patient
details. Most of them are printed on dot matrix printers, which makes it
very hard to extract the text from scanned images. I have tested some
professional software, such as ABBYY FineReader, which also gave poor
output.

But sir, I am confident that I can do it. I need more knowledge about
image processing, though, so can you suggest anyone on your team who is
good at image processing programming?


On Thu, Jul 18, 2013 at 1:22 AM, sandeep rg sandeep.f...@gmail.com wrote:

 I have done a sequence diagram and made some small changes. Please go through
 it and tell me if anything more should be included.


 On Wed, Jul 17, 2013 at 9:37 PM, sandeep rg sandeep.f...@gmail.com wrote:

 It is just a skeleton of the original proposal.


 On Wed, Jul 17, 2013 at 9:31 PM, sandeep rg sandeep.f...@gmail.com wrote:

 The sample work is shared with you both. If any more details should be
 included, please tell me.
 The GUI design, schedule and implementation flow chart are still to be
 added; they are under construction and will be uploaded within a few hours.


 On Wed, Jul 17, 2013 at 7:56 PM, Chen, Pei 
 pei.c...@childrens.harvard.edu wrote:

 pei.stat...@gmail.com

  -Original Message-
  From: Mattmann, Chris A (398J) [mailto:chris.a.mattm...@jpl.nasa.gov]
  Sent: Wednesday, July 17, 2013 10:22 AM
  To: dev@ctakes.apache.org
  Subject: Re: to involve in your development group
 
  chris.mattm...@gmail.com
 
  ++
  
  Chris Mattmann, Ph.D.
  Senior Computer Scientist
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 171-266B, Mailstop: 171-246
  Email: chris.a.mattm...@nasa.gov
  WWW:  http://sunset.usc.edu/~mattmann/
  ++
  
  Adjunct Assistant Professor, Computer Science Department University of
  Southern California, Los Angeles, CA 90089 USA
  ++
  
 
 
 
 
 
 
  -Original Message-
  From: sandeep rg sandeep.f...@gmail.com
  Reply-To: dev@ctakes.apache.org dev@ctakes.apache.org
  Date: Wednesday, July 17, 2013 6:53 AM
  To: dev@ctakes.apache.org dev@ctakes.apache.org
  Subject: Re: to involve in your development group
 
  can you provide your gmail id to share the proposal document with you?
  
  
  
  On Tue, Jul 16, 2013 at 11:33 PM, sandeep rg sandeep.f...@gmail.com wrote:
  
   Sir,
   I will provide the proposal within two days. For now I am mainly going
   through the ASF-ICFOSS gateway, because if I go their way and my proposal
   is selected, ICFOSS will provide us with some support, such as
   certificates and small financial support.
  
   But the main thing is that I like programming. I like to explore new
   technologies in coding and to interact with the code. So even if my
   proposal is rejected, I would still like to work on your project as a
   volunteer, if you allow me.
  
   I am now preparing the proposal and will submit it within 2 days.
   Chris Mattmann helped me learn more about the proposal format.
  
  
   On Tue, Jul 16, 2013 at 8:12 PM, Chen, Pei pei.c...@childrens.harvard.edu wrote:
  
   Chris/Sandeep,
   According to ASF-ICFOSS, I believe the deadline for submitting proposals
   is this coming Friday (July 19).
   After which point, mentors will have 2 weeks to review and score/accept.
   Just curious, are we planning to follow the same process here?  Or, since
   it's all volunteer work, technically Sandeep can still contribute code to
   the community and participate in the dev group here.
  
   Looking forward to it.
   --Pei
  
  
-Original Message-
From: sandeep rg [mailto:sandeep.f...@gmail.com]
Sent: Monday, July 15, 2013 1:05 PM
To:

RE: cTAKES user interface

2013-10-30 Thread Finan, Sean

 Sean Finan (I think is on this group) already wrote a command line CPE runner 
 like Pei described.
I am in this group, and I have written a very simple cli cpe runner.  As Pei 
mentioned:

 The problem is that most of us who are already familiar with the nitty 
 gritty are probably doing this with some sort of custom scripts or solution.
The class that I have is probably not doing anything that others are not - in 
fact, I'm sure that I used somebody else's code as a template as I am not that 
familiar with Senior Nitty Gritty.

I committed (Trunk, 1537124) a class named CmdLineCpeRunner.java to 
ctakes-utils in package ...utils.cpe   It was so quick 'n dirty that there 
isn't any documentation, no logging, etc. but it gets the job done.  It takes a 
path to a cpe.xml file as an argument and simply runs the pipeline specified 
therein.  

I suggest that James has the correct startup approach:
 However you need to have a classpath set properly. To accomplish that, you 
 could try copying runctakesCPE.bat or runctakesCPE.sh and within the script
file, replacing org.apache.uima.tools.cpm.CpmFrame with [CLASS TO CALL]

To the best of my knowledge the easiest way to create the cpe.xml file is 
probably to run through the gui once, setting up the pipeline and saving the 
xml - but run through at least once to make certain that the pipeline works.
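
As a rough sketch of the startup approach quoted above (same classpath, a
different class to call, and the cpe.xml path as the only argument), the launch
command could be assembled like this.  The helper below is hypothetical and is
not the committed runner: the lib directory is an assumption, the main class is
left as a parameter because its full package name is not spelled out here, and
a real cTAKES classpath probably needs more entries than a single jar directory.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
final class CpeLaunchCommand {

    /**
     * Builds: java -cp <libDir>/* <mainClass> <cpeXmlPath>
     * mainClass is whatever fully qualified name the runner has in your checkout.
     */
    static List<String> build(String libDir, String mainClass, String cpeXmlPath) {
        // "dir/*" is the JVM's own classpath wildcard for every jar in dir.
        // It is passed as a literal argument, so no shell expansion is involved.
        String classpath = libDir + File.separator + "*";
        return Arrays.asList("java", "-cp", classpath, mainClass, cpeXmlPath);
    }
}
```

The resulting list can be handed to new ProcessBuilder(command).inheritIO().start()
from a small wrapper, or simply printed and pasted into a copy of the run script.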

Enjoy,
Sean

-Original Message-
From: Lingren, Todd [mailto:todd.ling...@cchmc.org] 
Sent: Wednesday, October 30, 2013 9:52 AM
To: dev@ctakes.apache.org
Cc: Finan, Sean
Subject: RE: cTAKES user interface

Hi all,
Sean Finan (I think is on this group) already wrote a command line CPE runner 
like Pei described. I've been using it and would be happy to provide some user 
guides if he provides the class,etc. 

Todd Lingren
Biomedical Informatics
Cincinnati Children's Hospital
todd.ling...@cchmc.org
513-803-9032


-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Tuesday, October 29, 2013 9:56 PM
To: dev@ctakes.apache.org
Subject: Re: cTAKES user interface

Thanks William and Richard, those are both really excellent pointers.
Tim

On 10/29/2013 07:58 PM, William Karl Thompson wrote:
 Nice! 

 +1 for Groovy. It's like being able to program in Python again.

 -Original Message-
 From: Richard Eckart de Castilho [mailto:r...@apache.org] 
 Sent: Tuesday, October 29, 2013 5:49 PM
 To: dev@ctakes.apache.org
 Subject: Re: cTAKES user interface

 Maven allows you to do marvelous things on the CLI, provided you throw in an 
 additional component: Groovy.

 We did some amazing self-contained Groovy scripts with uimaFIT and DKPro Core 
 which you might find interesting

   http://code.google.com/p/dkpro-core-asl/wiki/DKProGroovyCookbook

 -- Richard

 On 29.10.2013, at 23:09, Miller, Timothy 
 timothy.mil...@childrens.harvard.edu wrote:

 I think this is also an area where Maven integration was a small step 
 backwards (I greatly appreciate the steps forward it allowed). I used to run 
 stuff from the command line and in scripts more often but it's slightly less 
 straightforward setting up the classpath with maven -- before you could put 
 a simple java -cp lib/*.jar class name in a script, now I'm not sure how 
 to go about it using maven. I'm sure there's a way, but I am afraid of 
 falling down the maven rabbit hole.
 Tim


 On Oct 29, 2013, at 5:53 PM, Chen, Pei wrote:

 +1
 Pan, the short answer is yes- it can be done in CLI.  
 The problem is that most of us who are already familiar with the nitty 
 gritty are probably doing this with some sort of custom scripts or solution.
 Cc'ing the dev group to get a fresh perspective; not sure what the easiest 
 way would be -- running the CPE via command line with default input/output 
 directories or running a Driver Main Class as part of examples.

 --Pei

RE: cTAKES user interface

2013-10-30 Thread Finan, Sean

Well, thanks to my not checking the utils pom (or building trunk since I'm 
currently still in incubator), I made Jenkins angry.  Instead of adding uima as 
a dependency to ctakes-utils, I moved the cpe cli to ctakes-core.  I hope that 
works.  My apologies to anybody that checked out in the last hour.


RE: Sundry; Problem Lists

2013-11-04 Thread Finan, Sean

 Hi John,

I hope that you didn't think that I was belittling your ideas or saying that 
anything has been done (and done).  I was just throwing in two resources for 
further thought.  You have brought forward some great applications for cTakes 
and nlp!  

Sean

From: John Green [john.travis.gr...@gmail.com]
Sent: Thursday, October 31, 2013 7:26 PM
To: dev@ctakes.apache.org
Subject: RE: Sundry; Problem Lists

Last point: I seem to be interested in a current encounter (the now) and 
diagnosis; the article seems to be interested in an arguably just as useful 
tool, the longitudinal problem list (the ever), though one very different in 
approach, I would think.




Thoughts?

Jg








On Thu, Oct 31, 2013 at 7:22 PM, John Green john.travis.gr...@gmail.com
wrote:

 Sean - quick note: after looking at the above two resources, a couple of 
 points.  The first resource confirms what I expected, that the vocabulary 
 exists in ctakes. The second confirms what I suspected: that novel approaches 
 to ordering and identification of top members of a problem list are needed. 
 Namely, that the vocabulary may be there, but that's only a tenth of the 
 battle. The second great resource you sent me acknowledges this - that 
 prioritization, eg enumeration from most important to least, as well as 
 clumping, are the true battle.
 A point of clarification on my end: it would be interesting to see what could 
 be added on top of existing ctakes in order to facilitate a solution to the 
 second problem - clumping and prioritizing. (For instance, from the second 
 article, an acute process may have nothing to do with the past medical history 
 and if an algorithm were concerned with all members as equals, it would miss 
 the issue at hand).
 Just as a thought: working back from the known natural history of diseases 
 would possibly be a route to a solution.
 This is probably well known stuff, so please forgive my ignorance if it's all 
 been done/thought of before.
 Again, the two links were very helpful, thank you.
 Jg
 On Thu, Oct 31, 2013 at 2:04 PM, Finan, Sean
 sean.fi...@childrens.harvard.edu wrote:
 I don't know if what I write below truly applies to the discussion, but here 
 it is.
 Much of a problem list definition may already be contained to varying degrees
 in existing cTakes databases.
 The UMLS does provide a problem list, but I haven't looked at it.
 http://www.nlm.nih.gov/research/umls/Snomed/core_subset.html
 This might be a paper of interest to you:
 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2655994/
 It discusses the use of nlp to create something like a problem list.
 Sean
 
 From: John Green [john.travis.gr...@gmail.com]
 Sent: Thursday, October 31, 2013 12:02 PM
 To: dev@ctakes.apache.org
 Subject: Re: Sundry
 Pei and Tim - Good questions.
 The bottom line is that OPQRST is the algorithm that every clinician uses
 to characterize the history of a sign, symptom or constellation of
 symptoms. Each letter has multiple meanings, but generally they're grouped.
 O is for onset: was it quick or slow in onset? P is for palliative or
 provoking phenomena: does Tylenol make it better? Does it feel better when
 you lean forward? Is it worse with standing? Q is the quality, generally
 (I could give more examples of each, but I'll keep it brief from here). R is
 generally region or radiation of the pain and/or sign. S is the severity,
 and T is the time course: is it intermittent? When it happens, how long
 does it last? I could send documents used to teach new clinicians, for
 anyone who wants to understand it better.
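
 As a small illustration of how the six letters could become slots that
 software fills in, here is a sketch of one possible representation.  It is
 only an assumption about how one might model the mnemonic; nothing like it
 is claimed to exist in cTAKES, and the type names and the sample findings
 used with it are invented for the example.

```java
import java.util.EnumMap;
import java.util.Map;

// One constant per letter of the mnemonic, labelled as described above.
enum Opqrst {
    O("Onset"),
    P("Palliative or provoking factors"),
    Q("Quality"),
    R("Region or radiation"),
    S("Severity"),
    T("Time course");

    private final String label;

    Opqrst(String label) {
        this.label = label;
    }

    String label() {
        return label;
    }
}

// Hypothetical container: the findings recorded so far for one sign or symptom.
final class SymptomHistory {
    private final String symptom;
    private final Map<Opqrst, String> findings = new EnumMap<>(Opqrst.class);

    SymptomHistory(String symptom) {
        this.symptom = symptom;
    }

    SymptomHistory with(Opqrst slot, String finding) {
        findings.put(slot, finding);
        return this;
    }

    String symptom() {
        return symptom;
    }

    String finding(Opqrst slot) {
        return findings.get(slot); // null when the note never mentions this slot
    }

    boolean isComplete() {
        return findings.size() == Opqrst.values().length;
    }
}
```

 For instance, new SymptomHistory("chest pain").with(Opqrst.O, "sudden") leaves
 five slots empty, and isComplete() makes that gap visible to later processing.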
 OPQRST, while most residents would assume it is only for teaching new
 clinicians, as Tim said, is a useful tool at all levels. Great clinicians,
 and I work with some great senior folks, use this every day. The idea that
 it is only for teaching is founded on two things: one, that it doubles as a
 structured mnemonic for characterizing signs and symptoms, and two, that
 everyone so thoroughly ingrains this into their clinical skill set that,
 unless they are geared toward teaching, they never think about it again
 after the basic level! Caveat: many good clinicians will tell you to keep it
 algorithmic so that you're systematic and do not overlook details.
 What is its application to ML? Obviously the furthest desired end-state
 for NLP like cTakes would be understanding a clinical encounter to such a
 nuanced level that detailed diagnoses could be considered along with
 treatment plans. While I only know what I've read in Artificial
 Intelligence: A Modern Approach and picked up from friends over the years
 who were knowledgeable in this field, I feel that OPQRST would be a
 huge benefit toward beginning to outline the problem of more rigorous ML
 characterization of the clinical narrative.
 The utility of OPQRST may not still be entirely clear to those who have

RE: specificity in selecting EntityMentions when using AggregatePlaintextUMLSProcessor

2013-11-04 Thread Finan, Sean

Hi Ted,

In addition to performing searches, the hyperSql ( http://hsqldb.org/ ) 
database tool should allow you to perform inserts into the umls dictionary 
database used by cTakes.

You can also create your own customized dictionary and run cTakes using only 
that dictionary or with umls plus that dictionary.  There are several ways to 
create a custom dictionary, and I think that you can start by looking in the 
resources/ ... /dictionary/lookup/ directory for examples.  It can be a little 
overwhelming if you just want to add one or two terms, and I am in the process 
of trying to make this a little easier for any user.  It may be a while before 
I can add my work to the trunk.   Until then, if you decide to go with the csv 
approach you can probably make it through with the examples in cTakes 
resources.  If you want to create a new hsql database then I can send you my 
(old) instructions on that process - but it might be overkill.

If you really want to know what lies behind the mask of the cTakes umls 
dictionary then I highly recommend that you just interface with it directly 
using the hsql tool.

Sean


From: Assur, Ted [theodore.as...@providence.org]
Sent: Friday, November 01, 2013 5:36 PM
To: dev@ctakes.apache.org
Subject: RE: specificity in selecting EntityMentions when using 
AggregatePlaintextUMLSProcessor

OK, Kind of resurfacing the original topic on this one, after I redirected it 
towards ICD codes last month:

I have several examples, like the one below, where it would be very helpful to 
be able to include UMLS terms that are in the UMLS 2011AB release, e.g. CIN 1 
(CUI = C0349458).

So if I have particular UMLS concepts I want to make sure and include, is there 
a way for me to *add* them to the umls dictionary used by cTAKES?

Ted


-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, September 04, 2013 9:37 AM
To: dev@ctakes.apache.org
Subject: RE: specificity in selecting EntityMentions when using 
AggregatePlaintextUMLSProcessor

I don't know if this is exactly what you want, but you can use the hyperSql ( 
http://hsqldb.org/ ) database tool to perform searches on the umls dictionary 
used by cTakes.
For instance  select * from UMLS_MS_2011AB where FWORD = 'CIN'  will provide 
all the available terms starting with CIN.  In the result you'll see that there 
is no term CIN I, and you'll also see that the only listing from ICD9 is for 
CIN III [C0851140, T191, MTHICD9 233.1]
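
For anyone who would rather run that query from code than from the database
tool, a minimal JDBC sketch follows.  The table and column names come from the
query above; everything else is an assumption: the path to the dictionary
database is left as a command-line argument, the SA user with an empty password
is only the usual HSQLDB default, and the HSQLDB driver jar has to be on the
classpath at run time.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;

// Hypothetical helper for illustration; not part of cTAKES.
final class UmlsFirstWordQuery {

    // Same query as in the message above, with the first word as a parameter.
    static final String QUERY = "select * from UMLS_MS_2011AB where FWORD = ?";

    /** Prints every column of every row whose FWORD equals firstWord. */
    static void printTerms(Connection conn, String firstWord) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(QUERY)) {
            ps.setString(1, firstWord);
            try (ResultSet rs = ps.executeQuery()) {
                // Column names are not assumed; the metadata says how many there are.
                ResultSetMetaData meta = rs.getMetaData();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        if (i > 1) {
                            row.append(" | ");
                        }
                        row.append(rs.getString(i));
                    }
                    System.out.println(row);
                }
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length != 1) {
            System.err.println("usage: UmlsFirstWordQuery <path-to-hsqldb-database>");
            return;
        }
        // Assumed defaults: file-based database, user SA, empty password.
        String url = "jdbc:hsqldb:file:" + args[0];
        try (Connection conn = DriverManager.getConnection(url, "SA", "")) {
            printTerms(conn, "CIN");
        }
    }
}
```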

If you want an icd9 code that isn't in the cTakes umls dictionary then you can 
find it online ... but that won't do you much good wrt cTakes.

Sean

-Original Message-
From: Assur, Ted [mailto:theodore.as...@providence.org]
Sent: Wednesday, September 04, 2013 11:56 AM
To: dev@ctakes.apache.org
Subject: RE: specificity in selecting EntityMentions when using 
AggregatePlaintextUMLSProcessor

Thanks for looking into this, it's been puzzling me.

On another note, I know the cTAKES dictionary uses ICD9, but I'm not familiar 
with how to access that information: In the example I've described below, where 
would I locate the ICD9 for a specific entity?

Thank you

Ted

-Original Message-
From: Pei Chen [mailto:chen...@apache.org]
Sent: Tuesday, September 03, 2013 7:13 PM
To: dev@ctakes.apache.org
Subject: Re: specificity in selecting EntityMentions when using 
AggregatePlaintextUMLSProcessor

You're right, it should have gotten CIN I- that's a strange one, probably 
needs to be debugged/looked into further...

On Tue, Sep 3, 2013 at 10:05 PM, Miller, Timothy 
timothy.mil...@childrens.harvard.edu wrote:
 Ah. So it will get
 CIN 2 (in SNOMED)
 CIN III (in SNOMED)
 CIN 3 (in SNOMED)

 but the rest are not in SNOMED?

 I wonder why it doesn't get CIN I? It looks like that exists in SNOMED
 (though I don't fully understand what all the symbols mean in the umls
 browser).

 CIN I - Cervical intraepithelial neoplasia 1
 [A3002690/SNOMEDCT/SY/285836003]


 On 09/03/2013 09:55 PM, Pei Chen wrote:
 It has the correct parse (POS, chunks, and lookupwindow)- but some of
 the terms do not exist in SNOMED- CIN 2 - Cervical intraepithelial
 neoplasia 2 [A3002688/SNOMEDCT/SY/285838002] exists but not CIN II.
 CIN III [A965/SNOMEDCT/SY/20365006] also exists that's why it was
 able to perform the lookup successfully.
 Note that CIN II synonyms do exist in other UMLS thesauri such as
 MEDCIN, CCPSS though.  However, the bundled cTAKES dictionaries only
 contain (MeSH, SNOMEDCT, RxNORM, NCI, ICD9) IIRC.

 --Pei

 On Tue, Sep 3, 2013 at 9:44 PM, Miller, Timothy
 timothy.mil...@childrens.harvard.edu wrote:
 That is a good question, Ted!

 I tried it with a simple context: The patient has a CIN III. I'm
 not sure if that is a correct context but I was able to duplicate
 your findings. (Finds a CUI for CIN III but not if you change it to
 CIN II)

 My first thought was that it is the chunker. But the chunker seems
 to get

RE: Sundry; Problem Lists

2013-11-04 Thread Finan, Sean

Excellent!  By the by, I know next to nothing about nlp - I'm just a software 
developer that (for some reason) jumped down this (nlp) particular rabbit hole. 
 When it comes to nlp background, research, state and direction I'm hoping that 
somebody much more knowledgeable than I will jump in.

 after a thorough pubmed search, no one seems to have tried to build problem 
 lists for ACUTE encounters, only as extensions to a past medical history
I'm really glad that we have a truly novel road on which to travel.

 I seem to be interested in a current encounter (the now) [as opposed to]  the 
 longitudinal problem list (the ever).
I think that is great as both a challenge and a possible tool, as is your 
thought on
 prioritization, eg enumeration from most important to least, as well as 
 clumping

I briefly discussed the first idea (acute vs. historical) with another 
physician (after you brought it up) and there was concurrence that such a 
feature would be extremely useful - if not completely necessary for any real 
clinical use of nlp.  I think that if temporal parsing ever becomes finite 
enough with respect to the time of an event relative to the time of the note 
(DocTimeRel) or with proper narrative containers, then this becomes a possible 
use case.  I mention this in a weak attempt to pull the nlpers into the 
discussion ...

 This is probably well known stuff
Bad assumption ... insert emoticon here ...

working back from the known natural history of diseases would possibly be a 
route to a solution.
Now that is a challenge!

Cheers for the inspiration and enthusiasm,
Sean



From: John Green [john.travis.gr...@gmail.com]
Sent: Monday, November 04, 2013 10:45 AM
To: Finan, Sean
Subject: RE: Sundry; Problem Lists

Oh goodness no, I didn't think that at all! I'm so new to the field of NLP, 
anything and everything helps and is appreciated. Heck, I'm just now learning to 
understand Markov chains.

An additional thought: after a thorough pubmed search, no one seems to have 
tried to build problem lists for ACUTE encounters, only as extensions to a past 
medical history. I think this would be a very fruitful avenue. It could easily 
be scored against a gold standard medical resident list for a few hundred 
patients across depth and acuity.

Just thinking out loud, bouncing ideas off those who know more than I!

Jg




RE: cTAKES Groovy...

2013-12-06 Thread Finan, Sean

Good stuff -  Thanks Richard

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Friday, December 06, 2013 3:30 PM
To: 'dev@ctakes.apache.org'
Subject: RE: cTAKES Groovy...

Thanks Richard! That did the trick

I'll create a JIRA and update the script, including adding a comment that 
@GrabResolver is only needed for pre-OpenNLP 1.5.3 and should be removed when 
we upgrade to 1.5.3+, and I'll update CTAKES-191 Update Apache OpenNLP 
dependency to 1.5.3 with a reminder to update the script.

Trunk of cTAKES still uses 1.5.2-incubating

-Original Message-
From: dev-return-2297-Masanz.James=mayo@ctakes.apache.org 
[mailto:dev-return-2297-Masanz.James=mayo@ctakes.apache.org] On Behalf Of 
Richard Eckart de Castilho
Sent: Friday, December 06, 2013 2:12 PM
To: dev@ctakes.apache.org
Subject: Re: cTAKES Groovy...

On 06.12.2013, at 18:01, Masanz, James J. masanz.ja...@mayo.edu wrote:

 I have not solved my issues on my ubuntu server yet where Error 
 grabbing Grapes -- [unresolved dependency: jwnl#jwnl;1.3.3: not found]

This has also already been fixed in OpenNLP 1.5.3, so there must be some 
dependency on OpenNLP 1.5.(1|2)-incubating.

Anyway, you should be able to fix it by adding this to the beginning of your 
Groovy script, in front of the Grapes:

@GrabResolver(name='opennlp.sf.net', 
  root='http://opennlp.sourceforge.net/maven2')

-- Richard

RE: UMLS Env variables suggestion

2014-01-06 Thread Finan, Sean

+1

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] 
Sent: Monday, January 06, 2014 10:57 AM
To: dev@ctakes.apache.org
Subject: RE: UMLS Env variables suggestion

Sounds like a good idea;
we can just update all of the documentation/scripts to use underscore (_), and 
leave the dot (.) in the code to be deprecated for now?
--Pei

 -Original Message-
 From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
 Sent: Saturday, January 04, 2014 10:10 PM
 To: dev@ctakes.apache.org
 Subject: RE: UMLS Env variables suggestion
 
 This went in to 3.1  https://issues.apache.org/jira/browse/CTAKES-164
 
 I agree - the docs need to be updated if there is consensus on the use 
 of this method.  Personally I think that there should be one supported 
 method, not both dot and underscore.  I would prefer that we remove 
 the dot functionality since it is not operational across all 
 environments, but it isn't up to me alone to remove functionality.
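
 To illustrate what supporting both spellings can look like, here is a
 minimal sketch of a lookup that tries the dotted name first and then the
 underscore form.  It is an assumption for discussion only and is not
 claimed to be the code that went in with CTAKES-164; the class name, the
 lookup order and the ctakes_umlsuser spelling are all inferred from this
 thread rather than taken from the source.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical helper; not the actual cTAKES implementation.
final class UmlsSettingLookup {

    /**
     * Resolves a setting such as "ctakes.umlsuser" by trying, in order:
     * a JVM system property with the dotted name, an environment variable
     * with the dotted name, and an environment variable with dots replaced
     * by underscores (the form that bash accepts in export).
     */
    static String lookup(String dottedKey, Properties props, Map<String, String> env) {
        String value = props.getProperty(dottedKey);   // e.g. -Dctakes.umlsuser=...
        if (value == null) {
            value = env.get(dottedKey);                // shells that allow dots
        }
        if (value == null) {
            value = env.get(dottedKey.replace('.', '_')); // e.g. ctakes_umlsuser
        }
        return value;
    }

    /** Convenience overload that reads the real JVM properties and environment. */
    static String lookup(String dottedKey) {
        return lookup(dottedKey, System.getProperties(), System.getenv());
    }
}
```

 Dropping the dot form later would then only mean deleting the middle step.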
 
 -Original Message-
 From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
 Sent: Saturday, January 04, 2014 4:08 PM
 To: dev@ctakes.apache.org
 Cc: dev@ctakes.apache.org
 Subject: Re: UMLS Env variables suggestion
 
 I believe Sean updated the code to also support underscore (_) as 
 well. But the docs just need to be updated...
 
 
  On Jan 4, 2014, at 4:04 PM, Dewful dew...@gmail.com wrote:
 
  In the documentation, in the .sh files to run ctakes;
 
  # If you plan to use the UMLS Resources, set/export env variables
  # export ctakes.umlsuser=[username], ctakes.umlspw=[password]
 
  however, simply trying to
 
  export ctakes.umlsuser=myusername, ctakes.umlspw=mypassword
 
  doesn't work because bash3 doesn't allow dots in the key name and will 
  throw an error
 
  bin/runctakesCVD.sh: line 42: export: `ctakes.umlsuser=username,': 
  not a valid identifier
 
  http://stackoverflow.com/questions/15016403/how-to-export-dot-separated-environment-variables
  explains some solutions
 
  it may be helpful to show how the user can set these easily if they 
  want to set the env variables this way, possibly using one of the
  suggestions in SO.
 
  N

RE: sentence detector newline behavior

2014-01-22 Thread Finan, Sean

On my end it looks like my email was reformatted and some of my -newline- 
markers were removed in those last examples ... 

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] 
Sent: Wednesday, January 22, 2014 3:42 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Thanks James

 but then no typical sentence ending punctuation at the end of the line

Gotcha.  

 So simply using Lines would not suffice in those cases because it 
 would run together sentences where there are more than one on a line

I was actually thinking about something like a Line using -sentence breaks- in 
addition to -newline-.  In other words, a Sentence being what cTakes detects by 
ignoring CR/LF, and Lines being those Sentences subdivided by -newline-.  
Perhaps Line is a horrible moniker.   Regardless, it doesn't solve the 
problem of inappropriately missing punctuation.  I was focused a little more on 
the difference between persistent auto- line wrapping and structured 
information like lists, where the first benefits from Sentence and the second 
from Line.

The Patient has
 been prescribed two
 medications. 

Prescriptions:
  Advil
  Tylenol
  No Aspirin

However, when it comes to the problem that you mention, there is no benefit to 
a Line.

The patient has been seen six times in the past week.  Pain has been 
persistent for ten days Advil and Tylenol have been prescribed
-- 2 sentences, 3 lines

The patient has been seen six times in the past week.  
Pain has been persistent for ten days
Advil and Tylenol have been prescribed
-- 2 sentences, 3 lines

The patient has been seen six times in
 the past week.  Pain has been persistent  for ten days  Advil and Tylenol have 
been prescribed
-- 2 sentences, 5 lines

Nothing can really be done for the last bit where punctuation is missing.
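
To make the Sentence versus Line distinction above concrete, here is a rough sketch. It assumes a deliberately naive sentence rule (a sentence ends at '.', '!' or '?' followed by whitespace, with newlines treated as ordinary whitespace) and is not the cTAKES or OpenNLP detector. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not cTAKES code.
final class SentenceLineSketch {

   /** Sentences: split after terminal punctuation, ignoring CR/LF. */
   static List<String> sentences( final String text ) {
      final List<String> result = new ArrayList<>();
      for ( String candidate : text.split( "(?<=[.!?])\\s+" ) ) {
         if ( !candidate.trim().isEmpty() ) {
            result.add( candidate.trim() );
         }
      }
      return result;
   }

   /** Lines: each Sentence further subdivided at newlines. */
   static List<String> lines( final String text ) {
      final List<String> result = new ArrayList<>();
      for ( String sentence : sentences( text ) ) {
         for ( String line : sentence.split( "\\r?\\n" ) ) {
            if ( !line.trim().isEmpty() ) {
               result.add( line.trim() );
            }
         }
      }
      return result;
   }
}
```

For the note fragment with a newline after "week.", after "ten days" and nothing after "prescribed", this yields 2 sentences and 3 lines. As said above, neither view recovers the sentence boundary where punctuation is missing and there is no newline.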

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Wednesday, January 22, 2014 3:07 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior

I know there are notes where there are multiple sentences on a line, but then 
no typical sentence ending punctuation at the end of the line (or no 
punctuation at all at the end of the line). And in those sections, negation can 
be important.  So simply using Lines would not suffice in those cases because 
it would run together sentences where there are more than one on a line. And 
using sentences alone (as found by OpenNLP 1.5) would not suffice because it 
would run together sentences from different lines.

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 1:33 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Just whistling in the wind here ...

Perhaps before any changes are made to universally toggle cTakes in one 
direction or the other, we can take a poll of when & where 
cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a 
Line (CR/LF delimited PLUS -sentence-)

If some capabilities like negation detection require -lines- then would it make 
more sense to have Sentence ignore -newline- and negation detection itself 
split the Sentence into line items?  If an annotator is interested in list 
items, each of which may be on a distinct -line-, then it can split up the 
Sentence as needed.  I think that James hints that cTakes code already does 
this in some places.  

If a good deal of functionality requires -newline- delimited types, would it 
make sense to introduce a type Line?  If something uses a structured list it 
could iterate through Line types, while something using pure text could iterate 
through Sentence types.  This facilitates section-by-section different 
behavior, does not require any decision on global defaults, and makes data 
selection for training Sentence a nonesuch wrt line breaks.  However, it adds 
to the system and would require a per-use choice decision by developers OR a 
toggle by users (back to the default decision).   Perhaps this has already been 
tried?

Sean

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Wednesday, January 22, 2014 1:06 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior

The only rule I know of is that cTAKES (prior to ytex integration) always 
forces a sentence break at a newline.
This was because the clinical notes cTAKES originally processed never had 
newlines in the middle of a sentence, but did need sentence breaks to occur at 
end of sentence for good negation detection on those notes.
I think Guergana earlier mentioned other EMRs also have this need, but it seems 
to not be ubiquitous.

From others' posts, it seems that we could use an option in cTAKES to turn off 
this forcing of sentence breaks at newlines (or depending on how you look at 
it, an option to turn on the forcing of sentence breaks if we change the 
default behavior)

I think we

RE: YTEX cTAKES 3.1.1 ready

2014-02-06 Thread Finan, Sean

Hi Vijay, 

  I have yet to run across clinical text from a real EMR where newlines 
 represent the end of a sentence

Since James pointed out this possibility a couple weeks ago, I have kept my 
eyes open.  The problem is pretty ubiquitous in a corpus that I'm working with 
right now.  I just opened the first note and gave it a count ... 95 lines 
total, 9 are sentence/phrase (lacking punctuation) endings.  This is not 
including lists, which comprise about half of the note.
One possible conjoinment was "Will consider [...] biopsy\nGiven [...]".  
Depending upon how cTakes deals with it, the meaning could change drastically.

 I believe cTAKES absolutely has to support sentences with newlines within them

Yes, cTakes should do so, but I hope that you aren't suggesting that it only 
support such a structure.

Where is that easy button?

-Original Message-
From: vijay garla [mailto:vnga...@gmail.com] 
Sent: Thursday, February 06, 2014 10:31 AM
To: dev@ctakes.apache.org
Cc: ytex-us...@googlegroups.com; ctakes-...@incubator.apache.org; 
vlad.valtchi...@gmail.com
Subject: Re: YTEX cTAKES 3.1.1 ready

I believe it is worth migrating to trunk.

Note that the sentence detector is also complementary - the existing ctakes 
sentence detector is unchanged - users can choose which sentence detector to 
use.  There are changes to assertion & dependency parsing to support sentences 
without newlines, and that works with both sentence detectors.

I believe cTAKES absolutely has to support sentences with newlines within them 
- I have yet to run across clinical text from a real EMR where newlines 
represent the end of a sentence - the changes to assertion & dependency parsing 
will have to be done at some point.

-vj


On Thu, Feb 6, 2014 at 10:19 AM, Chen, Pei
pei.c...@childrens.harvard.edu wrote:

 VJ,
 Aside from the changes to the existing cTAKES code (sentence detector,
 etc.) [which we could leave out if it's still being debated], Do you 
 think it's worth migrating the ytex code to trunk at this point?
  As you mentioned earlier, it's largely complementary.
 [I was just thinking of saving effort to maintain the separate branch 
 and for simplicity for dev...]

 --Pei

  -Original Message-
  From: vijay garla [mailto:vnga...@gmail.com]
  Sent: Wednesday, February 05, 2014 9:30 PM
  To: ytex-us...@googlegroups.com; ctakes-...@incubator.apache.org; 
  vlad.valtchi...@gmail.com
  Subject: Re: YTEX cTAKES 3.1.1 ready
 
  Hi Vlad,
 
  I Updated the umls install guide; see
  https://code.google.com/p/ytex/wiki/UMLS_SQL_SERVER_3_1
 
  I would prefer to add the docs in the ctakes confluence, but as far 
  as I
 can
  tell, I don't have write access there - can somebody give me write
 privileges
  on the ctakes confluence site?
 
  There was a bug in the umls install; copy
  https://svn.apache.org/repos/asf/ctakes/branches/ytex/ctakes-ytex/scripts/data/build.xml
  over
  the corresponding file in your ctakes-3.1.2 install
  (CTAKES_HOME\bin\ctakes-ytex\scripts\data) and you should be set.  
  The import is currently running on the UMLS 2013AA (I assume this 
  will
 complete
  without issues as long as the umls schema hasn't changed from 2012).
 
  what trial and error did you have to go through to build the distro?
 
  -vj
 
 
  On Wed, Feb 5, 2014 at 5:33 PM, vijay garla vnga...@gmail.com wrote:
 
   Hi Vlad,
  
   sorry that the instructions aren't clear.
  
   re 1) What I am trying to say is install 
   apache-ctakes-3.2.0-snapshot as usual (this is unchanged from 
   3.1.1).  After that you still have to apply the lib and resources 
   (these are things that cannot be distributed via apache).
  
   re 2) Yes, I need to update those docs.  Hopefully will get to 
   that at some point.  However, I assume you already have a UMLS DB 
   (also assume SQL Server).  If you can't/don't want to use your 
   existing umls DB, please tell me.  Then I'll prioritize upgrading 
   the doc on importing the umls tables (the scripts are there).
  
   best,
  
   VJ
  
  
   On Wed, Feb 5, 2014 at 4:44 PM, vlad.valtchi...@gmail.com wrote:
  
   Hi VJ-
  
   so, with trial and error we were able to make the distribution and 
   now have the apache-ctakes-3.1.2-SNAPSHOT-bin.zip archive.
  
   Here's what's unclear.
  
   1. Is now this the only (combined) thing that you need for ctakes
   3.1.1 + Ytex?
   the current documentation
   (https://code.google.com/p/ytex/wiki/Installation_cTAKES_3_1?ts=1388793998&updated=Installation_cTAKES_3_1)
   which most probably is outdated, talks about installing cTakes 
   3.1.1 first and then applying 2 SNAPSHOT archives (downloadable) 
   , lib and resources.
   This is a confusion point.
  
   2. The directions to import UMLS subset are then outdated as well.
   Maybe one should use the old version (ctakes 2.5 and ytex 0.8) to 
   import the RRF files for the UMLS subset and then just use the 
   resulting db. Thoughts?
  
   Thanks,
   Vlad Valtchinov
   Brigham Rad
  
  
   On Thursday,

RE: YTEX cTAKES 3.1.1 ready

2014-02-06 Thread Finan, Sean

Right, got it.  I just wanted to let you know that some EMR notes -do- require 
sentence splitting at newline characters.

-Original Message-
From: vijay garla [mailto:vnga...@gmail.com] 
Sent: Thursday, February 06, 2014 1:06 PM
To: dev@ctakes.apache.org
Cc: ytex-us...@googlegroups.com; ctakes-...@incubator.apache.org; 
vlad.valtchi...@gmail.com
Subject: Re: YTEX cTAKES 3.1.1 ready

The cTAKES sentence detector is not changed in the YTEX branch.  The YTEX 
branch has an *additional* sentence detector that does not automatically split 
sentences on newlines - users can use this if they like.

-vj

On Thu, Feb 6, 2014 at 1:01 PM, Finan, Sean  sean.fi...@childrens.harvard.edu 
wrote:

 Hi Vijay,

   I have yet to run across clinical text from a real EMR where 
  newlines
 represent the end of a sentence

 Since James pointed out this possibility a couple weeks ago, I have 
 kept my eyes open.  The problem is pretty ubiquitous in a corpus that 
 I'm working with right now.  I just opened the first note and gave it 
 a count ... 95 lines total, 9 are sentence/phrase (lacking punctuation) 
 endings.
  This is not including lists, which comprise about half of the note.
 One possible conjoinment was "Will consider [...] biopsy\nGiven [...]".
  Depending upon how cTakes deals with it, the meaning could change 
 drastically.

  I believe cTAKES absolutely has to support sentences with newlines
 within them

 Yes, cTakes should do so, but I hope that you aren't suggesting that 
 it only support such a structure.

 Where is that easy button?

 -Original Message-
 From: vijay garla [mailto:vnga...@gmail.com]
 Sent: Thursday, February 06, 2014 10:31 AM
 To: dev@ctakes.apache.org
 Cc: ytex-us...@googlegroups.com; ctakes-...@incubator.apache.org; 
 vlad.valtchi...@gmail.com
 Subject: Re: YTEX cTAKES 3.1.1 ready

 I believe it is worth migrating to trunk.

 Note that the sentence detector is also complementary - the existing 
 ctakes sentence detector is unchanged - users can choose which 
 sentence detector to use.  There are changes to assertion & dependency 
 parsing to support sentences without newlines, and that works with 
 both sentence detectors.

 I believe cTAKES absolutely has to support sentences with newlines 
 within them - I have yet to run across clinical text from a real EMR 
 where newlines represent the end of a sentence - the changes to 
 assertion & dependency parsing will have to be done at some point.

 -vj

 On Thu, Feb 6, 2014 at 10:19 AM, Chen, Pei
 pei.c...@childrens.harvard.edu wrote:

  VJ,
  Aside from the changes to the existing cTAKES code (sentence 
  detector,
  etc.) [which we could leave out if it's still being debated], Do you 
  think it's worth migrating the ytex code to trunk at this point?
   As you mentioned earlier, it's largely complementary.
  [I was just thinking of saving effort to maintain the separate 
  branch and for simplicity for dev...]

  --Pei

   -Original Message-
   From: vijay garla [mailto:vnga...@gmail.com]
   Sent: Wednesday, February 05, 2014 9:30 PM
   To: ytex-us...@googlegroups.com; ctakes-...@incubator.apache.org; 
   vlad.valtchi...@gmail.com
   Subject: Re: YTEX cTAKES 3.1.1 ready

   Hi Vlad,

   I Updated the umls install guide; see
   https://code.google.com/p/ytex/wiki/UMLS_SQL_SERVER_3_1

   I would prefer to add the docs in the ctakes confluence, but as 
   far as I
  can
   tell, I don't have write access there - can somebody give me write
  privileges
   on the ctakes confluence site?

   There was a bug in the umls install; copy
   https://svn.apache.org/repos/asf/ctakes/branches/ytex/ctakes-ytex/scripts/data/build.xml
   over
   the corresponding file in your ctakes-3.1.2 install
   (CTAKES_HOME\bin\ctakes-ytex\scripts\data) and you should be set.
   The import is currently running on the UMLS 2013AA (I assume this 
   will
  complete
   without issues as long as the umls schema hasn't changed from 2012).

   what trial and error did you have to go through to build the distro?

   -vj

   On Wed, Feb 5, 2014 at 5:33 PM, vijay garla vnga...@gmail.com wrote:

Hi Vlad,

sorry that the instructions aren't clear.

re 1) What I am trying to say is install 
apache-ctakes-3.2.0-snapshot as usual (this is unchanged from 
3.1.1).  After that you still have to apply the lib and 
resources (these are things that cannot be distributed via apache).

re 2) Yes, I need to update those docs.  Hopefully will get to 
that at some point.  However, I assume you already have a UMLS 
DB (also assume SQL Server).  If you can't/don't want to use 
your existing umls DB, please tell me.  Then I'll prioritize 
upgrading the doc on importing the umls tables (the scripts are there).

best,

VJ

On Wed, Feb 5, 2014 at 4:44 PM, vlad.valtchi...@gmail.com wrote:

Hi VJ-

so, with trial and error we were able to make the distribution and 
now

RE: Update: UMLS, cTAKES, and UIMA for applications in genomics

2014-02-24 Thread Finan, Sean

Hi Andy,

We have been using Uima-as here, but with no third-party wrappings.  We have 
set it up to run in standalone and lsf cluster environments, but everything is 
out-of-box with a few custom bash scripts to set environment settings, etc.

Sean

-Original Message-
From: andy mcmurry [mailto:mcmurry.a...@gmail.com] 
Sent: Monday, February 24, 2014 2:16 PM
To: dev@ctakes.apache.org
Subject: Update: UMLS, cTAKES, and UIMA for applications in genomics

Hi all:

I'm writing to update about my efforts to make a cTAKES out of the box VM 
with UMLS support. My specific use cases are for annotating DNA test results, 
so both publication text and patient notes are important towards this goal.

cTAKES VM.
=
Wrote bash scripts to download and install cTAKES on Ubuntu.
Will provide interfaces for REST endpoints for each service (Clojure).


UMLS Services

Wrote Clojure/REST services for invoking the MetamapAPI and looking up 
concepts/synonym entries in the UMLS. Will do the same for cTAKES in the 
upcoming months.


Semantic Representation

Is anyone else using UMLS SemRep http://semrep.nlm.nih.gov/?
These annotations provide secondary evidence for the cTAKES medication  and 
co-reference parsers, as well as additional annotations for other semantic 
types.


Genetic variant parser (HGVS)

Reece Hart released a standard HGVS parser https://bitbucket.org/invitae/hgvs 
which I intend to include in the VM distribution as an optional UIMA pipeline 
(callout REST service).


Scalability: UIMA Async Scaleout with Fit
=
I'm planning on using Clojuima https://github.com/jimpil/clojuima to scale at 
my company.
Is everyone else using UIMA-AS as well, or planning to?

RE: How to add a new dictionary database to cTAKES

2014-02-28 Thread Finan, Sean

Hi Abhishek,

You have some interesting timing ...
I can give you the xml specifications that you require if you send me the 
format of your dictionary.

Since you are new to the current dictionary module setup, I might also have a 
simpler solution for you ...

A couple of days ago I checked a new module into Sandbox called 
ctakes-dictionary-lookup2 (how novel a name).  It is a complete replacement of 
the current dictionary lookup module, but both can sit side-by-side in your 
local trunk sandbox or build.  It has an example descriptor that tells it to 
read a bar-separated value file (BSV) as a dictionary, storing it (indexed) in 
memory for fast lookup.  There is an example dictionary and xml descriptor for 
that dictionary.  It accepts 2 or 3 column files in the format CUI|Text or 
CUI|TUI|Text.  It automatically detects the number of columns, but they must be 
in that order.  It also does not need the text fields to be tokenized, allowing 
it to accept "Tumor, malignant" as well as "tumor , malignant" as it will 
perform the tokenization upon reading the file.  
As the dictionary will be stored in-memory it should not be huge.  If you do 
have a very large number of terms (50k) then I recommend an hsql db.  The new 
module will take an hsql db with the fixed field names CUI, TUI, RINDEX, 
TCOUNT, TEXT, RWORD.  I will explain what those mean in some documentation that 
I plan to check into sandbox later today, but I can help you build an hsql 
dictionary db ...
Yesterday I checked into sandbox a project named dictionarytool.  It is 
source-only, but I can give you a jar if you want one.  Out-of-the-box it will 
build various dictionaries from a UMLS download.  It can build BSV, Hsql (new 
format) and Hsql (current format) to be used by the new or current dictionary 
lookup modules.
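
As a rough illustration of the bar-separated format described above (CUI|Text or CUI|TUI|Text, column count detected per line, text tokenized on read), here is a small sketch. It is not the ctakes-dictionary-lookup2 reader: the class name is hypothetical, the CUI in the usage note is made up, and the tokenization rule (lowercase, punctuation split off as its own token) is an assumption chosen only so that "Tumor, malignant" and "tumor , malignant" come out the same.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the actual module code.
final class BsvDictionarySketch {

   /** One dictionary row; tui is null for 2-column rows. */
   static final class Term {
      final String cui;
      final String tui;
      final List<String> tokens;

      Term( final String cui, final String tui, final List<String> tokens ) {
         this.cui = cui;
         this.tui = tui;
         this.tokens = tokens;
      }
   }

   /** Assumed rule: lowercase, then surround punctuation with spaces and split on whitespace. */
   static List<String> tokenize( final String text ) {
      final String spaced = text.toLowerCase().replaceAll( "(\\p{Punct})", " $1 " ).trim();
      if ( spaced.isEmpty() ) {
         return new ArrayList<>();
      }
      return Arrays.asList( spaced.split( "\\s+" ) );
   }

   /** Accepts CUI|Text or CUI|TUI|Text, in that column order. */
   static Term parseLine( final String line ) {
      final String[] columns = line.split( "\\|", -1 );
      if ( columns.length == 2 ) {
         return new Term( columns[ 0 ].trim(), null, tokenize( columns[ 1 ] ) );
      }
      if ( columns.length == 3 ) {
         return new Term( columns[ 0 ].trim(), columns[ 1 ].trim(), tokenize( columns[ 2 ] ) );
      }
      throw new IllegalArgumentException( "Expected 2 or 3 bar-separated columns: " + line );
   }
}
```

With this sketch, parseLine( "C0000001|Tumor, malignant" ) and parseLine( "C0000001|T191|tumor , malignant" ) produce the same token list, which is the property the untokenized-input remark above relies on.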

This devlist announcement is a little premature on my part.  I will not get 
usage documentation into sandbox for a day or two, but I can send you copies as 
I go if you are in a hurry, or just give you xml snippets for the current 
module descriptors.  If you send the format of your dictionary then that can be 
done quickly.  I just wanted to let you know that there is another option wrt 
dictionary lookup.

Sean

-Original Message-
From: Abhishek De [mailto:abhishek...@alumnux.com] 
Sent: Friday, February 28, 2014 6:58 AM
To: dev@ctakes.apache.org
Subject: How to add a new dictionary database to cTAKES

 

Hi, 

How do I add a new database to the cTAKES pipeline to perform lookup from? How 
do I specify what columns to look up and how to annotate the text with the 
returned hits? I have gone through the DictionaryLookupAnnotatorDB.xml and 
LookupDesc_Db.xml files. However, I could not understand the meanings of the 
terms like lookupField, metaField, maxPermutationLevel and 
exclusionTags. If I add a new database, I need to configure this xml file 
properly. Please guide me regarding these problems. 

Thanks and Regards, 

Abhishek De

RE: getSeverity etc. for relation extractor

2014-03-21 Thread Finan, Sean

Hi James,

It is starting to resemble a row of falling dominoes ...

I ran with an incubator version of the location of extractor and it did seem 
to find multiple locations for a single d/d.  Functionality may have changed 
since then.

Thanks for all of your attention to this topic.

Sean

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Friday, March 21, 2014 4:34 PM
To: 'dev@ctakes.apache.org'
Subject: RE: getSeverity etc. for relation extractor

Running from trunk, I don't get any relations for "Rash on arm and leg" :(

If I change the text to "pain in arm and leg" I get one LocationOfTextRelation 
annotation with arg1=SignSymptomMention (pain) and arg2=AnatomicalSiteMention 
(arm)

Does the relation extractor support creating a 2nd relation involving pain - 
the one between pain and leg (is this just an unfortunate choice of example) or 
does the relation extractor need enhancement before it would create multiple 
location_of for a single SignSymptomMention or DiseaseDisorderMention?

BTW, I will have to debug the setting of bodyLocation in the code because even 
for pain in arm, when running from trunk, the LocationOfTextRelation 
annotation is being created, but the bodyLocation within the SignSymptomMention 
is not being set because the code in TemplateFillerAnnotator expects arg1 and 
arg2 to be swapped from what they currently are. I'll take a look at what it 
was in cTAKES 3.1 and find out if this is a bug in TemplateFillerAnnotator or 
something else.

-- James

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Friday, March 21, 2014 12:30 PM
To: dev@ctakes.apache.org
Subject: RE: getSeverity etc. for relation extractor

 until we have a definite, well-defined need (from a user).

Rash on arm and leg

  I don't follow what you mean by your item B) below

[Rash].getLocationRelation()  [Rash : Arm]
[Rash].getLocation()  [Arm]



-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Friday, March 21, 2014 12:58 PM
To: 'dev@ctakes.apache.org'
Subject: RE: getSeverity etc. for relation extractor

Yes, if there is more than one severity or location relation for a given 
identified annotation, currently the template filler does just take the last 
severity and or last location.

I suggest not changing the type system to allow a list (FSArray), or at least 
holding off until we have a definite, well-defined need (from a user). 

I think instead, ideally, we would make the template filler smarter at picking 
which severity / which location  when there is more than one for the given 
identified annotation. Therefore I'd rather not make it a list now, when in the 
long run I think it should be a single value. And in the meantime if someone 
has a need, they can look through the relations.

Pei, I don't follow what you mean by your item B) below

-- James
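
The difference between the current single-valued slot ("take the last one") and looking through all the relations can be sketched in plain Java. This deliberately avoids the real type system: the Relation class below is a hypothetical stand-in for a location_of relation, with strings in place of annotations, so it shows the behavior only and not the cTAKES API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types, not the cTAKES type system.
final class LocationRelationSketch {

   /** Simplified location_of relation: an entity and one anatomical site. */
   static final class Relation {
      final String entity;
      final String site;

      Relation( final String entity, final String site ) {
         this.entity = entity;
         this.site = site;
      }
   }

   /** Single-valued slot: a later relation overwrites an earlier one. */
   static Map<String, String> lastLocation( final List<Relation> relations ) {
      final Map<String, String> result = new LinkedHashMap<>();
      for ( Relation relation : relations ) {
         result.put( relation.entity, relation.site );
      }
      return result;
   }

   /** Many-to-one: every site found for an entity is kept, in order. */
   static Map<String, List<String>> allLocations( final List<Relation> relations ) {
      final Map<String, List<String>> result = new LinkedHashMap<>();
      for ( Relation relation : relations ) {
         result.computeIfAbsent( relation.entity, key -> new ArrayList<>() ).add( relation.site );
      }
      return result;
   }
}
```

Given relations (rash, arm) and (rash, leg), lastLocation keeps only leg for rash, while allLocations keeps both arm and leg. Which of the two a consumer wants is exactly the open question in this thread.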

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
Sent: Thursday, March 20, 2014 2:03 PM
To: dev@ctakes.apache.org
Subject: RE: getSeverity etc. for relation extractor

Awesome!
Thanks James...

On Sean's point about many-to-one relationships.  I think the current type 
system only supports 1 degree_of and severity_of for each IdentifiedAnnotation? 
 
Does the TemplateFiller component currently just take the last one in the list 
currently?
Should we modify the type system to support this in the future- something like 
the below?
A) Support many-to-one
B) Separate out getting the relations and getting the actual identified 
annotations.

One suggestion would be:
IdentifiedAnnotation.getBodyLocations(): FSArray<IdentifiedAnnotation>
IdentifiedAnnotation.getBodyLocationRelations(): FSArray<LocationOfTextRelation>
IdentifiedAnnotation.getSeverity(): FSArray<Modifier>
IdentifiedAnnotation.getSeverityRelations(): FSArray<DegreeOfTextRelation>

What do others think?
--Pei

 -Original Message-
 From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
 Sent: Thursday, March 20, 2014 2:50 PM
 To: 'dev@ctakes.apache.org'
 Subject: RE: getSeverity etc. for relation extractor
 
 I saw the jira was assigned to me and had a few minutes so I 
 implemented a fix and committed.
 It was more than just the one line.
 The name of the index in which the binary text relations has changed 
 (now separate indexes instead of one for all binary text relations) so 
 I had to change which index was searched.
 
 -Original Message-
 From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
 Sent: Thursday, March 20, 2014 9:28 AM
 To: dev@ctakes.apache.org
 Subject: RE: getSeverity etc. for relation extractor
 
 Thanks for confirming, James.  It seems like a bug...
 Chase,
 if you can confirm that adding ddm.setSeverity(degreeOfTextRelation); works 
 for you, I can commit the changes in trunk.
 
 Which also brings up some interesting points:
 1) Should we populate IdentifiedAnnotation.severity() and
 bodylocationof() Directly

RE: getSeverity etc. for relation extractor

2014-03-24 Thread Finan, Sean

Hi James, I don't have an exact phrase to use.  We used the location_of with a 
brain aneurysm project, but the corpus is elsewhere now.  However, it would tag 
things such as [aneurysm] : [middle cerebral artery] and [aneurysm] : [cerebral 
artery] - which is different from arm/leg, but an example of 2 locations for 
one entity.  

From: Masanz, James J. [masanz.ja...@mayo.edu]
Sent: Monday, March 24, 2014 11:05 AM
To: 'dev@ctakes.apache.org'
Subject: RE: getSeverity etc. for relation extractor

I ran 3.1 against "pain in arm and leg" and I get just one location_of 
relation.
And again no location_of relations for "rash on arm and leg"

Sean, what was the exact phrase you used with the  incubator version? (or was 
that a while ago and lost)

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Friday, March 21, 2014 3:59 PM
To: dev@ctakes.apache.org
Subject: RE: getSeverity etc. for relation extractor

Hi James,

It is starting to resemble a row of falling dominoes ...

I ran with an incubator version of the location of extractor and it did seem 
to find multiple locations for a single d/d.  Functionality may have changed 
since then.

Thanks for all of your attention to this topic.

Sean

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Friday, March 21, 2014 4:34 PM
To: 'dev@ctakes.apache.org'
Subject: RE: getSeverity etc. for relation extractor

Running from trunk, I don't get any relations for "Rash on arm and leg" :(

If I change the text to "pain in arm and leg" I get one LocationOfTextRelation 
annotation with arg1=SignSymptomMention (pain) and arg2=AnatomicalSiteMention 
(arm)

Does the relation extractor support creating a 2nd relation involving pain - 
the one between pain and leg (is this just an unfortunate choice of example) or 
does the relation extractor need enhancement before it would create multiple 
location_of for a single SignSymptomMention or DiseaseDisorderMention?

BTW, I will have to debug the setting of bodyLocation in the code because even 
for pain in arm, when running from trunk, the LocationOfTextRelation 
annotation is being created, but the bodyLocation within the SignSymptomMention 
is not being set because the code in TemplateFillerAnnotator expects arg1 and 
arg2 to be swapped from what they currently are. I'll take a look at what it 
was in cTAKES 3.1 and find out if this is a bug in TemplateFillerAnnotator or 
something else.

-- James

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Friday, March 21, 2014 12:30 PM
To: dev@ctakes.apache.org
Subject: RE: getSeverity etc. for relation extractor

 until we have a definite, well-defined need (from a user).

Rash on arm and leg

  I don't follow what you mean by your item B) below

[Rash].getLocationRelation()  [Rash : Arm]
[Rash].getLocation()  [Arm]

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Friday, March 21, 2014 12:58 PM
To: 'dev@ctakes.apache.org'
Subject: RE: getSeverity etc. for relation extractor

Yes, if there is more than one severity or location relation for a given 
identified annotation, currently the template filler does just take the last 
severity and or last location.

I suggest not changing the type system to allow a list (FSArray), or at least 
holding off until we have a definite, well-defined need (from a user).

I think instead, ideally, we would make the template filler smarter at picking 
which severity / which location  when there is more than one for the given 
identified annotation. Therefore I'd rather not make it a list now, when in the 
long run I think it should be a single value. And in the meantime if someone 
has a need, they can look through the relations.

Pei, I don't follow what you mean by your item B) below

-- James

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
Sent: Thursday, March 20, 2014 2:03 PM
To: dev@ctakes.apache.org
Subject: RE: getSeverity etc. for relation extractor

Awesome!
Thanks James...

On Sean's point about many-to-one relationships.  I think the current type 
system only supports 1 degree_of and severity_of for each IdentifiedAnnotation?
Does the TemplateFiller component currently just take the last one in the list 
currently?
Should we modify the type system to support this in the future- something like 
the below?
A) Support many-to-one
B) Separate out getting the relations and getting the actual identified 
annotations.

One suggestion would be:
IdentifiedAnnotation.getBodyLocations(): FSArray<IdentifiedAnnotation>
IdentifiedAnnotation.getBodyLocationRelations(): FSArray<LocationOfTextRelation>
IdentifiedAnnotation.getSeverity(): FSArray<Modifier>
IdentifiedAnnotation.getSeverityRelations(): FSArray<DegreeOfTextRelation>

What do others think?
--Pei

 -Original Message-
 From: Masanz, James J. [mailto:masanz.ja...@mayo.edu

RE: Temporal Information Extraction package has compile time error

2014-03-27 Thread Finan, Sean

Hi Manu,

Speaking for the developers of that module, we are excited that you and others 
in the community are starting to show so much interest in temporal information 
extraction - enough to attempt builds and trial runs.

The Temporal module is still in an academic experimental phase and there are 
some necessary models and custom third-party library extensions that are 
necessary to build but have not or cannot be checked into the cTakes 
repository.  We hope to have Temporal ready for full build and use in the 
upcoming cTakes release, but until that time it will remain relatively unusable 
by the wider cTakes community.  I apologize if its placement in trunk caused 
confusion.

All of that having been written, if you have particular ideas on 
implementation, usage or anything else, please let us know.

Sean

-Original Message-
From: Manu Sikka [mailto:manusi...@hotmail.com] 
Sent: Wednesday, March 26, 2014 11:15 PM
To: dev@ctakes.apache.org
Subject: Temporal Information Extraction package has compile time error







Temporal Information Extraction package has compile time error
Please look into it

RE: errors when run BagOfCUIsGenerator.java

2014-04-16 Thread Finan, Sean

Try to open  https://uts-ws.nlm.nih.gov 
If that works then try 
https://uts-ws.nlm.nih.gov/restful/isValidctakes.umlsuser and see if you get a 
message like
"This XML file does not appear to have any style information associated with 
it. The document tree is shown below."


If that works and you are comfortable with the code, try with
umlsaddr : https://uts-ws.nlm.nih.gov/restful/isValidctakes.umlsuser
vendor : NLM-6515182895


   // Needs java.io.* and java.net.{URL, URLConnection, URLEncoder};
   // LOGGER is the logger of the enclosing class.
   /**
    * @param umlsaddr -
    * @param vendor   -
    * @param username -
    * @param password -
    * @return true if the server at umlsaddr approves of the vendor, user,
    * password combination
    */
   public static boolean isValidUMLSUser( final String umlsaddr, final String vendor,
                                          final String username, final String password ) {
      String data;
      try {
         // Build the form-encoded POST body: licenseCode=...&user=...&password=...
         data = URLEncoder.encode( "licenseCode", "UTF-8" ) + "="
               + URLEncoder.encode( vendor, "UTF-8" );
         data += "&" + URLEncoder.encode( "user", "UTF-8" ) + "="
               + URLEncoder.encode( username, "UTF-8" );
         data += "&" + URLEncoder.encode( "password", "UTF-8" ) + "="
               + URLEncoder.encode( password, "UTF-8" );
      } catch ( UnsupportedEncodingException unseE ) {
         LOGGER.error( "Could not encode URL for " + username
               + " with vendor license " + vendor );
         return false;
      }
      try {
         final URL url = new URL( umlsaddr );
         final URLConnection connection = url.openConnection();
         connection.setDoOutput( true );
         final OutputStreamWriter writer
               = new OutputStreamWriter( connection.getOutputStream() );
         writer.write( data );
         writer.flush();
         boolean result = false;
         final BufferedReader reader = new BufferedReader(
               new InputStreamReader( connection.getInputStream() ) );
         String line;
         // Read until the first blank line; the last non-blank line decides.
         while ( (line = reader.readLine()) != null ) {
            final String trimline = line.trim();
            if ( trimline.isEmpty() ) {
               break;
            }
            result = trimline.equalsIgnoreCase( "<Result>true</Result>" );
         }
         writer.close();
         reader.close();
         return result;
      } catch ( IOException ioE ) {
         LOGGER.error( ioE.getMessage() );
         return false;
      }
   }
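
If you want to check the two pure-logic pieces of the method above without touching the network, here is a minimal sketch.  It assumes the field names licenseCode, user and password seen in the encode calls, and it assumes the approval line is <Result>true</Result> (the archive appears to have stripped quotes and angle brackets).  The class name UmlsFormSketch and the user/password values are made up for illustration; this is not cTakes code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class UmlsFormSketch {

   // URL-encodes one key or value as UTF-8.
   private static String enc( final String text ) {
      try {
         return URLEncoder.encode( text, "UTF-8" );
      } catch ( UnsupportedEncodingException e ) {
         // UTF-8 is always available on a JVM, so this should not happen.
         throw new IllegalStateException( e );
      }
   }

   // Joins the fields as key1=value1&key2=value2, encoding each part.
   static String encodeForm( final Map<String, String> fields ) {
      final StringBuilder sb = new StringBuilder();
      for ( Map.Entry<String, String> field : fields.entrySet() ) {
         if ( sb.length() > 0 ) {
            sb.append( '&' );
         }
         sb.append( enc( field.getKey() ) ).append( '=' ).append( enc( field.getValue() ) );
      }
      return sb.toString();
   }

   // True only for the (assumed) approval line, ignoring case and padding.
   static boolean isApproval( final String responseLine ) {
      return responseLine != null
            && responseLine.trim().equalsIgnoreCase( "<Result>true</Result>" );
   }

   public static void main( final String[] args ) {
      final Map<String, String> fields = new LinkedHashMap<>();
      fields.put( "licenseCode", "NLM-6515182895" );
      fields.put( "user", "some user" );      // placeholder value
      fields.put( "password", "p&ss=word" );  // placeholder value
      System.out.println( encodeForm( fields ) );
      System.out.println( isApproval( "  <result>TRUE</result> " ) );
   }
}
```

The point of encoding every key and value separately is that characters such as & and = inside a password cannot break the form body.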



-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] 
Sent: Wednesday, April 16, 2014 1:25 PM
To: dev@ctakes.apache.org
Subject: RE: errors when run BagOfCUIsGenerator.java

Ying,
Are you behind a proxy or firewall?
If you're trying to use the umls resources, it attempts to make a call to their 
umls service to validate your credentials.
--Pei

 -Original Message-
 From: Liu, Ying [mailto:l...@advisory.com]
 Sent: Wednesday, April 16, 2014 1:13 PM
 To: dev@ctakes.apache.org
 Subject: errors when run BagOfCUIsGenerator.java
 
 BagOfCUIsGenerator.java failed when I ran it. The following is the 
 error information. Thanks for your help.
 Ying
 
 
 
 Exception in thread "main"
 org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator failed.  (Descriptor: file:/C:/Users/Ying/workspacectakes/ctakes/ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml)
 at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:252)
 at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:156)
 at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
 at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
 at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
 at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:387)
 at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:254)
 at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:431)
 at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:375)
 at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:185)
 at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
 at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
 at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
 at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:354)
 at org.uimafit.factory.AnalysisEngineFactory.createAnalysisEngineFromPath(AnalysisEngineFactory.java:147)

RE: lvg entries

2014-04-17 Thread Finan, Sean

Those variants are not used by the dictionary lookup.  I did look at them to 
see if it was worthwhile for the new dictionary, but they are all over the 
place so I passed.  

From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
Sent: Thursday, April 17, 2014 1:25 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Pei and I had a similar discussion in person -- mapping from lexical
variants to a stem might be useful. Pei also mentioned that one intended
use might have been searching the dictionary with lexical variants, but
I don't think that is done. Looking at the precision of the variants, I
think it's highly unlikely the speed tradeoff would be worth any
improvements in recall.

Finally, at least in eclipse doing a search on references to the method
to retrieve the lemma entries turns up nothing.

Tim

On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:
 I don’t know of any applications within cTAKES that make use of this… The 
 reverse (mapping from these “variants” to the normal form) may be useful 
 though.

 Dima

 On Apr 17, 2014, at 11:50, Miller, Timothy 
 timothy.mil...@childrens.harvard.edu wrote:

 Sure, just as an example, I gave it a note with about 1000 words. It
 generates 11500 NonEmptyFSList elements (each is basically one lexical
 variant).

 For the word symptomatic, these are the first 10 of 20 lexical variants:
 Symptomaticer/JJ
 Symptomaticer/RB
 Symptomaticed/VB
 Symptomaticcing/VB
 Symptomatics/VB
 Symptomatics/NN
 Symptomaticked/VB
 Symptomatic/VB
 Symptomatic/JJ
 Symptomatic/RB

 Tim

 On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
 Tim, this is a very interesting observation. Could you please send a few 
 examples of what LVG generates? Both sensical and non :)

 Dima

 On Apr 17, 2014, at 11:28, Miller, Timothy 
 timothy.mil...@childrens.harvard.edu wrote:

 The LVG annotator creates an enormous number of lemmas for every
 WordToken in the CAS, and I'm wondering what the original purpose was? I
 think this is probably a minor bottleneck for speed but mostly a pretty
 big space hog (at least 50% of the space of xmi files in my tests).

 As of right now I'm not sure if any downstream components are using
 these lemmas, and on a manual inspection the precision seems to be
 pretty abysmal (meaning most of them are nonsensical as lexical
 variants), so as I said, just wondering if we can revisit why cTAKES
 generates so many and whether that component can be optimized.

 Thanks
 Tim

RE: lvg entries

2014-04-18 Thread Finan, Sean

+1 false

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Friday, April 18, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Thanks for tracking that down Andy.

I am making a pass at UimaFit-izing the configuration parameters for all the 
annotators in the default pipeline, before I create the static factory methods 
like we recently discussed. Should I go ahead and change this to make default 
behavior be false?

Tim

On 04/18/2014 12:47 AM, andy mcmurry wrote:
 There is a lot of config handling; maybe PostLemmas is being set to 
 true, or configInit() is not setting up the NLM wrapper correctly.

 ctakes-lvg *README*
 Note: as distributed, PostLemmas is set to false.  This is done to 
 reduce the size of the CAS.
 Set PostLemmas to true to have org.apache.ctakes.typesystem.type.Lemma
 annotations added to the CAS.

 *LvgAnnotator.xml *
 PostLemmas = True

 *LvgAnnotator.java*
 if (postLemmas) {
  lvgResource.getLvgLex()
 }

 On Thu, Apr 17, 2014 at 3:23 PM, Masanz, James J. 
 masanz.ja...@mayo.edu wrote:

 The normalizedForm field is filled in. It is used by dictionary lookup.

 So, for example, if the dictionary contains "lymph node" but not 
 "lymph nodes", a document with the text "lymph nodes" would match the 
 dictionary entry "lymph node" because "node", being the normalized 
 form of "nodes", would be used when searching dictionary entries (in 
 addition to searching dictionary entries for "nodes").
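
 A minimal sketch of the behavior described above, assuming a plain set of dictionary strings and a token-to-normalized-form map.  The class and method names are hypothetical; the real dictionary lookup works on lookup windows and is more involved than this.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not the cTAKES dictionary lookup itself.
final class NormalizedLookupSketch {

   // True if the tokens match a dictionary entry as written, or after
   // each token has been replaced by its normalized form (when it has one).
   static boolean matches( final List<String> tokens,
                           final Map<String, String> normalizedForms,
                           final Set<String> dictionary ) {
      final StringBuilder asWritten = new StringBuilder();
      final StringBuilder normalized = new StringBuilder();
      for ( String token : tokens ) {
         if ( asWritten.length() > 0 ) {
            asWritten.append( ' ' );
            normalized.append( ' ' );
         }
         asWritten.append( token );
         normalized.append( normalizedForms.getOrDefault( token, token ) );
      }
      return dictionary.contains( asWritten.toString() )
            || dictionary.contains( normalized.toString() );
   }

   public static void main( final String[] args ) {
      final Set<String> dictionary = new HashSet<>( Arrays.asList( "lymph node" ) );
      final Map<String, String> normalizedForms
            = Collections.singletonMap( "nodes", "node" );
      // "lymph nodes" is not in the dictionary, but "lymph node" is.
      System.out.println(
            matches( Arrays.asList( "lymph", "nodes" ), normalizedForms, dictionary ) );
   }
}
```

 Without the normalized form, the same text would not match, which is why a null normalizedForm field reduces recall.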

 -Original Message-
 From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
 Sent: Thursday, April 17, 2014 4:33 PM
 To: dev@ctakes.apache.org
 Subject: Re: lvg entries

 Quick follow-up since I was interested. The current dependency parser 
 does have the option to use ctakes lemmas or do its own lemmatizing, 
 but that doesn't use the lemma field, it uses the normalizedForm 
 field. I'm not sure if that field is actually ever filled in -- on my 
 example data it is always null.

 Tim

 On 04/17/2014 01:57 PM, Masanz, James J. wrote:
 Offhand I recall at least one of the dependency parsers used the 
 Lemma
 annotations at one point.
 Not sure if still does.

 There is an option for turning off the posting of the lemmas to the cas.

 Hope that helps

 -Original Message-
 From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
 Sent: Thursday, April 17, 2014 11:27 AM
 To: dev@ctakes.apache.org
 Subject: lvg entries

 The LVG annotator creates an enormous number of lemmas for every 
 WordToken in the CAS, and I'm wondering what the original purpose 
 was? I think this is probably a minor bottleneck for speed but 
 mostly a pretty big space hog (at least 50% of the space of xmi files in my 
 tests).

 As of right now I'm not sure if any downstream components are using 
 these lemmas, and on a manual inspection the precision seems to be 
 pretty abysmal (meaning most of them are nonsensical as lexical 
 variants), so as I said, just wondering if we can revisit why cTAKES 
 generates so many and whether that component can be optimized.

 Thanks
 Tim

RE: ytex merged into trunk

2014-04-28 Thread Finan, Sean

Hi Vijay,

I did a checkout this morning and I'm getting compile errors from Maven.

If I just run mvn compile then I get an error while building ytex claiming that 
the package has not been created.  Is there a reversed dependency?

If I run mvn compile package then ytex seems to run through, but there is an 
error in the test of ytex-uima (see below).

Any ideas?

Thanks,
Sean


Running org.apache.ctakes.ytex.uima.annotators.SparseDataExporterTest
...
2014-04-28 10:50:43,074 INFO  org.hibernate.dialect.Dialect  - HHH000400:
    Using dialect: org.hibernate.dialect.HSQLDialect
2014-04-28 10:50:43,112 WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper
    - SQL Error: -22, SQLState: S0002
2014-04-28 10:50:43,112 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper
    - Table not found in statement [select uimatype0_.uima_type_id as
    uima_typ1_21_, uimatype0_.uima_type_name as uima_typ2_21_,
    uimatype0_.table_name as table_na3_21_ from PUBLIC.ref_uima_type
    uimatype0_]
...
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.277 sec
    <<< FAILURE!

Results :

Tests in error:
  test(org.apache.ctakes.ytex.uima.annotators.DBCollectionReaderTest):
    Unable to initialize group definition. Group resource name
    [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key
    [ytexApplicationContext]; nested exception is
    org.springframework.beans.factory.BeanCreationException: Error creating
    bean with name 'ytexApplicationContext' defined in URL
    [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]:
    Instantiation of bean failed; nested exception is
    org.springframework.beans.BeanInstantiationException: Could not
    instantiate bean class
    [org.springframework.context.support.ClassPathXmlApplicationContext]:
    Constructor threw exception; nested exception is
    org.springframework.beans.factory.BeanCreationException: Error creating
    bean with name 'documentMapperService' defined in class path resource
    [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init
    method failed; nested exception is
    org.hibernate.exception.SQLGrammarException: could not prepare statement
  org.apache.ctakes.ytex.uima.annotators.DBConsumerTest: Unable to
    initialize group definition. Group resource name
    [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key
    [ytexApplicationContext]; nested exception is
    org.springframework.beans.factory.BeanCreationException: Error creating
    bean with name 'ytexApplicationContext' defined in URL
    [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]:
    Instantiation of bean failed; nested exception is
    org.springframework.beans.BeanInstantiationException: Could not
    instantiate bean class
    [org.springframework.context.support.ClassPathXmlApplicationContext]:
    Constructor threw exception; nested exception is
    org.springframework.beans.factory.BeanCreationException: Error creating
    bean with name 'documentMapperService' defined in class path resource
    [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init
    method failed; nested exception is
    org.hibernate.exception.SQLGrammarException: could not prepare statement
  org.apache.ctakes.ytex.uima.annotators.DBConsumerTest
  testDictionaryLookupIntegrated(org.apache.ctakes.ytex.uima.annotators.DictionaryLookupAnnotatorTest):
    Initialization of annotator class
    org.apache.ctakes.ytex.uima.annotators.SegmentRegexAnnotator failed.
    (Descriptor:
    file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-uima/desc/analysis_engine/SegmentRegexAnnotator.xml)
  testDictionaryLookupSimple(org.apache.ctakes.ytex.uima.annotators.DictionaryLookupAnnotatorTest)
  testDisambiguate(org.apache.ctakes.ytex.uima.annotators.SenseDisambiguatorAnnotatorTest):
    Unable to initialize group definition. Group resource name
    [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key
    [ytexApplicationContext]; nested exception is
    org.springframework.beans.factory.BeanCreationException: Error creating
    bean with name 'ytexApplicationContext' defined in URL
    [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]:
    Instantiation of bean failed; nested exception is
    org.springframework.beans.BeanInstantiationException: Could not
    instantiate bean class
    [org.springframework.context.support.ClassPathXmlApplicationContext]:
    Constructor threw exception; nested exception is
    org.springframework.beans.factory.BeanCreationException: Error creating
    bean with name 'documentMapperService' defined in class path resource
    [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init
    method failed; nested exception is
    org.hibernate.exception.SQLGrammarException: could not prepare statement
  org.apache.ctakes.ytex.uima.annotators.SparseDataExporterTest: Unable to
    initialize group definition. Group resource name

RE: ytex merged into trunk

2014-04-28 Thread Finan, Sean

Completely new error.  I have taken this offline until we figure out what is 
going on.

-Original Message-
From: vijay garla [mailto:vnga...@gmail.com] 
Sent: Monday, April 28, 2014 1:47 PM
To: dev@ctakes.apache.org
Subject: Re: ytex merged into trunk

Hello All,

I can't reproduce this build error.  It appears that maven does not want to run 
copy-dependencies in the compile phase.  However, I have tried building this 
with maven 3.2.1 and maven 3.1.0 and it works fine for both.

@Sean - can you send me the output of mvn -X clean install -pl ctakes-ytex 
(executed from the ctakes root dir)?

This is the plugin that maven is complaining about:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>copy-dependencies</id>
      <phase>compile</phase>
      <goals>
        <goal>copy-dependencies</goal>
      </goals>
      <configuration>
        <outputDirectory>${basedir}/target/lib</outputDirectory>
        <overWriteReleases>false</overWriteReleases>
        <overWriteSnapshots>false</overWriteSnapshots>
        <overWriteIfNewer>true</overWriteIfNewer>
      </configuration>
    </execution>
  </executions>
</plugin>

On Mon, Apr 28, 2014 at 1:26 PM, vijay garla vnga...@gmail.com wrote:

 sorry about that.  I will investigate.

 -vj

 On Mon, Apr 28, 2014 at 11:00 AM, Finan, Sean  
 sean.fi...@childrens.harvard.edu wrote:

 Hi Vijay,

 I did a checkout this morning and I'm getting compile errors from Maven.

 If I just run mvn compile then I get an error while building ytex 
 claiming that the package has not been created.  Is there a reversed 
 dependency?

 If I run mvn compile package then ytex seems to run through, but 
 there is an error in the test of ytex-uima (see below).

 Any ideas?

 Thanks,
 Sean

 Running org.apache.ctakes.ytex.uima.annotators.SparseDataExporterTest
 ...
 2014-04-28 10:50:43,074 INFO  org.hibernate.dialect.Dialect  - HHH000400:
     Using dialect: org.hibernate.dialect.HSQLDialect
 2014-04-28 10:50:43,112 WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper
     - SQL Error: -22, SQLState: S0002
 2014-04-28 10:50:43,112 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper
     - Table not found in statement [select uimatype0_.uima_type_id as
     uima_typ1_21_, uimatype0_.uima_type_name as uima_typ2_21_,
     uimatype0_.table_name as table_na3_21_ from PUBLIC.ref_uima_type
     uimatype0_]
 ...
 Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.277 sec
     <<< FAILURE!

 Results :

 Tests in error:
   test(org.apache.ctakes.ytex.uima.annotators.DBCollectionReaderTest):
     Unable to initialize group definition. Group resource name
     [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key
     [ytexApplicationContext]; nested exception is
     org.springframework.beans.factory.BeanCreationException: Error creating
     bean with name 'ytexApplicationContext' defined in URL
     [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]:
     Instantiation of bean failed; nested exception is
     org.springframework.beans.BeanInstantiationException: Could not
     instantiate bean class
     [org.springframework.context.support.ClassPathXmlApplicationContext]:
     Constructor threw exception; nested exception is
     org.springframework.beans.factory.BeanCreationException: Error creating
     bean with name 'documentMapperService' defined in class path resource
     [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init
     method failed; nested exception is
     org.hibernate.exception.SQLGrammarException: could not prepare statement
   org.apache.ctakes.ytex.uima.annotators.DBConsumerTest: Unable to
     initialize group definition. Group resource name
     [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key
     [ytexApplicationContext]; nested exception is
     org.springframework.beans.factory.BeanCreationException: Error creating
     bean with name 'ytexApplicationContext' defined in URL
     [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]:
     Instantiation of bean failed; nested exception is
     org.springframework.beans.BeanInstantiationException: Could not
     instantiate bean class
     [org.springframework.context.support.ClassPathXmlApplicationContext]:
     Constructor threw exception; nested exception is
     org.springframework.beans.factory.BeanCreationException: Error creating
     bean with name 'documentMapperService' defined in class path resource
     [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init
     method failed; nested exception is
     org.hibernate.exception.SQLGrammarException: could not prepare statement
   org.apache.ctakes.ytex.uima.annotators.DBConsumerTest
   testDictionaryLookupIntegrated(org.apache.ctakes.ytex.uima.annotators.DictionaryLookupAnnotatorTest):
     Initialization of annotator class
     org.apache.ctakes.ytex.uima.annotators.SegmentRegexAnnotator failed.
     (Descriptor:
     file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-uima/desc/analysis_engine

RE: Preparing for an Apache cTAKES 3.2 Release?

2014-06-11 Thread Finan, Sean

 it would be incredibly helpful to have thorough documentation

I agree.  There is some documentation in the module's doc/ directory, but it is 
very brief.  There are also some example descriptors in the example/ directory. 
 The -resource also has some example xmls and dictionaries.

It isn't much, but I have a small plate heaped with large portions of many 
courses and very little time to document.  If there are questions please write 
me and I'll update the documentation as necessary.  Anybody else that feels 
inclined can also add to the docs.  Eventually the documentation should be 
moved to reside with the rest of the cTakes docs.

Sean

-Original Message-
From: vijay garla [mailto:vnga...@gmail.com] 
Sent: Wednesday, June 11, 2014 9:33 AM
To: dev@ctakes.apache.org
Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

regardless of the name, I think it would be incredibly helpful to have thorough 
documentation on the dictionary lookup, how to configure it, and how to create 
new dictionaries.  I would venture to say that this is the most important 
component in cTAKES, and probably the one that has generated the most questions 
on the newsgroup.



On Wed, Jun 11, 2014 at 9:21 AM, Finan, Sean  
sean.fi...@childrens.harvard.edu wrote:

 . The newer NER should have in its name the Behavior...

 I agree, but the *2 module is a complete replacement for the current 
 lookup.  It does not (really) have any different behavior, just a 
 different implementation and performance.  We plan to swap out the old 
 with the new in the next release and get rid of the *2 suffix.  So, 
 any name provided now is just temporary - unless people don't like the 
 name dictionary-lookup at all.

 In my original sandbox it was named RareWordLookup, a nod to its 
 implementation.  However, this doesn't help any users.
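
 The thread does not spell out the algorithm, so the following is only a sketch of what a "rare word" lookup could look like, assuming the name refers to indexing each dictionary term under its least frequent word so that only terms keyed on a token are checked at that token.  All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a rare-word index, not the cTAKES module.
final class RareWordIndexSketch {

   // One dictionary term plus the position of its rarest word.
   private static final class Entry {
      final String[] words;
      final int rareIndex;

      Entry( final String[] words, final int rareIndex ) {
         this.words = words;
         this.rareIndex = rareIndex;
      }
   }

   private final Map<String, List<Entry>> entriesByRareWord = new HashMap<>();

   RareWordIndexSketch( final Collection<String> terms ) {
      final List<String[]> split = new ArrayList<>();
      final Map<String, Integer> counts = new HashMap<>();
      // First pass: count how often each word occurs across all terms.
      for ( String term : terms ) {
         final String[] words = term.toLowerCase().split( "\\s+" );
         split.add( words );
         for ( String word : words ) {
            counts.merge( word, 1, Integer::sum );
         }
      }
      // Second pass: file each term under its least frequent word.
      for ( String[] words : split ) {
         int rareIndex = 0;
         for ( int i = 1; i < words.length; i++ ) {
            if ( counts.get( words[ i ] ) < counts.get( words[ rareIndex ] ) ) {
               rareIndex = i;
            }
         }
         entriesByRareWord.computeIfAbsent( words[ rareIndex ], k -> new ArrayList<>() )
                          .add( new Entry( words, rareIndex ) );
      }
   }

   // Returns every term whose words occur contiguously in the text.
   List<String> find( final String text ) {
      final String[] tokens = text.toLowerCase().split( "\\s+" );
      final List<String> hits = new ArrayList<>();
      for ( int i = 0; i < tokens.length; i++ ) {
         for ( Entry entry : entriesByRareWord.getOrDefault( tokens[ i ], Collections.emptyList() ) ) {
            final int start = i - entry.rareIndex;
            final int end = start + entry.words.length;
            if ( start < 0 || end > tokens.length ) {
               continue;
            }
            if ( Arrays.equals( entry.words, Arrays.copyOfRange( tokens, start, end ) ) ) {
               hits.add( String.join( " ", entry.words ) );
            }
         }
      }
      return hits;
   }

   public static void main( final String[] args ) {
      final RareWordIndexSketch index = new RareWordIndexSketch(
            Arrays.asList( "lymph node", "lymph node biopsy", "node" ) );
      System.out.println( index.find( "Biopsy of a lymph node" ) );
   }
}
```

 Common words such as "node" then trigger very few candidate checks, which is one plausible reason such a lookup would be faster.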

 Sean

 -Original Message-
 From: andy mcmurry [mailto:mcmurry.a...@gmail.com]
 Sent: Wednesday, June 11, 2014 3:09 AM
 To: dev@ctakes.apache.org
 Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

 2 doesn't mean much. The newer NER should have in its name the 
 Behavior...

 Perhaps something like MetaMap Usage
 http://metamap.nlm.nih.gov/Docs/MM09_Usage.shtml --allow_overmatches
 or  --allow_concept_gaps or .other?

 Since yTex already provides a pluggable *DictionaryLookup, *that seems 
 like the best place to define the differing Behavior /  Usage.

 https://cwiki.apache.org/confluence/display/CTAKES/User's+Guide
 https://code.google.com/p/ytex/wiki/DictionaryLookup_V05


 AndyMC

 On Tue, Jun 10, 2014 at 9:55 AM, britt fitch britt.fi...@gmail.com
 wrote:

  I don’t have an issue with the *-2 name. I also don’t have any 
  objections to renaming it.
 
  It might be nice to keep the old dictionary code around for a 
  release-worth of time but after that I would vote purging it.
  If someone needs it after that it’ll be accessible in the archived 
  releases.
 
 
 
  On Jun 10, 2014, at 12:48 PM, Chen, Pei 
  pei.c...@childrens.harvard.edu
  wrote:
 
   I think James has a fair point here.
   It may be worthwhile biting the bullet here and push forward.
  
   Since this essentially will be a full replacement of the
  ctakes-dictionary-lookup module, a good option may be to just replace 
  the entire module now and rename the existing module to * _deprecated.
   How do folks feel about that?  In a nutshell,
   ctakes-dictionary-lookup-2
  is a faster algorithm with a simpler code base- and comparable 
  results (Sean has a full comparison in the documentation for those 
  who are
 curious).
  
   --Pei
  
   -Original Message-
   From: britt fitch [mailto:britt.fi...@gmail.com]
   Sent: Monday, June 09, 2014 5:42 PM
   To: dev@ctakes.apache.org
   Subject: Re: Preparing for an Apache cTAKES 3.2 Release?
  
   There is some documentation in the dictionary2 module under 
   /doc/DictionaryLookupHelp.{txt | docx} that gives some some 
   details of
  the
   different lookup implementation options within that module that I 
   found helpful.
  
  
   On Jun 9, 2014, at 5:17 PM, Masanz, James J.
   masanz.ja...@mayo.edu
   wrote:
  
  
   Will ctakes-dictionary-lookup2 remain the name for the new 
   dictionary
   lookup or will it have a name that reflects the algorithm?
  
   Is there a description of it that will help users to decide when 
   to
  use one
   dictionary lookup component vs. the other.
  
   -- James
  
   -Original Message-
   From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
   Sent: Friday, June 06, 2014 12:34 PM
   To: dev@ctakes.apache.org
   Subject: Preparing for an Apache cTAKES 3.2 Release?
  
   Hi,
   The 3.2 release was slated to be released at the end of this month (Jun 21).
   Since I volunteered to be the RM for this release, just like the 
   past
   releases, I was planning to create a branch/tag next week from 
   trunk and dev can continue.
   Feel free to take a look at any outstanding Jira issues [1] that 
   you
  may want

RE: Preparing for an Apache cTAKES 3.2 Release?

2014-06-16 Thread Finan, Sean

I guess that I've got one question at this point:

Is the name being given to the -new- dictionary lookup module temporary or 
permanent?  

I was under the assumption that it was temporary and that with the switch to it 
being default (and eventually only) the module would simply be named 
dictionary-lookup.



-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Monday, June 16, 2014 11:24 AM
To: 'dev@ctakes.apache.org'
Subject: RE: Preparing for an Apache cTAKES 3.2 Release?

I'd rather something else than dictionary-lookup-fast. If we come up with 
something even faster than this one, having an older one called fast could be 
confusing.

-Original Message-
From: Dligach, Dmitriy [mailto:dmitriy.dlig...@childrens.harvard.edu]
Sent: Monday, June 16, 2014 9:55 AM
To: cTAKES Developer list
Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

+1

Dima




On Jun 16, 2014, at 9:42, Miller, Timothy 
timothy.mil...@childrens.harvard.edu wrote:

 Sorry to weigh in so late on this -- just returned from vacation. If 
 we want to have a one release delay before making dictionary2 default 
 for testing/documentation/configuration purposes, and there isn't an 
 obvious function-related name, and the main difference is speed, maybe 
 we could call it dictionary-lookup-fast? Besides being accurate and 
 more descriptive than 2, it might lure people into trying it and 
 give us some feedback.
 
 Tim
 
 
 On 06/16/2014 10:34 AM, Chen, Pei wrote:
 I'm making some significant updates to trunk that may cause some instability 
 for this release.
 It should be mostly transparent, but let me know if you encounter any issues 
 with trunk.
 
 Also, regarding the dictionary-lookup2.  If there are no strong objections, 
 we can leave default to as-is (old behavior).  Folks who wish to give the 
 new one a try are welcome to do so and we can change the default behavior in 
 a future release.
 
 [ducks for cover now]
 --Pei
 
 -Original Message-
 From: ksa...@gmail.com [mailto:ksa...@gmail.com] On Behalf Of 
 Karthik Sarma
 Sent: Wednesday, June 11, 2014 9:58 AM
 To: dev@ctakes.apache.org
 Subject: Re: Preparing for an Apache cTAKES 3.2 Release?
 
 Agreed
 
 On Wednesday, June 11, 2014, vijay garla vnga...@gmail.com wrote:
 
 regardless of the name, I think it would be incredibly helpful to 
 have thorough documentation on the dictionary lookup, how to 
 configure it, and how to create new dictionaries.  I would venture 
 to say that this is the most important component in cTAKES, and 
 probably the one that has generated the most questions on the newsgroup.
 
 
 
 On Wed, Jun 11, 2014 at 9:21 AM, Finan, Sean  
 sean.fi...@childrens.harvard.edu wrote:
 
 . The newer NER should have in its name the Behavior...
 I agree, but the *2 module is a complete replacement for the 
 current lookup.  It does not (really) have any different behavior, 
 just a
 different
 implementation and performance.  We plan to swap out the old with 
 the new in the next release and get rid of the *2 suffix.  So, any 
 name provided now is just temporary - unless people don't like the 
 name dictionary-lookup at all.
 
 In my original sandbox it was named RareWordLookup, a nod to its 
 implementation.  However, this doesn't help any users.
 
 Sean
 
 -Original Message-
 From: andy mcmurry [mailto:mcmurry.a...@gmail.com]
 Sent: Wednesday, June 11, 2014 3:09 AM
 To: dev@ctakes.apache.org
 Subject: Re: Preparing for an Apache cTAKES 3.2 Release?
 
 2 doesn't mean much. The newer NER should have in its name the 
 Behavior...
 
 Perhaps something like MetaMap Usage 
 http://metamap.nlm.nih.gov/Docs/MM09_Usage.shtml --
 allow_overmatches
 or  --allow_concept_gaps or .other?
 
 Since yTex already provides a pluggable *DictionaryLookup, *that 
 seems like the best place to define the differing Behavior /  Usage.
 
 https://cwiki.apache.org/confluence/display/CTAKES/User's+Guide
 https://code.google.com/p/ytex/wiki/DictionaryLookup_V05
 
 
 AndyMC
 
 On Tue, Jun 10, 2014 at 9:55 AM, britt fitch 
 britt.fi...@gmail.com
 wrote:
 
 I don't have an issue with the *-2 name. I also don't have any 
 objections to renaming it.
 
 It might be nice to keep the old dictionary code around for a 
 release-worth of time but after that I would vote purging it.
 If someone needs it after that it'll be accessible in the 
 archived releases.
 
 
 
 On Jun 10, 2014, at 12:48 PM, Chen, Pei 
 pei.c...@childrens.harvard.edu
 wrote:
 
 I think James has a fair point here.
 It may be worthwhile biting the bullet here and push forward.
 
 Since this essentially will be a full replacement of the
 ctakes-dictionary-lookup module, a good option may be to just 
 replace the entire module now and rename the existing module to *
 _deprecated.
 How do folks feel about that?  In a nutshell,
 ctakes-dictionary-lookup-2
 is a faster algorithm with a simpler code base- and comparable 
 results (Sean has a full comparison

RE: DeepPheno: guidance on CTakes

2014-06-27 Thread Finan, Sean

Hi Pei,

Nice examples.  The pipeline builder could be simpler (divvied), but they 
shouldn't leave anybody confused.

+1 for the uimafit annotations!

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] 
Sent: Friday, June 27, 2014 11:11 AM
To: Hochheiser, Harry Stewart; dev@ctakes.apache.org
Subject: RE: DeepPheno: guidance on CTakes

+dev
Harry,
I've just checked in two example Java classes [1] that should make life a 
lot easier for developers to create and add new cTAKES Annotators.
It will shield users initially from all of the complexities of UIMA, XML 
Descriptors, cTAKES, etc.

Just check out the latest: 
svn co http://svn.apache.org/repos/asf/ctakes/trunk
mvn clean compile

--Pei
[1] 
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-examples/src/main/java/org/apache/ctakes/examples/

 -Original Message-
 From: Hochheiser, Harry Stewart [mailto:har...@pitt.edu]
 Sent: Thursday, June 26, 2014 5:31 PM
 To: Chen, Pei
 Subject: DeepPheno: guidance on CTakes
 
 Pei:
 
 As I'm now digging into cTAKES as part of our DeepPheno project (and 
 some other related efforts), I'm hoping you can help with a quick 
 question. Is there any guide/documentation on the process for adding 
 new annotators to cTAKES?  I've dug into the apache site and mailing 
 list archives, but haven't had much luck.
 
 Thanks!
 
 -harry
 
 
 Harry Hochheiser
 University of Pittsburgh
 Department of Biomedical Informatics
 har...@pitt.edu   412 648 9300

RE: Bacterium Dictionary

2014-06-30 Thread Finan, Sean

Hi Nick,
There are ~26,000 T007 Bacterium (falls under Living Being) entries in UMLS 
2013aa.  They aren't in the cTakes dictionary, but you can build a separate 
bacteria dictionary using the dictionary creator tool in the cTakes sandbox.  
It can create dictionaries formatted for use with both available 
cTakes-dictionary-lookup modules.  I have a full living beings dictionary; if 
you can confirm your UMLS license, I could pull out the bacteria for you.
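
For anyone who wants to pull the T007 concepts out themselves, here is a minimal sketch.  It assumes pipe-delimited rows with the CUI in the first column and the TUI in the second, as in the UMLS MRSTY.RRF file.  It is not the sandbox dictionary creator tool; the class name is hypothetical and the CUIs in the example are made-up placeholders.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not the cTakes sandbox dictionary creator.
final class SemanticTypeFilterSketch {

   // Collects the CUIs of all rows whose second column equals the wanted TUI.
   static Set<String> cuisWithTui( final Iterable<String> rows, final String tui ) {
      final Set<String> cuis = new LinkedHashSet<>();
      for ( String row : rows ) {
         final String[] columns = row.split( "\\|" );
         if ( columns.length >= 2 && columns[ 1 ].equals( tui ) ) {
            cuis.add( columns[ 0 ] );
         }
      }
      return cuis;
   }

   public static void main( final String[] args ) {
      // Placeholder rows; real rows would be read from a licensed UMLS install.
      final List<String> rows = Arrays.asList(
            "C0000001|T007|A1.1.2|Bacterium|",
            "C0000002|T047|B2.2.1.2.1|Disease or Syndrome|",
            "C0000003|T007|A1.1.2|Bacterium|" );
      System.out.println( cuisWithTui( rows, "T007" ) );
   }
}
```

The resulting CUI set could then be used to select which terms go into a separate bacteria dictionary.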
Sean

 -Original Message-
 From: Pei Chen [mailto:chen...@apache.org]
 Sent: Monday, June 30, 2014 12:50 PM
 To: dev@ctakes.apache.org
 Subject: Re: Bacterium Dictionary
 
 Nick,
 I am not sure how complete it is, but I believe the UMLS has the semantic type
 of
 
 Bacterium
 https://uts.nlm.nih.gov//semanticnetwork.html#Bacterium;0;0;2014AA
 
  [T007]
   It's most likely not included in the default cTAKES dictionaries though...
 
 Thanks,
 Pei
 
 
 On Mon, Jun 30, 2014 at 10:31 AM, Nick Nikandish 
 snika...@emerginghealthit.com wrote:
 
   Hi there,
 
 
 
  I was wondering if Ctakes has any Bacterium Dictionary? I need to
  extract information for bacteria like “Enterococcus Faecium”,
  “Pseudomonas Aeruginosa “ , etc  and I was wondering if I can do it by
  using Ctakes annotators?
 
 
 
  Thanks,
 
 
 
  *Nick Nikandish*
 
  *Product Development Software Engineer*
 
  Clinical Research Informatics
 
 
 
  *Emerging Health*
 
  *Montefiore Information Technology*
 
  6 Executive Blvd. Suite 290, Yonkers, NY 10701
 
  914-457-6792 Office
 
  snika...@montefiore.org
 
  www.emerginghealthit.com
 
  www.montefiore.org
 
 
 

RE: [VOTE] Release Apache cTAKES 3.2.0

2014-07-02 Thread Finan, Sean

+1

Pulled fresh candidate, built, and ran Clinical using CPE without problem.  
Other than that, no testing.  SVN gave me a problem initially (checked out as 
anonymous) asking for a password then flunking the checkout, but an update 
completed it.  I blame the heat.

From: Masanz, James J. [masanz.ja...@mayo.edu]
Sent: Monday, June 30, 2014 10:24 PM
To: dev@ctakes.apache.org
Subject: RE: [VOTE] Release Apache cTAKES 3.2.0

This is pretty obvious, but since this is a record of what was voted upon,
note that some of the URLs contain an extra

ctakes-3.2.0/

For example
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz

should be just
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz

-- James

From: Pei Chen [chen...@apache.org]
Sent: Friday, June 27, 2014 5:15 PM
To: dev@ctakes.apache.org
Subject: [VOTE] Release Apache cTAKES 3.2.0

Hi all,

This is a call for a vote on releasing the following candidate (rc1) as
Apache cTAKES 3.2.0.
The major changes include:
- New optional YTEX component(s) (Yale Extensions to cTAKES)
- New optional improved/faster dictionary lookup (dictionary-lookup-fast)
- New optional Temporal component (Time + Event extraction.  Relations will
be included in a future release.)
- Other bug fixes/enhancements from Jira

[TODO: Online documentation still needs to be updated on wiki for the abo]

For more detailed information on the changes/release notes, please visit:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313621&version=12324066

The release was made using the cTAKES release process documented here:
http://ctakes.apache.org/ctakes-release-guide.html

The candidate is available at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz
/.zip

The tag to be voted on:
http://svn.apache.org/repos/asf/ctakes/tags/ctakes-3.2.0-rc1/

The MD5 checksum of the tarball can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz.md5
/.zip.md5

The signature of the tarball can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz.asc
/.zip.asc

Apache cTAKES' KEYS file, containing the PGP keys used to sign the release:
https://dist.apache.org/repos/dist/release/ctakes/KEYS
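
A minimal sketch of checking a downloaded artifact against its published MD5 value, using only the JDK. The class and method names are hypothetical, and the expected digest is assumed to be the hex string copied from the .md5 file. This covers the checksum only; verifying the .asc signature against the KEYS file is a separate step.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Check {
    /** Streams the input through MD5 and returns the lowercase hex digest. */
    static String md5Hex(InputStream in) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            md.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Convenience overload for in-memory data. */
    static String md5Hex(byte[] data) {
        try {
            return md5Hex(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the downloaded artifact, args[1]: hex digest from the .md5 file
        if (args.length == 2) {
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                boolean ok = md5Hex(in).equalsIgnoreCase(args[1].trim());
                System.out.println(ok ? "checksum OK" : "checksum MISMATCH");
            }
        }
    }
}
```

Streaming the file keeps memory use flat, which matters for the larger binary artifacts.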

Please vote on releasing these packages as Apache cTAKES 3.2.0. The vote is
open for at least the next 72 hours.
Only votes from the cTAKES PMC are binding, but folks are welcome to check
the release candidate and voice their approval or disapproval.
The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache cTAKES 3.2.0
[ ] -1 Do not release the packages because...

Also, the convenience binary can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-bin.tar.gz
/.zip
Note: It's temporarily on people.a.o because the artifacts were too large
for https://dist.apache.org/repos/dist/dev/ctakes (Working with infra on
increasing the limit).

Thanks!

RE: [VOTE] Release Apache cTAKES 3.2.0 (rc2)

2014-07-10 Thread Finan, Sean

+1 for the ytex method of handling a umls login before download of the umls 
resources.  While this also doesn't truly prevent people from sharing files 
(data) without a umls account, it is a little bit of a nicer mechanism.

Aside ...  Does anybody out there have experience with izpack?  (izpack.org)  
Creation of an InstallAnywhere style module is under consideration ...

 -Original Message-
 From: vijay garla [mailto:vnga...@gmail.com]
 Sent: Wednesday, July 09, 2014 10:30 AM
 To: dev@ctakes.apache.org
 Subject: Re: [VOTE] Release Apache cTAKES 3.2.0 (rc2)
 
 ctakes-ytex-lib-3.1.2-SNAPSHOT.zip
 https://ytex.googlecode.com/files/ctakes-ytex-lib-3.1.2-SNAPSHOT.zip
 - this contains non-asf compliant ytex libs.  I would like to add it to the
 sourceforge site / or add it to the ctakes resources directly (that way users
 simply have to unzip a single zip file)
 
 ctakes-ytex-resources-3.1.2-SNAPSHOT.zip
 http://www.ytex-nlp.org/umls.download/secure/3.1/ctakes-ytex-resources-3.1.2-SNAPSHOT.zip
 - this contains data derived from the UMLS - concept graphs and dictionary
 lookup tables.  downloading this requires a UTS login.  It is conceptually no
 different from the ctakes resources, so I believe it would be OK to add it to
 that zip file, but I'm not a lawyer.
 
 On another note: I think forcing users to specify the UTS username/password
 and contacting NIH every time you run cTAKES is problematic, and doesn't
 prevent users who don't have a valid UTS login from viewing the data contained
 in the lucene index dictionary.  I personally believe requiring a UTS login to
 download would be the best way to make resources derived from the UMLS
 available to users (this is what I'm doing for ytex-resources).
 
 to summarize: for now, I would like to add the ytex libs to the ctakes
 resources zip.
 
 -vj
 
 
 
 
 On Wed, Jul 9, 2014 at 4:04 PM, Chen, Pei pei.c...@childrens.harvard.edu
 wrote:
 
  The maven artifacts are also available in the staging area:
  https://repository.apache.org/content/repositories/orgapachectakes-1001
  VJ: Just curious- how did you envision ytex users downloading the
  jars/war? From the distro bin.zip or from maven central?
 
  --Pei
 
   -Original Message-
   From: Pei Chen [mailto:chen...@apache.org]
   Sent: Tuesday, July 08, 2014 6:11 PM
   To: dev@ctakes.apache.org
   Subject: [VOTE] Release Apache cTAKES 3.2.0 (rc2)
  
   Hi all,
  
   The main difference between rc1 and rc2 is that we removed the lvg-res
   and assertion-res.jar from the distro.  They still need to be unpacked.
  
   This is a call for a vote on releasing the following candidate (rc2) as
   Apache cTAKES 3.2.0.
   The major changes include:
   - New optional YTEX component(s) (Yale Extensions to cTAKES)
   - New optional improved/faster dictionary lookup (dictionary-lookup-fast)
   - New optional Temporal component (Time + Event extraction.  Relations
   will be included in a future release.)
   - Other bug fixes/enhancements from Jira
  
   [TODO: Online documentation still needs to be updated on wiki]
  
   For more detailed information on the changes/release notes, please visit:
   https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313621&version=12324066
  
   The release was made using the cTAKES release process documented here:
   http://ctakes.apache.org/ctakes-release-guide.html
  
   The candidate is available at:
   http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-src.tar.gz
   /.zip
  
   The tag to be voted on:
   http://svn.apache.org/repos/asf/ctakes/tags/ctakes-3.2.0-rc2
  
   The MD5 checksum of the tarball can be found at:
   http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-src.tar.gz.md5
   /.zip.md5
  
   The signature of the tarball can be found at:
   http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-src.tar.gz.asc
   /.zip.asc
  
   Apache cTAKES' KEYS file, containing the PGP keys used to sign the release:
   https://dist.apache.org/repos/dist/release/ctakes/KEYS
  
   Please vote on releasing these packages as Apache cTAKES 3.2.0. The vote
   is open for at least the next 72 hours.
   Only votes from the cTAKES PMC are binding, but folks are welcome to check
   the release candidate and voice their approval or disapproval.
   The vote passes if at least three binding +1 votes are cast.
  
   [ ] +1 Release the packages as Apache cTAKES 3.2.0
   [ ] -1 Do not release the packages because...
  
   Also, the convenience binary can be found at:
   http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-bin.tar.gz
   /.zip
  
   Note: It's temporarily on people.a.o because the artifacts were too large
   for https://dist.apache.org/repos/dist/dev/ctakes (Working with infra on
   increasing the limit).
  
  
   Thanks!

RE: Lucene for UMLS2014

2014-07-21 Thread Finan, Sean

Hi Harpreet,

If you are willing to use cTakes 3.2, try the dictionary-lookup-fast module as 
a replacement of the default dictionary-lookup.  That module has a new 
dictionary resource (hsql, not lucene) and slightly different methods for 
lookup and matching.  In time trials it has been faster than the default module 
(hence the name).  Accuracy depends upon the parameter settings, but in the 
tests performed so far the results are comparable or better.  The new 
dictionary is much leaner than the current default dictionary, small enough to 
port from the hsql cached version to a hsql in-memory version.  Using the 
in-memory version makes dictionary lookup practically instantaneous (hundredths 
of a second).  Limited documentation is available in the module's doc/ 
directory.

I will be on vacation for a week, but please don't hesitate to write if you 
have any questions.

Sean

From: Harpreet Khanduja [hsk5...@rit.edu]
Sent: Thursday, July 17, 2014 5:07 PM
To: dev@ctakes.apache.org
Subject: Lucene for UMLS2014

Hello,
I would be grateful if someone could help.

I created a lucene index for umls2014 but only for snomed vocabulary.
I did this because I thought this would reduce the dictionary look up
time.
But it is still almost the same. Is there any other way to improve the
dictionary look up time?

Thank you,
Harpreet

RE: question about sentence segmentation

2014-08-02 Thread Finan, Sean

Hi Tim,

 It would be preferable to me to put sentence breaks in between the sections,
 so the first two sentences would be:
 
 1) PE: Lymphnodes...
 2) Lungs: normal...

The punctuation is (always) after the logical break, being Term:  for a 
Term:Definition list.  I think that the first three sentences should be
1) PE:
2) Lymphnodes: neck and ...
3) CV: regular and ...
Where the first line is an overarching Term: sentence (tree root), because each 
Term:Definition line that follows is within the physical exam.
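
A minimal sketch of the Term: split described above, assuming a term is a capitalized word immediately followed by a colon. The class name and the regular expression are illustrative assumptions, not part of the cTAKES sentence detector, and the pattern would also split on any other capitalized word that happens to precede a colon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TermSectionSplitter {
    // A capitalized word directly followed by a colon, e.g. "PE:" or "Lungs:".
    private static final Pattern TERM = Pattern.compile("\\b[A-Z][A-Za-z]*:");

    /** Starts a new segment at every Term: occurrence. */
    static List<String> split(String text) {
        List<String> segments = new ArrayList<>();
        Matcher m = TERM.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.start() > start) {
                String segment = text.substring(start, m.start()).trim();
                if (!segment.isEmpty()) {
                    segments.add(segment);
                }
            }
            start = m.start();
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            segments.add(tail);
        }
        return segments;
    }

    public static void main(String[] args) {
        String note = "PE: Lymphnodes: neck and axilla without adenopathy "
                + "Lungs: normal and clear to auscultation";
        for (String segment : split(note)) {
            System.out.println(segment);
        }
    }
}
```

Because "PE:" is immediately followed by another term, it comes out as its own segment, which matches the tree-root reading above.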

Just an fyi.  Does that make sense?  Haven't had my coffee ...
Sean

 -Original Message-
 From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
 Sent: Saturday, August 02, 2014 7:44 AM
 To: dev@ctakes.apache.org
 Subject: RE: question about sentence segmentation
 
 I'm annotating some oncology notes from SHARP right now, and they are
 basically a nightmare for our current sentence segmentation model. Mainly
 because they eschew explicit markers between sentences. I thought I'd ping the
 list with some interesting examples just in case it stimulates ideas. But it
 seems to me that at some point we'll have to augment the opennlp module
 (preferable) or roll our own to handle cases like these.
 
 In this example a bunch of background is on one line with no punctuation
 between logical breaks:
 PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to
 auscultation CV: regular rate and rhythm without murmur or gallop , S1, S2
 normal, no murmur, click, rub or gal*, chest is clear without rales or 
 wheezing,
 no pedal edema, no JVD, no hepatosplenomegaly Breast: negative findings
 right/left breast with mild swelling, warmth, mild erythema, slightly tender, 
 no
 seroma or hematoma Abdomen: Abdomen soft, non-tender.
 
 It would be preferable to me to put sentence breaks in between the sections,
 so the first two sentences would be:
 
 1) PE: Lymphnodes...
 2) Lungs: normal...
 
 but without any candidate characters to split the sentence I don't think it is
 possible.
 
 Another example that breaks our model in a different way (truncated):
 1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1 with
 RN chemo teach  3. S U parent study
 
 Our model will break on the period after the number, so we'd probably get:
 1.
 Baseline labwork including tumor markers 2.
 Start DD 3.
 S U parent study
 
 So the number is going in exactly the wrong place. Here it would be preferable
 to get:
 1.
 Baseline labwork...
 2.
 Start DD...
 3.
 S U parent study
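
A minimal sketch of the preferred split for inline numbered lists, assuming an item number is a run of digits plus a period that is preceded by whitespace (or the start of the text) and followed by whitespace. The class name and pattern are assumptions for illustration; a sentence that legitimately ends in a number ("give 20.") would be split wrongly by this rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InlineListSplitter {
    // Digits plus a period, standing alone between whitespace: "1.", "2.", ...
    private static final Pattern ITEM_NUMBER =
            Pattern.compile("(?<!\\S)\\d+\\.(?=\\s|$)");

    /** Emits each item number as its own sentence, followed by the item text. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = ITEM_NUMBER.matcher(text);
        int previousEnd = 0;
        while (m.find()) {
            addIfNotBlank(sentences, text.substring(previousEnd, m.start()));
            sentences.add(m.group());
            previousEnd = m.end();
        }
        addIfNotBlank(sentences, text.substring(previousEnd));
        return sentences;
    }

    private static void addIfNotBlank(List<String> sentences, String candidate) {
        String trimmed = candidate.trim();
        if (!trimmed.isEmpty()) {
            sentences.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String plan = "1. Baseline labwork including tumor markers  "
                + "2. Start DD AC on Friday 8/1 with RN chemo teach  3. S U parent study";
        for (String sentence : split(plan)) {
            System.out.println(sentence);
        }
    }
}
```

The date "8/1" is left alone because it is not followed by a period.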
 
 Anyways, just something to think about! The problem is much more complex in
 clinical data than in edited text, but I'm sure we all knew that already :)
 
 Tim
 
 
 
 From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
 Sent: Monday, July 28, 2014 2:38 PM
 To: dev@ctakes.apache.org
 Subject: Re: question about sentence segmentation
 
 Yes, you're right about that Britt. I've been doing some annotations side by
 side with a treebank viewer and think I have a pretty good handle on the
 actual rules.
 
 Basically, if a header or list identifier is followed by a period or a
 newline it is considered a sentence break and otherwise it is part of the
 sentence.
 
 e.g.
 
 1. 20 mg flomax
 
 is two sentences, while:
 
 1 - 20 mg flomax
 
 is one sentence.
 
 For headings:
 
 Allergies: Pt is allergic to aspirin.
 
 is one sentence, while:
 
 Allergies:
 Pt is allergic to aspirin.
 
 is two sentences.
 
 I'm planning to follow these guidelines.
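
A minimal sketch of these guidelines, assuming a list identifier is a leading number and a header is leading text ending in a colon. The class name and pattern are illustrative assumptions only; this is not the treebank tooling or the cTAKES segmenter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListItemSegmenter {
    // Leading list number ("1") or heading ending in a colon ("Allergies:").
    private static final Pattern LEADING_ID =
            Pattern.compile("\\s*(?:\\d+|[A-Za-z][A-Za-z ]*:)");

    /** Splits off the identifier only when a period or newline follows it. */
    static List<String> segment(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = LEADING_ID.matcher(text);
        if (m.lookingAt()) {
            int end = m.end();
            boolean period = end < text.length() && text.charAt(end) == '.';
            boolean newline = end < text.length()
                    && (text.charAt(end) == '\n' || text.charAt(end) == '\r');
            if (period || newline) {
                int idEnd = period ? end + 1 : end;
                sentences.add(text.substring(0, idEnd).trim());
                String rest = text.substring(idEnd).trim();
                if (!rest.isEmpty()) {
                    sentences.add(rest);
                }
                return sentences;
            }
        }
        // Otherwise the identifier stays part of the sentence.
        sentences.add(text.trim());
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(segment("1. 20 mg flomax").size());
        System.out.println(segment("1 - 20 mg flomax").size());
    }
}
```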
 
 Tim
 
 On 07/28/2014 01:53 PM, britt fitch wrote:
 
 Thanks for the document, Tim. It seems to not be explicit about how to handle
 sentences occurring in lists.
 
 Are you still considering having the list number as outside of the sentence?
 
 Thanks
 
 Britt
 
 On Jul 25, 2014, at 7:09 AM, Miller, Timothy
 timothy.mil...@childrens.harvard.edu wrote:
 
 
 
 Checking with Guergana and other colleagues here the advice is to have the
 sentence segmenter follow the treebank guidelines for sentence segmentation:
 http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
 
 They are a bit light on detail but fortunately we have some treebanked data
 so I will use that for the training data and hopefully that will illuminate
 the tricky cases.
 
 Tim
 
 
 From: Masanz, James J. [masanz.ja...@mayo.edu]
 Sent: Tuesday, July 15, 2014 4:39 PM
 To: dev@ctakes.apache.org
 Subject: RE: question about sentence segmentation
 
 Sorry, I don't know if there was a reason.
 
 If you haven't checked with Guergana, you might want to ask her if she had a
 reason or if it was just the way it had been since that corpus was created.
 
 -Original Message-
 From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
 Sent: Tuesday, July 15, 2014 3:34 PM
 To:

RE: code value for vocabulary in dic-lookup-fast

2014-08-06 Thread Finan, Sean

Hi Harpreet,

I don't know if this has yet been answered (I'm still finding vacation-time 
emails), but the Snomed-ct, Rx-norm, etc. codes were removed from the -fast 
dictionary for speed.  Basically, any single UMLS Cui can have multiple 
different snomed-ct codes (for instance), and adding extra rows per-code leads 
to a lot of waste.  A post-Cui-assignment step could be performed to assign 
non-unique snomed-ct codes (for instance) to discovered unique Cuis.  I am 
actually (slowly) conceptualizing an annotator that does just that - mapping 
Cuis to other source codes.  It would be an optional annotator, lean and fast.  
No promise on a date for startup code in sandbox.
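
A minimal sketch of such a post-lookup mapping step, assuming the Cui-to-code pairs have already been loaded from some source. The class, its methods, and the identifiers in main are hypothetical placeholders; this is not the planned annotator.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class CuiCodeMapper {
    // One Cui can map to several source codes, so each value is a set.
    private final Map<String, Set<String>> codesByCui = new HashMap<>();

    /** Registers one source code for a Cui. */
    void add(String cui, String code) {
        codesByCui.computeIfAbsent(cui, k -> new TreeSet<>()).add(code);
    }

    /** Returns every known code for the Cui, or an empty set if there is none. */
    Set<String> codesFor(String cui) {
        return codesByCui.getOrDefault(cui, Collections.emptySet());
    }

    public static void main(String[] args) {
        CuiCodeMapper mapper = new CuiCodeMapper();
        // Placeholder identifiers, not real UMLS or SNOMED CT values.
        mapper.add("C0000001", "1111");
        mapper.add("C0000001", "2222");
        System.out.println(mapper.codesFor("C0000001"));
    }
}
```

Keeping the codes out of the lookup dictionary and resolving them only for the Cuis actually found is what keeps the dictionary rows unique per term.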

Sean

 -Original Message-
 From: Harpreet Khanduja [mailto:hsk5...@rit.edu]
 Sent: Friday, July 25, 2014 2:33 PM
 To: dev@ctakes.apache.org
 Subject: code value for vocabulary in dic-lookup-fast
 
 Hello,
 
 I am using ctakes-dictionary-lookup-fast for annotation purposes.
 But there is no value for the code attribute, as there was when I used
 ctakes-dictionary-lookup.
 
 Is there any way I can find out the code attribute value using
 ctakes-dictionary-lookup-fast?
 
 
 Thank you so much for the help,
 
 Harpreet

RE: v_snomed_fword_lookup view

2014-08-08 Thread Finan, Sean

Hi Clayton,

I don't know how the ytex dictionary lookup works, so I'm afraid that I can't 
help you with an answer.  Maybe Vijay is the best person to do this.  If you 
aren't tied to ytex you could try the new cTakes dictionary-lookup-fast.  I 
tested "Patient came in with a malar rash" and it found "malar" and "malar 
rash".
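
As a rough illustration of first-word dictionary lookup in general (an assumption about the technique, not the actual code of either the ytex or the dictionary-lookup-fast module), terms can be indexed by their first word and then matched token by token. All names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class FirstWordLookup {
    // first word of a term -> all terms (as word arrays) starting with it
    private final Map<String, List<String[]>> termsByFirstWord = new HashMap<>();

    void addTerm(String term) {
        String[] words = term.toLowerCase(Locale.ROOT).trim().split("\\s+");
        termsByFirstWord.computeIfAbsent(words[0], k -> new ArrayList<>()).add(words);
    }

    /** Returns every dictionary term that occurs in the sentence, in text order. */
    List<String> find(String sentence) {
        String[] tokens = sentence.toLowerCase(Locale.ROOT).split("\\W+");
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            List<String[]> candidates =
                    termsByFirstWord.getOrDefault(tokens[i], Collections.emptyList());
            for (String[] term : candidates) {
                if (matchesAt(tokens, i, term)) {
                    hits.add(String.join(" ", term));
                }
            }
        }
        return hits;
    }

    private static boolean matchesAt(String[] tokens, int start, String[] term) {
        if (start + term.length > tokens.length) {
            return false;
        }
        for (int j = 0; j < term.length; j++) {
            if (!tokens[start + j].equals(term[j])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FirstWordLookup dictionary = new FirstWordLookup();
        dictionary.addTerm("malar");
        dictionary.addTerm("malar rash");
        System.out.println(dictionary.find("Patient came in with a malar rash"));
    }
}
```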

Vijay,

At some point the lookup-fast module will be the default for the cTakes 
clinical pipeline.  In order to synchronize the ytex lookup with cTakes, would 
you like to eventually work together on reusing the same code for ytex?  I have 
no idea what ytex does, but I know the ins and outs of the cdl-fast module.

Sean

 -Original Message-
 From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf Of
 Clayton Turner
 Sent: Friday, August 08, 2014 2:08 PM
 To: dev@ctakes.apache.org
 Subject: v_snomed_fword_lookup view
 
 Hi Everyone:
 
 I have a question about how the v_snomed_fword_lookup view works when
 running the CPE.
 
 So my understanding of the view is that it is a view comprised of the
 ytex.umls_aui_fword table, the umls.mrconso table and bits/pieces from
 other umls tables.
 
 I feel like this is not completely correct or my idea of how the join to
 create the view works is off. For example, let's say I want the CPE to find
 "malar" (e.g. "malar rash") as a concept in the annotations. It never
 happens after running my CPE descriptor and I cannot find it in my
 v_snomed_fword_lookup view.
 
 select count(*) from umls_aui_fword where fword='malar'; yields 34 results
 
 select count(*) from umls.mrconso where str='malar'; yields 3 results.
 
 So clearly these two tables know what the cui and context(s) are for
 "malar". Yet, whenever I run a gold standard set of notes through the CPE,
 "malar" is constantly flagged as just a word token and the concept is never
 grabbed. This is recurrent for lots of other concepts as well; I just
 wanted to use an example to illustrate my issue.
 
 Some troubleshooting I already went through:
 1) Reinstalled ytex and umls database objects
 2) Reinstalled a second time after redownloading umls through
 metamorphosys, ensuring that snomed vocabularies were included (also
 checked file sizes and noticed a big difference so I know those
 vocabularies ARE included)
 
 Anyone got any ideas as to what the issue could be?
 
 Thank you,
 Clayton Turner

RE: v_snomed_fword_lookup view

2014-08-11 Thread Finan, Sean

Thanks Harpreet,
That is definitely necessary to build!

Those lines should already be in the pom, but commented out.  I think that some 
version/branching issues may have arisen at some point wrt this module ...

If somebody beats me to it then cheers, otherwise I will try to check out 
tonight and get all the bits in place.

Sean

 -Original Message-
 From: Harpreet Khanduja [mailto:hsk5...@rit.edu]
 Sent: Monday, August 11, 2014 1:12 PM
 To: dev@ctakes.apache.org
 Subject: Re: v_snomed_fword_lookup view
 
 Hello Clayton,
   I do not know about ytex, but I did switch from dictionary-lookup to
 dictionary-lookup-fast.
   I updated my ctakes-dictionary-lookup-fast project using maven.
   I think I used Team -> Update and switched to the latest revision
 available, and then I downloaded new 3.2 resources from the for umls. and
 then I added these resources to my ctakes-dictionary-lookup-fast resources
 folder and also the classpath in ctakes-clinical-pipeline.
 
  Then I changed the pom.xml file which belongs to the whole ctakes project and
 added these two dependencies to the file:
 
 <dependency>
   <groupId>org.apache.ctakes</groupId>
   <artifactId>ctakes-dictionary-lookup-res</artifactId>
   <version>${ctakes.version}</version>
 </dependency>
 <dependency>
   <groupId>org.apache.ctakes</groupId>
   <artifactId>ctakes-dictionary-lookup-fast</artifactId>
   <version>${ctakes.version}</version>
 </dependency>
 
 After this, I also added the dependency
 
 <dependency>
   <groupId>org.apache.ctakes</groupId>
   <artifactId>ctakes-dictionary-lookup-fast</artifactId>
 </dependency>
 
 to the pom.xml of ctakes-clinical-pipeline.
 
 And then I added the resources folder in ctakes-clinical-pipeline using build
 path configuration under the add class option.
 
 After this it should work.
 
 
 Regards,
 Harpreet
 
 
 
 
 
 
 On Mon, Aug 11, 2014 at 12:44 PM, Clayton Turner caturn...@g.cofc.edu
 wrote:
 
  I still get the same error with the ctakes3.2 branch. Any suggestions?
 
 
  On Mon, Aug 11, 2014 at 12:06 PM, Clayton Turner
  caturn...@g.cofc.edu
  wrote:
 
   I'm going to do a clean install through the repo rather than the
   binaries and see if that fixes my issue because I think I just read
   a past post saying the lookup2 folders exist there.
  
  
   On Mon, Aug 11, 2014 at 11:52 AM, Clayton Turner
   caturn...@g.cofc.edu
   wrote:
  
   When navigating to ctakes-dictionary-lookup-fast\desc\analysis_engine
   there are 2 files, assumedly analysis engines.
  
   SnomedLookupAnnotator.xml and SnomedOvLookupAnnotator.xml
  
   If I pick either, I put in my UMLS information but receive an error
   when trying to run the CPE:
  
   Initialization of CAS Processor with name SnomedOvLookupAnnotator failed.
   CausedBy: org.apache.uima.resource.ResourceConfigurationException:
   Initialization of CAS processor with name SnomedOvLookupAnnotator failed.
   CausedBy: org.apache.uima.resource.ResourceInitializationException: Error
   initializing org.apache.uima.resource.impl.DataResource_impl from
   descriptor file:..SnomedLookupAnnotator.xml
   CausedBy: org.apache.uima.resource.ResourceInitializationException: Could
   not access the resource data at
   file:org\apache\ctakes\dictionary\lookup2\Snomed2011ab_ctakesTui\cTakesSnomed.xml
  
   Now, I don't even have a lookup2 folder and, subsequently, the Tui
   folder and cTakesSnomed.xml file. This seems to be the problem, but
   I'm not sure where these files are supposed to be grabbed from.
  
  
   On Mon, Aug 11, 2014 at 11:47 AM, Clayton Turner
   caturn...@g.cofc.edu
   wrote:
  
   Hi again:
  
   How exactly do you switch to using the cTakes dictionary-lookup-fast?  Do
   I need to go in and alter xml files or is it as simple as adding a certain
   item to the list of analysis engines?
  
  

Youtube Channel Apache cTakes

2014-08-12 Thread Finan, Sean

cTakes now has a youtube channel named "Apache cTakes".  It is empty, but if 
you have ever made a training video, presentation on a component (descriptors, 
type system, etc.), or demo of integration with another system (UimaFit, 
Uima-AS, etc.) then please feel free to post on that channel.  When there is 
content the Apache pages can have a link to the channel.

Sean

RE: v_snomed_fword_lookup view

2014-08-13 Thread Finan, Sean

"is the purpose of a CasConsumer to essentially save your data"

Correct, though it is a generic (and archaic) term indicating any end-user of 
the cas.  

 -Original Message-
 From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf Of
 Clayton Turner
 Sent: Wednesday, August 13, 2014 2:10 PM
 To: dev@ctakes.apache.org
 Subject: Re: v_snomed_fword_lookup view

 Oh okay, so is the purpose of a CasConsumer to essentially save your data in a
 representation that you can do some kind of data mining or classification on
 it?  If so, then I think I need to look into making/using one of those.

 On Wed, Aug 13, 2014 at 1:41 PM, Finan, Sean 
 sean.fi...@childrens.harvard.edu wrote:

  Hi Clayton,

  I'm glad that you got it working.  Though I stated that I would, I
  haven't yet checked the fidelity of trunk.  Urgent data request one
  day, must have writing the next ... and I still live with the
  delusion that I left academia to have free time ...

  I have never used ytex or weka, so I'm unfamiliar with all things .arff.
  Could it be that the ytex .arff exporter needs to change consumed
  cTakes annotation classes (3.1)?

  I have a custom CasConsumer that saves text spans and Cuis to file in
  a simple list, and that is what I used for the performance analysis of
  the lookup module.  For our other projects here in Beantown we have
  other various outputs that fit the job at hand: text flat files, xml
  files, sql database tables, knot-encoded lace doilies, etc.

  I'm sure that none of the above helps you, but I felt obliged to
  provide some kind of answer to your question.

  Sean

   -Original Message-
   From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf
   Of Clayton Turner
   Sent: Wednesday, August 13, 2014 12:25 PM
   To: dev@ctakes.apache.org
   Subject: Re: v_snomed_fword_lookup view

   Okay, I believe I have ctakes dictionary fast working now. Something I'm
   curious about, though, is how you extract the data in order to conduct
   analysis.

   I've, in the past, been using the SparseDataExporterImpl from ytex in
   order to create a .arff file for use in weka, but the ctakes pipeline I'm
   using doesn't seem to be compatible with this ytex exporting as I'm not
   getting any cuis in my arff file.

   I'm using the aggregate plain text umls processor analysis engine from
   ctakes and then using the dbconsumer analysis engine from ytex (for
   storing into the database with regard to analysis batch).

   Any tips for exporting or some simple issue I'm missing?

   Thanks,
   Clayton

   On Mon, Aug 11, 2014 at 2:09 PM, Harpreet Khanduja hsk5...@rit.edu
   wrote:

Yes, absolutely and
no problem at all.

Regards,
Harpreet


RE: v_snomed_fword_lookup view

2014-08-13 Thread Finan, Sean

You can find example Cas Consumers in cTakes-core ..[dirPath]../cc/

 -Original Message-
 From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
 Sent: Wednesday, August 13, 2014 2:20 PM
 To: dev@ctakes.apache.org
 Subject: Re: v_snomed_fword_lookup view

 There's nothing conceptually special about the consumer model vs.
 regular annotators (Analysis Engines). You can write an output format from any
 analysis engine as long as it is after the annotations you need in the
 pipeline. If you have global constraints (like in an ARFF file I think you
 need to know all the CUIs in your corpus to write the attribute list?), then
 it is important to use the process() method [called once per document] to
 store CUIs in a non-UIMA class variable (for example, a map from file id to a
 list/set/multiset of CUIs), and then use the collectionProcessComplete()
 method [called once after all documents have been processed] to do the actual
 writing of the file.
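
A minimal sketch of that collect-then-write pattern as a standalone class. The two method names mirror the lifecycle methods mentioned above, but this class does not extend any UIMA type and its signatures are assumptions for illustration; in a real consumer the CUIs would be read from the CAS inside process(). The ARFF layout is a dense one with a {0,1} attribute per CUI, chosen here only for brevity.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;
import java.util.TreeMap;
import java.util.TreeSet;

final class CuiArffCollector {
    // document id -> CUIs seen in it; sorted collections keep the output stable.
    private final Map<String, Set<String>> cuisByDocument = new TreeMap<>();

    /** Called once per document: only remember the CUIs, write nothing yet. */
    void process(String documentId, Collection<String> cuis) {
        cuisByDocument.computeIfAbsent(documentId, k -> new TreeSet<>()).addAll(cuis);
    }

    /** Called once after all documents: the full attribute list is now known. */
    String collectionProcessComplete() {
        Set<String> allCuis = new TreeSet<>();
        for (Set<String> cuis : cuisByDocument.values()) {
            allCuis.addAll(cuis);
        }
        StringBuilder arff = new StringBuilder("@relation cuis\n");
        for (String cui : allCuis) {
            arff.append("@attribute ").append(cui).append(" {0,1}\n");
        }
        arff.append("@data\n");
        for (Set<String> documentCuis : cuisByDocument.values()) {
            StringJoiner row = new StringJoiner(",");
            for (String cui : allCuis) {
                row.add(documentCuis.contains(cui) ? "1" : "0");
            }
            arff.append(row).append('\n');
        }
        return arff.toString();
    }

    public static void main(String[] args) {
        CuiArffCollector collector = new CuiArffCollector();
        // Placeholder document ids and CUIs for illustration only.
        collector.process("doc1", Arrays.asList("C0000001", "C0000002"));
        collector.process("doc2", Arrays.asList("C0000002"));
        System.out.print(collector.collectionProcessComplete());
    }
}
```

Deferring all writing to the final call is what allows the attribute header to list CUIs that first appear in later documents.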

 Hope that is useful, sorry I couldn't tie it in to your previous YTEX
 exporter but I'm not familiar with that process.

 Tim

 On 08/13/2014 02:11 PM, Clayton Turner wrote:
  Oh okay, so is the purpose of a CasConsumer to essentially save your
  data in a representation that you can do some kind of data mining or
  classification on it?  If so, then I think I need to look into
  making/using one of those.

  On Wed, Aug 13, 2014 at 1:41 PM, Finan, Sean 
  sean.fi...@childrens.harvard.edu wrote:

  Hi Clayton,

  I'm glad that you got it working.  Though I stated that I would, I
  haven't yet checked the fidelity of trunk.  Urgent data request one
  day, must have writing the next ... and I still live with the
  delusion that I left academia to have free time ...

  I have never used ytex or weka, so I'm unfamiliar with all things .arff .
   Could it be that the ytex .arff exporter needs to change consumed
  cTakes annotation classes (3.1)?

  I have a custom CasConsumer that saves text spans and Cuis to file in
  a simple list, and that is what I used for the performance analysis
  of the lookup module.  For our other projects here in Beantown we
  have other various outputs that fit the job at hand: text flat files,
  xml files, sql database tables, knot-encoded lace doilies, etc.

  I'm sure that none of the above helps you, but I felt obliged to
  provide some kind of answer to your question.

  Sean

  -Original Message-
  From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf
  Of Clayton Turner
  Sent: Wednesday, August 13, 2014 12:25 PM
  To: dev@ctakes.apache.org
  Subject: Re: v_snomed_fword_lookup view

  Okay, I believe I have ctakes dictionary fast working now. Something I'm
  curious about, though, is how you extract the data in order to conduct
  analysis.

  I've, in the past, been using the SparseDataExporterImpl from ytex in order
  to create a .arff file for use in weka, but the ctakes pipeline I'm using
  doesn't seem to be compatible with this ytex exporting as I'm not getting
  any cuis in my arff file.

  I'm using the aggregate plain text umls processor analysis engine from
  ctakes and then using the dbconsumer analysis engine from ytex (for storing
  into the database with regard to analysis batch).

  Any tips for exporting or some simple issue I'm missing?

  Thanks,
  Clayton

  On Mon, Aug 11, 2014 at 2:09 PM, Harpreet Khanduja hsk5...@rit.edu
  wrote:

  Yes, absolutely and
  no problem at all.

  Regards,
  Harpreet

  On Mon, Aug 11, 2014 at 1:16 PM, Finan, Sean 
  sean.fi...@childrens.harvard.edu wrote:

  Thanks Harpreet,
  That is definitely necessary to build!

  Those lines should already be in the pom, but commented out.  I think that
  some version/branching issues may have arisen at some point wrt this module
  ...

  If somebody beats me to it then cheers, otherwise I will try to
  check out tonight and get all the bits in place.

  Sean

  -Original Message-
  From: Harpreet Khanduja [mailto:hsk5...@rit.edu]
  Sent: Monday, August 11, 2014 1:12 PM
  To: dev@ctakes.apache.org
  Subject: Re: v_snomed_fword_lookup view

  Hello Clayton,
  I do not know about ytex, but I did switch from dictionary-lookup to
  dictionary-lookup-fast.
  I update my ctakes-dictionary-lookup-fast project using maven.
  I think I used Team -> Update and switched to the latest revision available
  and then I downloaded new 3.2 resources from the for umls. and then I added
  these resources to my ctakes-dictionary-lookup-fast resources folder and
  also the classpath in ctakes-clinical-pipeline.

  Then I changed the pom.xml file which belongs to the whole ctakes project
  and added
  <dependency>
    <groupId>org.apache.ctakes</groupId>
    <artifactId>ctakes-dictionary-lookup-res</artifactId>
    <version>${ctakes.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.ctakes</groupId>
    <artifactId>ctakes

RE: Web server

2014-08-21 Thread Finan, Sean

Hi John,
Have you (or another) thought about modifying the Uima Simple Server to run a 
cTakes pipeline?
http://uima.apache.org/sandbox.html#simple-server

 -Original Message-
 From: John Green [mailto:john.travis.gr...@gmail.com]
 Sent: Thursday, August 21, 2014 3:06 PM
 To: dev@ctakes.apache.org
 Subject: Web server

 I'm trying to deploy the cTakes web-server code someone already wrote (who
 wrote it btw?). I'm running into deployment issues in eclipse with tomcat 7
 on mac... I can get into details but for now: is it in a working state? I'm
 learning as I go and it looks in order and the code is solid...

 Also, Pei: did they check in an LVG version that is thread safe now?

 I'm really set on getting cTakes into a fluid RESTful interface.

 JG

RE: Ctakes to process 5000K recoreds

2014-09-09 Thread Finan, Sean

Hi Nick,

I think that the bottleneck is probably the lookup module itself.  So, I just 
sent you a secure email/ftp link.  It contains a build of the new 
dictionary-lookup-fast module.  Should you choose to try it, let me know how 
things turn out.

Sean

From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:10 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Thanks, let me try it.
Nick

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 4:08 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

If you just need the medication names, you can remove these:
 <node>ContextDependentTokenizerAnnotator</node>
 <node>DependencyParser</node>
 <node>AssertionAnnotator</node>

You might be able to get rid of the LvgAnnotator and still get decent results 
since variations of word form should not affect medication names. I would try 
with it and without it on a smaller set of files and see if you see a 
difference.

I believe the others are needed by the default configs for medication lookup. 
For example, POS is used to get phrase type. Phrases are used to remove verb 
phrases from the lookup and also therefore to keep the lookup windows from 
getting too big.  I'm more familiar with the other types of named entities 
(diseases, symptoms, etc) than with medications.

-Original Message-
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 3:01 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

James,

Do you have any suggestion about running cTakes with minimum annotators that 
can return Medications in DictionaryLookupAnnotator?
Thanks,
Nick

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 3:05 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

I suspect that when you take out the simple segment annotator, nothing is getting 
processed, and that is why it appears so fast. At least some of the annotators 
loop through the list of sections/segments, which is why there is a simple 
segment annotator - so that there is at least one section/segment identified. 
Are you getting any annotations at all?

-Original Message-
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 2:02 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Pei,
I need the names of the medications for an application that I wrote that uses 
cTAKES, so I cache the medications in DictionaryLookupAnnotator (in 
performLookup()) and use them in my program. But when I have 
SimpleSegmentAnnotator in the pipeline it just takes forever. After taking 
SimpleSegmentAnnotator out, no medication name is returned from 
DictionaryLookupAnnotator in the code. So I was wondering if there is a way 
that I could eliminate SimpleSegmentAnnotator but still be able to get the 
medication names in that class?

Nick

-Original Message-
From: Pei Chen [mailto:chen...@apache.org]
Sent: Tuesday, September 09, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: Ctakes to process 5000K recoreds

Nick,
When you say no medication is being annotated, I presume you mean the 
medication attributes (e.g. dosage, frequency) are not being annotated?  
I think the DrugNER needs a list of section names in the config; I think it 
includes SIMPLE_SEGMENT.  I am very surprised that SimpleSegmentAnnotator is 
the bottleneck though; all it does is assume the entire document is a single 
section called SIMPLE_SEGMENT.
Have you tried commenting out the DependencyParser if you're not using those 
features?

--Pei


On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish snika...@emerginghealthit.com 
wrote:

 Hi there,

 I am using Ctakes to process 5000K free text  records  where each record has 
 several medications.
 This is the fixed flow that it goes through:


 <node>SimpleSegmentAnnotator</node>
 <node>SentenceDetectorAnnotator</node>
 <node>TokenizerAnnotator</node>
 <node>LvgAnnotator</node>
 <node>ContextDependentTokenizerAnnotator</node>
 <node>POSTagger</node>
 <node>Chunker</node>
 <node>LookupWindowAnnotator</node>
 <node>DictionaryLookupAnnotatorDB</node>

RE: Ctakes to process 5000K recoreds

2014-09-09 Thread Finan, Sean

Just use it with cTakes.  Instead of removing other modules from the pipeline, 
replace the dictionary-lookup with dictionary-lookup-fast.

For the 
desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml, 
you would modify:

<delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
  <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
</delegateAnalysisEngine>

To be:

<delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
  <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
</delegateAnalysisEngine>


That should be it.  You can then leave the rest of the module specifications 
alone.

Sean


From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:32 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Sean,

Many thanks, I will try it tomorrow. Do you have any special instructions for 
running that script, or do I have to use it with cTakes?

Thanks,
Nick

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 4:24 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Nick,

I think that the bottleneck is probably the lookup module itself.  So, I just 
sent you a secure email/ftp link.  It contains a build of the new 
dictionary-lookup-fast module.  Should you choose to try it, let me know how 
things turn out.

Sean

From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:10 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Thanks, let me try it.
Nick

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 4:08 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

If you just need the medication names, you can remove these:
 <node>ContextDependentTokenizerAnnotator</node>
 <node>DependencyParser</node>
 <node>AssertionAnnotator</node>

You might be able to get rid of the LvgAnnotator and still get decent results 
since variations of word form should not affect medication names. I would try 
with it and without it on a smaller set of files and see if you see a 
difference.

I believe the others are needed by the default configs for medication lookup. 
For example, POS is used to get phrase type. Phrases are used to remove verb 
phrases from the lookup and also therefore to keep the lookup windows from 
getting too big.  I'm more familiar with the other types of named entities 
(diseases, symptoms, etc) than with medications.

-Original Message-
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 3:01 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

James,

Do you have any suggestion about running cTakes with minimum annotators that 
can return Medications in DictionaryLookupAnnotator?
Thanks,
Nick

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 3:05 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

I suspect that when you take out the simple segment annotator, nothing is getting 
processed, and that is why it appears so fast. At least some of the annotators 
loop through the list of sections/segments, which is why there is a simple 
segment annotator - so that there is at least one section/segment identified. 
Are you getting any annotations at all?

-Original Message-
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 2:02 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Pei,
I need the names of the medications for an application that I wrote that uses 
cTAKES, so I cache the medications in DictionaryLookupAnnotator (in 
performLookup()) and use them in my program. But when I have 
SimpleSegmentAnnotator in the pipeline it just takes forever. After taking 
SimpleSegmentAnnotator out, no medication name is returned from 
DictionaryLookupAnnotator in the code. So I was wondering if there is a way 
that I could eliminate SimpleSegmentAnnotator but still be able to get the 
medication names in that class?

Nick

-Original Message-
From: Pei Chen [mailto:chen...@apache.org]
Sent: Tuesday, September 09, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: Ctakes to process 5000K recoreds

Nick,
When you say no medication is being annotated, I presume you mean the 
medication attributes (e.g. dosage, frequency) are not being annotated?  
I think the DrugNER needs a list of section names in the config; I think it 
includes SIMPLE_SEGMENT.  I am very surprised that SimpleSegmentAnnotator is 
the bottleneck though; all it does is assume the entire document is a single

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean

Steve Bethard wrote:
 I spent some time writing a script for diff-ing CASes

I urge anyone interested in comparing cTakes CASes / output to use this type of 
approach.  Comparison of program output is a post-process task and, unless 
absolutely necessary, code to juggle data and metadata belongs there.  
Attempting to force every module past, present and future to abide by fixed 
orderings, enumerations etc. is not as simple a task as one might initially 
think - especially if third-party libraries are involved.  I won't get into 
problems associated with why one is comparing output (swapped module?) and IDs, 
orders etc. being different because of a possibly intentional difference.

In addition to or instead of creating a post-processing script, one could write 
a new cas-consumer that writes output in a desired format - but this should 
not require changes to engines.
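
As a rough illustration of the post-process comparison suggested above, the sketch below normalizes two runs by dropping the run-specific id and sorting on span and type, then reports what one run has that the other lacks. It is stdlib-only and is not CompareFeatureStructures.java or any cTAKES utility; the `Ann` holder, the key layout and the sample values are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch of an id-insensitive, order-insensitive comparison of two runs. */
final class RunComparer {

    /** Hypothetical flattened annotation: type, span and the id assigned in that run. */
    static final class Ann {
        final String type;
        final int begin;
        final int end;
        final int runId;

        Ann(String type, int begin, int end, int runId) {
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.runId = runId;
        }
    }

    /** Builds sortable keys from span and type only; the run-specific id is ignored. */
    static List<String> normalize(List<Ann> run) {
        List<String> keys = new ArrayList<>();
        for (Ann a : run) {
            keys.add(String.format("%08d:%08d:%s", a.begin, a.end, a.type));
        }
        Collections.sort(keys);
        return keys;
    }

    /** Keys present in 'left' but missing from 'right' (multiset difference). */
    static List<String> missingFrom(List<String> left, List<String> right) {
        List<String> remaining = new ArrayList<>(right);
        List<String> missing = new ArrayList<>();
        for (String key : left) {
            if (!remaining.remove(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample data: same first mention with different ids, second mention lost.
        List<Ann> run1 = Arrays.asList(
                new Ann("MedicationMention", 30, 40, 9937),
                new Ann("MedicationMention", 10, 16, 9895));
        List<Ann> run2 = Arrays.asList(
                new Ann("MedicationMention", 10, 16, 15928));
        System.out.println(missingFrom(normalize(run1), normalize(run2)));
    }
}
```

Because ids and ordering never enter the key, differences that remain are differences in the annotations themselves, which are the ones worth investigating.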

If it ain't broke, don't fix it

Sean


-Original Message-
From: Steven Bethard [mailto:steven.beth...@gmail.com] 
Sent: Monday, October 06, 2014 11:23 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
bruce.tiet...@perfectsearchcorp.com wrote:
 Since I started working with cTakes some time ago, I have found it
 difficult to compare the output between subsequent runs on the same files
 because annotations are often assigned different IDs, are listed in
 different order, etc.

At one point, I spent some time writing a script for diff-ing CASes
that intended to address some of these kinds of issues. It's still
here in cTAKES:

ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java

You might see if you could use or adapt that to your needs.

Steve

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean

Hi Kim,

One might want to compare the Sentence detector that uses end of line characters 
as sentence splitters with one that does not.  Such a change in sentence 
splitting would not only affect the sentence type discoveries but also 
practically every type that follows.

Another might want to compare a note with "skin cancer" vs. one in which you 
replace "skin cancer" with "melanoma" just to see what the CUI differences 
might be.  There are changes in two words vs. one, 11 characters vs. 8, a 
removed adjective(?), and of course changes in CUIs.

Of course, if you are just running notes on a new moon and then again on a full 
moon ...

Sean

-Original Message-
From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 10:41 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

Sean,

...being different because of a possibly intentional difference.

I would like you to elaborate a bit on what would be intentionally 
different between the processing of the same document multiple times. It would 
help my understanding of cTakes.

Thanks,

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 07:30 AM, Finan, Sean wrote:
 Steve Bethard wrote:
 I spent some time writing a script for diff-ing CASes
 I urge anyone interested in comparing cTakes CASes / output to use this type 
 of approach.  Comparison of program output is a post-process task and, unless 
 absolutely necessary, code to juggle data and metadata belongs there.  
 Attempting to force every module past, present and future to abide by fixed 
 orderings, enumerations etc. is not as simple a task as one might initially 
 think - especially if third-party libraries are involved.  I won't get into 
 problems associated with why one is comparing output (swapped module?) and 
 IDs, orders etc. being different because of a possibly intentional difference.

 In addition to or instead of creating a post-processing script, one could 
 write a new cas-consumer that writes output in a desired format - but this 
 should not require changes to engines.

 If it ain't broke, don't fix it

 Sean


 -Original Message-
 From: Steven Bethard [mailto:steven.beth...@gmail.com]
 Sent: Monday, October 06, 2014 11:23 PM
 To: dev@ctakes.apache.org
 Subject: Re: cTakes output predictability

 On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
 bruce.tiet...@perfectsearchcorp.com wrote:
 Since I started working with cTakes some time ago, I have found it 
 difficult to compare the output between subsequent runs on the same 
 files because annotations are often assigned different IDs, are 
 listed in different order, etc.
 At one point, I spent some time writing a script for diff-ing CASes 
 that intended to address some of these kinds of issues. It's still 
 here in cTAKES:

 ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java

 You might see if you could use or adapt that to your needs.

 Steve

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean

Hi Kim,

 It concerns me a bit by making the code return consistent results would be so 
 concerning. 
Could you please clarify what you mean by consistent results?  Do you mean 
ordering and IDs or are you talking about actual type values not matching?

This should be the default mode of operation.
Depending upon what you meant above, I may agree or disagree.

 Since it doesn't appear that there are any consequences with moving forward 
 with changing the code
Why do you say this?  

I think that there may be more required changes than you realize.  Every 
insertion into the CAS must be of ordered data.  This means that, for instance, 
named entities discovered by dictionary will need to be inserted in some 
predictable order, such as by alphabetized cui per every alphabetized tui (and 
other code) per ordered text span.  You will need to check and recheck every 
point at which the CAS is modified by every module.  Right now there are at 
least three or four places in two cTakes dictionary modules where a change 
would be required - and that doesn't include YTEX lookup.
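
The kind of predictable insertion order described above (ordered text span, then alphabetized tui, then alphabetized cui) can be sketched with a plain comparator. This is only an illustration under assumptions: `Concept` is a hypothetical holder rather than a cTAKES type, and the TUI and CUI strings are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sketch: sort discovered concepts before they are inserted anywhere. */
final class ConceptOrdering {

    /** Hypothetical holder for one dictionary hit; not a cTAKES class. */
    static final class Concept {
        final int begin;
        final int end;
        final String tui;
        final String cui;

        Concept(int begin, int end, String tui, String cui) {
            this.begin = begin;
            this.end = end;
            this.tui = tui;
            this.cui = cui;
        }
    }

    /** Orders by span begin, span end, then tui, then cui. */
    static final Comparator<Concept> BY_SPAN_TUI_CUI =
            Comparator.<Concept>comparingInt(c -> c.begin)
                    .thenComparingInt(c -> c.end)
                    .thenComparing(c -> c.tui)
                    .thenComparing(c -> c.cui);

    /** Returns a sorted copy, leaving the caller's collection untouched. */
    static List<Concept> sorted(List<Concept> hits) {
        List<Concept> copy = new ArrayList<>(hits);
        copy.sort(BY_SPAN_TUI_CUI);
        return copy;
    }

    public static void main(String[] args) {
        List<Concept> hits = Arrays.asList(
                new Concept(5, 10, "T047", "C0000002"),
                new Concept(0, 4, "T121", "C0000009"),
                new Concept(5, 10, "T047", "C0000001"),
                new Concept(5, 10, "T033", "C0000005"));
        for (Concept c : sorted(hits)) {
            System.out.println(c.begin + "," + c.end + " " + c.tui + " " + c.cui);
        }
    }
}
```

Sorting is the easy part; the cost raised in the message is that a sort like this would have to be applied at every point where any module writes to the CAS.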

If you really feel strongly about this and are going to change cTakes code, 
then I suggest (at the risk of sounding like a complete jerk) that you also 
consider the following:
1.  Don't check anything into trunk until all is well with your changes and 
    tests
      Just in case you abandon the effort
2.  Write unit tests for every change
      True, Map to LinkedMap shouldn't break anything, but they are good to 
      have, and may prevent others in the future from switching back to a 
      non-linked map or any unordered collection (set not list, etc.).  It 
      also makes a better place for explanation in Javadoc than inlines above 
      the code.
3.  Run memory requirement tests before all of your changes and then again 
    after your changes
      I'm actually curious about how much memory might be eaten with linkages 
      everywhere
4.  Run performance (speed) tests before and after
      On a large corpus to ensure that garbage collection is involved
5.  Do the above with every combination possible in current workflows: every 
    combination of available sentence detector, pos tagger, smoking status 
    detector, dictionary lookup, cas consumer, etc.
      As soon as somebody says all output is consistently ordered between 
      runs it had better be so for every possible workflow
6.  Write system tests to ensure ordered/predicted outputs with each 
    combination
      Otherwise somebody may break it
7.  Document the what, how, and why for future development
      Otherwise somebody won't know to stick to the new rules
8.  Assist anybody as needed that in the future breaks one of these unit or 
    system tests with a fix or new feature
      By mandating such a rule you are assuming responsibility for it
9.  Assist anybody as needed that in the future adds a new module or workflow 
    to cTakes to abide by the ordering requirement
      By mandating such a rule you are assuming responsibility for it
10. Assist anybody as needed that in the future adds a new module or workflow 
    to add system tests to ensure maintenance of the ordering requirement
      By mandating such a rule you are assuming responsibility for it
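
For the unit-test item in the list above, a regression check on ordering might look something like the following. It is a plain-Java sketch without a test framework, assuming that "LinkedMap" refers to java.util.LinkedHashMap; `groupBySpan` and the sample data are hypothetical and are not taken from cTAKES.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of an ordering regression check for a map-backed collection step. */
final class OrderedHitsCheck {

    /** Groups CUIs by span key, keeping the first-seen order of the keys. */
    static Map<String, List<String>> groupBySpan(String[][] spanCuiPairs) {
        // LinkedHashMap iterates in insertion order; a plain HashMap makes no such promise.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] pair : spanCuiPairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    /** The check itself: key iteration order must match first-seen order. */
    static boolean keepsInsertionOrder() {
        String[][] input = {
            {"20,30", "C0000002"},
            {"0,4", "C0000001"},
            {"20,30", "C0000003"},
        };
        List<String> keys = new ArrayList<>(groupBySpan(input).keySet());
        return keys.equals(Arrays.asList("20,30", "0,4"));
    }

    public static void main(String[] args) {
        System.out.println("insertion order kept: " + keepsInsertionOrder());
    }
}
```

One caveat on the design: a check like this only fails after a switch back to an unordered map when the hash order happens to differ from insertion order for the chosen keys, so a real test would want enough keys to make a coincidental match unlikely.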


-Original Message-
From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 11:57 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

I think we may really prefer the first method. Since it doesn't appear that 
there are any consequences with moving forward with changing the code, we would 
really like to move forward with this approach.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:35 AM, britt fitch wrote:
 The option Sean mentioned of writing your own custom consumer (without 
 the UIMA id that is causing your issues) should meet these needs I 
 believe.



 Britt Fitch
 Wired Informatics
 265 Franklin St Ste 1702
 Boston, MA 02110
 http://wiredinformatics.com
 britt.fi...@wiredinformatics.com

 On Oct 7, 2014, at 11:29 AM, Kim Ebert 
 kim.eb...@perfectsearchcorp.com wrote:

 Hi Sean,

 Well of course that makes plenty of sense. Testing different cTakes 
 configurations you would expect different output. In our testing 
 we've found several cases where running with the same configuration 
 outputs different data under different moons. Having consistent 
 results helps us know if we've made improvements to our quality or 
 not. Having output that is in a predictable order makes checking to 
 see if there are differences much cheaper when you are dealing with larger 
 data sets.

 Kim Ebert
 1.801.669.7342
 Perfect Search Corp
 http://www.perfectsearchcorp.com/

 On 10/07/2014 08:50 AM, Finan, Sean wrote:
 Hi Kim,

 One might want to compare the Sentence detector that uses end of line 
 characters as sentence splitters with one that does not.  Such a 
 change in sentence splitting would not only affect the sentence type 
 discoveries but also

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean

, Kim Ebert kim.eb...@perfectsearchcorp.com
wrote:

 Hi Sean,

 No, you're not a jerk. These are things worth considering, and I 
 understand your concerns with touching various points of the codebase.

 I'll talk with our group over here and see where we want to go. We are 
 really interested in cTakes behaving well, so we are usually pretty 
 careful in testing our changes before committing anything.

 Thanks,

 Kim Ebert
 1.801.669.7342
 Perfect Search Corp
 http://www.perfectsearchcorp.com/

 On 10/07/2014 10:46 AM, Finan, Sean wrote:
  Hi Kim,
 
  It concerns me a bit by making the code return consistent results 
  would
 be so concerning.
  Could you please clarify what you mean by consistent results?  Do 
  you
 mean ordering and IDs or are you talking about actual type values not 
 matching?
 
  This should be the default mode of operation.
  Depending upon what you meant above, I may agree or disagree.
 
  Since it doesn't appear that there are any consequences with moving
 forward with changing the code
  Why do you say this?
 
  I think that there may be more required changes than you realize.  
  Every
 insertion into the CAS must be of ordered data.  This means that, for 
 instance, named entities discovered by dictionary will need to be 
 inserted in some predictable order, such as by alphabetized cui per 
 every alphabetized tui (and other code) per ordered text span.  You 
 will need to check and recheck every point at which the CAS is 
 modified by every module.  Right now there are at least three or four 
 places in two cTakes dictionary modules where a change would be 
 required - and that doesn't include YTEX lookup.
 
  If you really feel strongly about this and are going to change 
  cTakes
 code, then I suggest (at the risk of sounding like a complete jerk) 
 that you also consider the following:
  1.  Don't check anything into trunk until all is well with your 
  changes
 and tests
  Just in case you abandon the effort
  2.  Write unit tests for every change True, Map to LinkedMap 
  shouldn't break anything, but they are good to
 have, and may prevent others in the future from switching back to a 
 non-linked map or any unordered collection (set not list, etc.).  It 
 also makes a better place for explanation in Javadoc than inlines above the 
 code.
  3.  Run memory requirement tests before all of your changes and then
 again after your changes
  I'm actually curious about how much memory might be eaten with 
  linkages
 everywhere
  4.  Run performance (speed) tests before and after On a large corpus 
  to ensure that garbage collection is involved 5.  Do the above with 
  every combination possible in current workflows:
 every combination of available sentence detector, pos tagger, smoking 
 status detector, dictionary lookup, cas consumer, etc.
  As soon as somebody says all output is consistently ordered between
 runs it had better be so for every possible workflow
  6.  Write system tests to ensure ordered/predicted outputs with each
 combination
  Otherwise somebody may break it
  7.  Document the what, how, and why for future development Otherwise 
  somebody won't know to stick to the new rules 8.  Assist anybody as 
  needed that in the future breaks one of these unit
 or system tests with a fix or new feature
  By mandating such a rule you are assuming responsibility for it 9.  
  Assist anybody as needed that in the future adds a new module or
 workflow to cTakes to abide by the ordering requirement
  By mandating such a rule you are assuming responsibility for it 10.  
  Assist anybody as needed that in the future adds a new module or
 workflow to add system tests to ensure maintenance of the ordering 
 requirement
  By mandating such a rule you are assuming responsibility for it
 
 
  -Original Message-
  From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
  Sent: Tuesday, October 07, 2014 11:57 AM
  To: dev@ctakes.apache.org
  Subject: Re: cTakes output predictability
 
  I think we may really prefer the first method. Since it doesn't 
  appear
 that there are any consequences with moving forward with changing the 
 code, we would really like to move forward with this approach.
 
  Kim Ebert
  1.801.669.7342
  Perfect Search Corp
  http://www.perfectsearchcorp.com/
 
  On 10/07/2014 09:35 AM, britt fitch wrote:
  The option Sean mentioned of writing your own custom consumer 
  (without the UIMA id that is causing your issues) should meet these 
  needs I believe.
 
 
 
  Britt Fitch
  Wired Informatics
  265 Franklin St Ste 1702
  Boston, MA 02110
  http://wiredinformatics.com
  britt.fi...@wiredinformatics.com
 
  On Oct 7, 2014, at 11:29 AM, Kim Ebert 
  kim.eb...@perfectsearchcorp.com wrote:
 
  Hi Sean,
 
  Well of course that makes plenty of sense. Testing different 
  cTakes configurations you would expect different output. In our 
  testing we've found several cases where running with the same 
  configuration

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean

);
}
}

This will at most return one item from the Set. Since the set is an unordered 
hash, this will result in one of three options being returned.
Is this a bug, or a design decision? Which one is right? Which one is wrong? It 
may be that this is a design decision, but it would be nice if we were 
consistently right, or consistently wrong. Many other instances of this result 
in similar issues.
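
If the intent really is to return a single item from the Set, one way to make that choice repeatable is to pick by value rather than by iteration order. The sketch below assumes the elements are comparable strings (for example CUIs); `pick` is a hypothetical helper and says nothing about which of the candidates is the *right* one, only that the same one is returned every time.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Sketch: a repeatable single choice from an unordered set. */
final class DeterministicPick {

    /** Returns the lexicographically smallest element, or null for an empty set. */
    static String pick(Set<String> candidates) {
        // Collections.min depends only on the values, not on hash iteration order.
        return candidates.isEmpty() ? null : Collections.min(candidates);
    }

    public static void main(String[] args) {
        // Placeholder values standing in for the three possible results.
        Set<String> candidates = new HashSet<>(Arrays.asList("C0000003", "C0000001", "C0000002"));
        System.out.println(pick(candidates));
    }
}
```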

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 12:43 PM, Finan, Sean wrote:
 I'm just about sapped on this topic.  What comes below is my final writing.

 Kim wrote:
 Yes, I mean actual type values not matching.
 Ok, this is a very serious problem and should have nothing to do with 
 ordering and/or IDs.  I repeat: this should have nothing to do with ordering 
 or ids.  Reordering or changing ID assignment, while possibly producing 
 repeatable output, will not necessarily fix the actual bug.  Please write a 
 Jira for each item, and (imo) we should think about withholding any 
 non-bug-fix release until they have been dealt with.

 Bruce wrote:
 I did not intend to step on anyone's toes.
 No worries - I don't think that any toes have been stepped upon. It is good 
 that questions and concerns are shared with the group.  

 Note that in the first instance, there were two MedicationMentions, but in 
 the second, there is only one.
 Assuming that the second drug mention doesn't appear elsewhere in output2 
 then this needs to be addressed.  Please log a tar.  Relating this to the 
 order/id issue, which number of mentions is correct (2)?  If you reorder will 
 that consistently output two medications instead of one or one medication 
 instead of two?  This is most likely a bug in the identification and/or 
 storage and/or retrieval code and needs to be fixed there.

 Yes, everyone could write their own custom compare code, but wouldn't it be 
 more valuable to the community to make that task easier?
 I would hope that a reusable Cas-Consumer that sorts and re-IDs annotations 
 could be started and people could add to it as needed.  I would also hope 
 that a reusable post-process comparison utility could be started and 
 improved/maintained.

 Sean


 -Original Message-
 From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
 Sent: Tuesday, October 07, 2014 1:21 PM
 To: dev@ctakes.apache.org
 Subject: Re: cTakes output predictability

 I did not intend to step on anyone's toes.

 One of the reasons I proposed the changes was to try to make it extremely 
 obvious when there are significant differences in output from the cTakes 
 pipeline when running the same document again, and once identified, make it 
 easier to identify the source of the difference.

 Because of the huge number of differences between the output using the 
 FileWriterCasConsumer.xml, first detecting that there are significant 
 differences and identifying them for a large set of documents is a daunting 
 task.

 The following is an example of some significant differences that I 
 have detected between two subsequent runs on the same document using 
 the current release of cTakes. (There are actually quite a few 
 documents that exhibit this kind of behavior. This is only one 
 example.)


 Snippet from first run:

 <org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
 _indexed="1" _id="9869" _ref_sofa="3" begin="3039" end="3047"/>
 <org.apache.ctakes.typesystem.type.textsem.MedicationMention
 _indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95"
 _ref_ontologyConceptArr="9891" typeID="1" segmentID="SIMPLE_SEGMENT"
 discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
 conditional="false" generic="true" subject="patient" historyOf="0"/>
 <org.apache.ctakes.typesystem.type.textsem.MedicationMention
 _indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110"
 _ref_ontologyConceptArr="9934" typeID="1" segmentID="SIMPLE_SEGMENT"
 discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
 conditional="false" generic="false" subject="patient" historyOf="0"/>
 <org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention
 _indexed="1" _id="9979" _ref_sofa="3" begin="0" end="4" id="0"
 _ref_ontologyConceptArr="9976" typeID="2" segmentID="SIMPLE_SEGMENT"
 discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
 conditional="false" generic="false" subject="patient" historyOf="0"/>


 Snippet from subsequent run:

 <org.apache.ctakes.typesystem.type.textsem.ProcedureMention
 _indexed="1" _id="15773" _ref_sofa="3" begin="2929" end="2933" id="125"
 _ref_ontologyConceptArr="15770" typeID="5" segmentID="SIMPLE_SEGMENT"
 discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
 conditional="false" generic="false" subject="patient" historyOf="0"/>
 <org.apache.ctakes.typesystem.type.textsem.MedicationMention
 _indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95"
 _ref_ontologyConceptArr="15924" typeID="1" segmentID="SIMPLE_SEGMENT"
 discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
 conditional="false" generic="true" subject="patient"

RE: Differences in MedicationMention annotations on subsequent processing runs

2014-10-08 Thread Finan, Sean

Hi Bruce,
I would venture to say that this is neither expected nor desired.

Before you fix it (or in addition to a fix), try to run with the new dictionary 
lookup.   It will have a different behavior, and it will be the default 
dictionary lookup in future releases of cTakes – making fixes to the current 
module slightly less urgent.

Sean

From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
Sent: Wednesday, October 08, 2014 11:38 AM
To: dev@ctakes.apache.org
Subject: Differences in MedicationMention annotations on subsequent processing 
runs


I have encountered a situation in which the cTakes clinical pipeline output 
differs between multiple runs on the same text with the same configuration.
The following snippets from a single document are sufficient to demonstrate the 
issue:

 a gentle curve going into. irrigated with Bacitracin.

The source of the difference is that the DictionaryLookupAnnotator uses a map 
to filter out duplicate annotations for a single document location:
// used to prevent duplicate hits
// key = hit begin,end key (java.lang.String)
// val = Set of MetaDataHit objects
private Map<String, Set<MetaDataHit>> iv_dupMap = new HashMap();

This map is shared between both the umls_ms_2011ab lookup and the 
umls_ms_2011an_rxnorm lookup.

If both dictionaries contain the same term, the order of dictionary lookup 
execution determines the output. If the rxnorm lookup runs first, then a 
MedicationMention annotation for Bacitracin appears in the final output. If the 
standard umls lookup runs first, then there is no MedicationMention annotation 
for Bacitracin.
I will attach the output from the subsequent runs. (Hopefully the attachment 
will make it through the system)

Is this expected behavior? If not, what would be the expected behavior?
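
To make the order dependence described above concrete, here is a minimal, self-contained sketch. It is not the cTakes implementation: the class SharedDupFilterSketch, the Hit type, and the type name OtherMention are hypothetical, the offsets are illustrative, and "the first hit at a span wins" is an assumed simplification of what a duplicate map shared across dictionaries does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not cTakes code: one duplicate filter shared by every dictionary.
class SharedDupFilterSketch {

    /** A dictionary hit: the span it covers and the annotation type its dictionary would create. */
    static final class Hit {
        final int begin;
        final int end;
        final String type;

        Hit(int begin, int end, String type) {
            this.begin = begin;
            this.end = end;
            this.type = type;
        }
    }

    /**
     * Applies a single "seen spans" set across all dictionaries, in the order they run.
     * Assumed simplification: the first hit at a given begin,end span wins.
     */
    static List<String> keptTypes(List<List<Hit>> dictionariesInRunOrder) {
        Set<String> seenSpans = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (List<Hit> dictionaryHits : dictionariesInRunOrder) {
            for (Hit hit : dictionaryHits) {
                String spanKey = hit.begin + "," + hit.end;
                if (seenSpans.add(spanKey)) {   // add() returns false if the span was already claimed
                    kept.add(hit.type);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Hit> rxnorm = Arrays.asList(new Hit(42, 52, "MedicationMention"));
        List<Hit> umls = Arrays.asList(new Hit(42, 52, "OtherMention"));

        System.out.println(keptTypes(Arrays.asList(rxnorm, umls)));   // rxnorm runs first
        System.out.println(keptTypes(Arrays.asList(umls, rxnorm)));   // umls runs first
    }
}
```

With the same two hits, the surviving type changes with the run order, which is the symptom reported here.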

IMAT Solutions http://imatsolutions.com
Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

RE: Differences in MedicationMention annotations on subsequent processing runs

2014-10-08 Thread Finan, Sean

Good point ...
I tried to check in to sourceforge but had problems.  I will try again right 
now ...

Building a custom dictionary is possible with the DictionaryTool in cTakes 
sandbox, but that is a different rabbit hole.

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Wednesday, October 08, 2014 11:52 AM
To: dev@ctakes.apache.org
Subject: Re: Differences in MedicationMention annotations on subsequent 
processing runs

If I understand correctly, I would need new dictionary resources to run the 
rare word lookup method.

Where can I find the necessary dictionary(ies) or how do I build them?


IMAT Solutions http://imatsolutions.com
Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

RE: Differences in MedicationMention annotations on subsequent processing runs

2014-10-08 Thread Finan, Sean

Hi Bruce,

With Pei's help I just updated the sourceforge repo with the cTakes 
dictionaries.  Check out the artifact ctakes-resources-snomed-rword-hsqldb-2011ab.

Sean

RE: Differences in MedicationMention annotations on subsequent processing runs

2014-10-09 Thread Finan, Sean

 DictionaryLookupAnnotator which is a container for the dictionaries and it 
 iterates through the list of lookup dictionaries

I am confused.  The new dictionary-lookup-fast has neither this class nor 
multiple dictionaries.  The umls and rxnorm are in the same database table and 
lookup is performed in one swoop.  Could you please send a copy of your 
pipeline xmls to me directly (instead of bombing the group) with something 
other than an .xml extension (they get blocked)?



From: Bruce Tietjen [bruce.tiet...@perfectsearchcorp.com]
Sent: Thursday, October 09, 2014 11:41 AM
To: dev@ctakes.apache.org
Subject: Re: Differences in MedicationMention annotations on subsequent 
processing runs

I tried the Dictionary-lookup-fast module and the behavior is the same. I did 
have to run it a number of times before the timing was right to reproduce the 
issue. With the older lookup, chances were about 50/50 between which dictionary 
ran first. Using the dictionary-fast, it seems more like 70/30, with the 
standard umls lookup being more likely to run first than not. This means that 
most of the time, there is no MedicationMention annotation for Bacitracin.  
(See Attached)

The code with the issue is the DictionaryLookupAnnotator which is a container 
for the dictionaries and it iterates through the list of lookup dictionaries so 
that part of the code path does not seem to have changed.

In the past, the rxNorm dictionary was a Lucene search, so I'm guessing it 
behaved a little differently than it does now with both being JDBC.

The fact that the filter is at this location seems to indicate that it may have 
been intended to apply across all dictionaries. On the other hand, it 
appears to mask out the lookups for the different dictionaries, resulting in 
some annotations not being made.

So, the real question is how should the filter work -- should the annotation 
filtering be per lookup dictionary, or be across all dictionaries? Or is there 
something wrong elsewhere that causes

I lean towards having the filter function per dictionary. This may risk having 
duplicate annotations, but that would probably be better than missing the 
annotation altogether.
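
The per-dictionary option can be sketched as follows. This is only an illustration of the trade-off, not cTakes code: the class PerDictionaryFilterSketch, the "begin,end:Type" string encoding of a hit, and the type name OtherMention are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not cTakes code: the duplicate filter is reset for each dictionary.
class PerDictionaryFilterSketch {

    /**
     * Each inner list holds one dictionary's hits as "begin,end:Type" strings (an illustrative encoding).
     * Duplicates are dropped only within a dictionary, never across dictionaries.
     */
    static List<String> keptHits(List<List<String>> dictionariesInRunOrder) {
        List<String> kept = new ArrayList<>();
        for (List<String> dictionaryHits : dictionariesInRunOrder) {
            Set<String> seenSpans = new HashSet<>();   // fresh filter per dictionary
            for (String hit : dictionaryHits) {
                String spanKey = hit.substring(0, hit.indexOf(':'));
                if (seenSpans.add(spanKey)) {
                    kept.add(hit);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> umls = Arrays.asList("42,52:OtherMention", "42,52:OtherMention");
        List<String> rxnorm = Arrays.asList("42,52:MedicationMention");

        // Both run orders keep the same two hits; only their order in the list differs.
        System.out.println(keptHits(Arrays.asList(umls, rxnorm)));
        System.out.println(keptHits(Arrays.asList(rxnorm, umls)));
    }
}
```

In this sketch the MedicationMention hit survives in either run order, at the cost of two annotations on the same span.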







IMAT Solutions http://imatsolutions.com
Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

RE: Differences in MedicationMention annotations on subsequent processing runs

2014-10-09 Thread Finan, Sean

I just ran the –fast with an example containing  bacitracin in four sentences, 
once being the first word and once being the last.  In ten of ten runs all four 
bacitracin mentions were discovered.
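
A repeat-run check of this kind could be sketched roughly as below. Everything here is illustrative: RepeatRunCheckSketch and distinctResults are hypothetical names, and the stand-in "pipeline" merely counts occurrences of a term, where a real check would run the actual annotators and compare their output.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of cTakes: checks whether repeated runs agree.
class RepeatRunCheckSketch {

    /** Runs the given pipeline function on the same text `runs` times and collects the distinct results. */
    static <R> Set<R> distinctResults(Function<String, R> pipeline, String text, int runs) {
        Set<R> results = new HashSet<>();
        for (int i = 0; i < runs; i++) {
            results.add(pipeline.apply(text));
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in "pipeline": counts case-insensitive occurrences of one term.
        Function<String, Integer> countMentions = text -> {
            String lower = text.toLowerCase();
            int count = 0;
            int from = 0;
            while ((from = lower.indexOf("bacitracin", from)) >= 0) {
                count++;
                from += "bacitracin".length();
            }
            return count;
        };
        String note = "Bacitracin was applied. Irrigated with bacitracin.";

        // A single distinct result across ten runs means the output was stable.
        System.out.println(distinctResults(countMentions, note, 10).size() == 1);
    }
}
```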

Did you completely replace the dictionary lookup with the following?
<delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
  <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
</delegateAnalysisEngine>


RE: Need information regarding cTakes changes

2014-10-20 Thread Finan, Sean

Hi Chandu,
For your note #2:
 2) Any new features that can be added to current version of cTakes 
 project to make it more useful.
You can always check (or add to) the Jira future enhancement page at:
https://issues.apache.org/jira/browse/CTAKES/fixforversion/12323040/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel

Sean

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Monday, October 20, 2014 2:40 PM
To: dev@ctakes.apache.org
Subject: Re: Need information regarding cTakes changes


On 10/17/2014 05:23 AM, sarath chandra Reddy wrote:
 Hi,

 I am not proposing any changes, as I do not have much knowledge about 
 the cTakes project code. I am asking the people who are currently 
 working on the development of the next cTakes version for help in 
 answering the questions mentioned in my previous mail.

 1) Any possible improvements that can be made to the current cTakes 
 version to improve its efficiency, such as code-level and design-level changes?
Well, the new fast dictionary module should solve one of the biggest issues, 
the bottleneck of the dictionary lookup. Beyond that, it would be nice to 
decrease the memory footprint of the dependency parser.

 2) Any new features that can be added to current version of cTakes 
 project to make it more useful.
Using UIMA-AS allows for scale-out, which in combination with the fast 
dictionary can allow very fast processing. Maybe it's not a feature per se, and 
maybe it will come from an outside project, but I think infrastructure that 
makes it easy to get a highly parallel and very fast version of ctakes up and 
running would be a nice addition.

(By the way, that's just one interesting example that came to mind, not 
necessarily the most important or highest priority!)

Tim


 I humbly request the developers to provide me information regarding these.

 Regards,
 Chandu

 On Thu, Oct 16, 2014 at 8:31 PM, Chen, Pei 
 pei.c...@childrens.harvard.edu
 wrote:

 Chanda,
 Could you describe what types of changes you are proposing.

 We'll welcome any contributions.

 Sent from my iPhone

 On Oct 16, 2014, at 5:21 PM, sarath chandra Reddy 
 jscredd...@gmail.com
 wrote:
 Hi,

 I am doing research work on cTakes. I request the developers working on
 the development of the cTakes project to answer the following questions.
 Please connect me with the right persons.

 -- I need three major possible improvements to the current cTakes design.
 -- Also three new features that can be added to the current cTakes project.

 I am waiting for your responses. Thank you in advance.

 Regards,
 Chandu

RE: Announcement: UMLS MedGen-MySQL dataset now available as open access download

2014-11-14 Thread Finan, Sean

Hi Andy,

Great stuff!  I think that I understand the method, but I have a question about 
the statement:

the content is publicly available per the NCBI policy and license for MedGen 
sources

Does this mean that I, Joe Anybody, could download the content, place some of 
the content in a database structured in my own fashion, package the -new- 
database, and include it in a cTakes distribution?
Or, does it mean that content downloaded by script is usable as-is and only 
as-is?  The whole "if I'd known you were going to do that I wouldn't have 
given it to you" ...

Thanks,
Sean


From: andy mcmurry [mcmurry.a...@gmail.com]
Sent: Thursday, November 13, 2014 6:59 PM
To: dev@ctakes.apache.org
Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available as open 
access download

Pei: Yes, specifically:

The source code was released by Invitae under Apache ASL 2.0 per my request
and with full blessing from our legal counsel and software team. I also
reviewed in principle the idea with John Wilbanks of Sage Bionetworks (and
formerly creative commons). This is legit, or I wouldn't have spent tons of
hours doing it.

The raw content is a set of scripts which wget a list of URLs from the NCBI
public FTP repositories. This code DOES NOT redistribute any content
whatsoever, just a list of URLs to download, unzip, and insert into a local
mysql database. To repeat: I am NOT circulating any content, just URL links
-- you must download the content yourself. And that is the beauty -- all
content is downloaded BY THE USER and the content is publicly available per
the NCBI policy and license for MedGen sources.


On Thu, Nov 13, 2014 at 11:18 AM, Chen, Pei pei.c...@childrens.harvard.edu
wrote:

 John- I believe that was the thinking.
 Andy- Just to confirm- Is the raw content of this dataset released under
 ASL2.0?  i.e. can you contribute it as a CSV or similar so that cTAKES may
 re-tokenize it using the same PTB rules, format it for cTAKES' dictionary
 lookup, etc., and then redistribute it under the same License.

  -Original Message-
  From: John Green [mailto:john.travis.gr...@gmail.com]
  Sent: Thursday, November 13, 2014 1:55 PM
  To: dev@ctakes.apache.org
  Cc: dev@ctakes.apache.org
  Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available
  as open access download
 
  The old licensed setup would be kept as a packaged option, much as it is
  now, with the unlicensed one going out in place of the current free
  dictionary? Am I understanding that right?
 
 
  JG
  —
  Sent from Mailbox
 
  On Thu, Nov 13, 2014 at 1:40 PM, andy mcmurry
  mcmurry.a...@gmail.com
  wrote:
 
   I'll crunch the numbers -- in the meantime I can tell you that
   phenotypes vary by semantic type. Clinical attributes from SNOMED are
   abundant, and many concepts in MeSH are mapped to diseases. Tons of
   pharmacological substances.
   On Nov 12, 2014 6:19 AM, Dligach, Dmitriy 
   dmitriy.dlig...@childrens.harvard.edu wrote:
   Andy, thank you for this resource!
  
   Do you have an estimate of what percentage of UMLS concepts were left
  out?
  
   Dima
  
  
  
  
   On Nov 11, 2014, at 16:02, andy mcmurry mcmurry.a...@gmail.com
  wrote:
  
Hello!
   
https://bitbucket.org/invitae/medgen-mysql (Apache Licensed ASL2)
   
We just released a new library containing a huge chunk of UMLS
concepts which are available without registering accounts/username/passwords.
LEGALLY. Yes, really!

The subset is from NCBI and it contains *thousands of concepts from SNOMED
and other vocabularies*.

The code is essentially
1. a list of WGET targets to various NCBI FTP site mirrors
2. Makefile for building the databases of interest
   
Our legal team has approved distribution for Open Access work, ASL2
LICENSE.
   
I recommend we use this opportunity to make this the default
distribution for CTAKES UMLS connections, because it obviates the
need for so much painful credentialing and back and forth
agreements with the US National Library of Medicine.
   
Cheers!
--Andy
   
   
On Wed, Sep 10, 2014 at 12:13 PM, Masanz, James J. 
   masanz.ja...@mayo.edu
wrote:
   
   
I would love to see the install be as simple as apt-get install to end up
with some working dictionaries that have more than a handful of
entries to get them started.
   
Regards,
James Masanz
   
-Original Message-
From: andy mcmurry [mailto:mcmurry.a...@gmail.com]
Sent: Tuesday, September 09, 2014 4:32 PM
To: ctakes-...@incubator.apache.org
Subject: Recommendation for ctakes default (UMLS) dictionaries
   
Greetings ctakes-dev:
   
*UMLS license restrictions have been getting more lax over the
years -- *much of the UMLS can be downloaded directly from the
NCBI official FTP site.
   
In fact, the NIH (and implicitly the NLM) *have already made the
   standard
terms

RE: Asking help for always unsuccessful AE load

2014-12-04 Thread Finan, Sean

Hi Jun,

Do AE pipelines that do not use the Smoking Status module work?

I think that Smoking Status configuration (via binary install) might be broken 
in the last several versions.  I thought that I had submitted a Jira long, long 
ago, but right now I can't find it so maybe my memory is playing games.  I have 
gotten the module to work, but it took hours to find and fix the problems.  If 
you can get other AEs to run then let me know and I'll try to find my working 
setup and diff it with the cTakes install tomorrow.  If I remember correctly I 
had to move (unpack) some things from lib/ jars to resources/ and change a path 
or two in the desc/ xmls.
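
For background, one well-known way to get this exact message in plain Java is to pass a jar: URI to new File(URI), which only accepts hierarchical file: URIs. Whether that is what the Smoking Status descriptors trigger is an assumption, but it would be consistent with unpacking resources from jars fixing the problem. The class name and the paths in this sketch are hypothetical.

```java
import java.io.File;
import java.net.URI;

// Illustrative only: shows why a resource located inside a jar cannot be opened as a File.
class JarUriSketch {

    /** Returns the exception message from new File(uri), or null if the File could be constructed. */
    static String fileFromUriError(String uriText) {
        try {
            new File(URI.create(uriText));
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A resource still packed inside a jar has an opaque "jar:" URI, which File rejects.
        System.out.println(fileFromUriError("jar:file:/tmp/example.jar!/desc/Example.xml"));
        // The same resource unpacked onto disk has a plain "file:" URI, which File accepts.
        System.out.println(fileFromUriError("file:/tmp/desc/Example.xml"));
    }
}
```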

Sean


From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Wednesday, December 03, 2014 10:52 AM
To: dev@ctakes.apache.org
Subject: Re: Asking help for always unsuccessful AE load

Hi Jun,

I know this has been a problem in some versions. What version are you using? 
Could you try this out on the latest release candidate to see if it is still an 
issue?

Thanks,

IMAT Solutions http://imatsolutions.com
Kim Ebert
Software Engineer
Office: 801.669.7342
kim.eb...@imatsolutions.com
On 12/02/2014 08:28 PM, Ying, Jun wrote:

Dear Sir/Madam,

When I load some AEs in cTakes, such as SimulatedProdSmokingTAE.xml, it always 
throws the exception java.lang.IllegalArgumentException: URI is not 
hierarchical. Why does this happen, and how can I fix it?

Thanks.



The information in this e-mail is intended only for the person to whom it is
addressed. If you believe this e-mail was sent to you in error and the e-mail
contains patient information, please contact the Partners Compliance HelpLine at
http://www.partners.org/complianceline . If the e-mail was sent to you in error
but does not contain patient information, please contact the sender and properly
dispose of the e-mail.

RE: Scaling cTakes

2014-12-05 Thread Finan, Sean

Hi Brandon,

It sounds like you've got  a decent pipeline set up.  To increase the speed you 
could try swapping out use of ctakes-dictionary-lookup with 
ctakes-dictionary-lookup-fast in the AE.  Check 
ctakes-clinical-pipeline/desc/[ae]/AggregatePlaintextFastUMLSProcessor.xml for 
an example.  As for the CASPool, I don't think that it will make any difference 
for cTakes.  

Sean

From: Geise, Brandon D. [bdge...@geisinger.edu]
Sent: Friday, December 05, 2014 12:40 PM
To: dev@ctakes.apache.org
Subject: Scaling cTakes

Hi,

I'm new to cTakes and the UIMA framework.  I've read most of the UIMA 
documentation and was able to take the BagofCUIGenerator example and modify it 
to read notes from a DB, process them using the UMLS AE in the clinical-pipeline 
with a local DB version of UMLS, and output the CUIs to a DB.  However, the 
problem I'm having is that it's extremely slow: ~3.5-4 notes a minute.  I was 
hoping I could get some hints or advice on speeding the process up.  I read 
there's a patch for LVG, but wasn't quite sure how to implement it.  Also, from 
testing using the CPE GUI, I don't notice any difference in processing time 
when adjusting the CASPool setting.  Some advice on the CASPool would be 
appreciated also.

Thanks,
Brandon


IMPORTANT WARNING: The information in this message (and the documents attached 
to it, if any) is confidential and may be legally privileged. It is intended 
solely for the addressee. Access to this message by anyone else is 
unauthorized. If you are not the intended recipient, any disclosure, copying, 
distribution or any action taken, or omitted to be taken, in reliance on it is 
prohibited and may be unlawful. If you have received this message in error, 
please delete all electronic copies of this message (and the documents attached 
to it, if any), destroy any hard copies you may have created and notify me 
immediately by replying to this email. Thank you.

Geisinger Health System utilizes an encryption process to safeguard Protected 
Health Information and other confidential data contained in external e-mail 
messages. If email is encrypted, the recipient will receive an e-mail 
instructing them to sign on to the Geisinger Health System Secure E-mail 
Message Center to retrieve the encrypted e-mail.

RE: Scaling cTakes

2014-12-09 Thread Finan, Sean

Hi Brandon,

You are welcome.  I was hoping that you'd get the note processing time down to 
under a second with the different lookup, but I guess not.  I think that any 
optimization from here really depends upon what information you want to extract 
from the notes.
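
For a sense of scale, the figures in this thread can be turned into seconds per note. This is only arithmetic on the numbers quoted (about 3.5-4 notes a minute before, a fourfold speedup after); the class and method names are illustrative.

```java
// Back-of-the-envelope arithmetic only; the class and method names are illustrative.
class ThroughputSketch {

    /** Converts a processing rate in notes per minute to seconds spent per note. */
    static double secondsPerNote(double notesPerMinute) {
        return 60.0 / notesPerMinute;
    }

    public static void main(String[] args) {
        double before = 4.0;          // upper end of the reported ~3.5-4 notes per minute
        double after = before * 4;    // "quadrupling the processing speed"
        System.out.printf("before: %.2f s/note, after: %.2f s/note%n",
                secondsPerNote(before), secondsPerNote(after));
    }
}
```

At the upper end of the reported rate this works out to about 15 seconds per note before the change and about 3.75 seconds after, still several times slower than the sub-second target mentioned above.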

Sean

From: Geise, Brandon D. [bdge...@geisinger.edu]
Sent: Tuesday, December 09, 2014 9:13 AM
To: dev@ctakes.apache.org
Subject: RE: Scaling cTakes

Thanks again Sean for the advice.  Just changing the pipeline to use the 
fast dictionary quadrupled the processing speed.  Any other suggestions 
on performance tuning would be great!

Thanks,
Brandon

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Friday, December 05, 2014 1:14 PM
To: dev@ctakes.apache.org
Subject: RE: Scaling cTakes

Hi Brandon,

It sounds like you've got  a decent pipeline set up.  To increase the speed you 
could try swapping out use of ctakes-dictionary-lookup with 
ctakes-dictionary-lookup-fast in the AE.  Check 
ctakes-clinical-pipeline/desc/[ae]/AggregatePlaintextFastUMLSProcessor.xml for 
an example.  As for the CASPool, I don't think that it will make any difference 
for cTakes.

Sean

From: Geise, Brandon D. [bdge...@geisinger.edu]
Sent: Friday, December 05, 2014 12:40 PM
To: dev@ctakes.apache.org
Subject: Scaling cTakes

Hi,

I'm new to cTakes and the UIMA framework.  I've read most of the UIMA 
documentation and was able to take the BagofCUIGenerator example and modify it to 
read notes from a DB, process them using the UMLS AE in the clinical-pipeline with 
a local DB version of UMLS, and output the CUIs to a DB.  However, the problem 
I'm having is that it's extremely slow: ~3.5-4 notes a minute.  I was hoping I could 
get some hints or advice on speeding the process up.  I read there's a patch 
for LVG, but wasn't quite sure how to apply it.  Also, from testing using the 
CPE GUI, I don't notice any difference in processing time when adjusting the 
CASPool setting.  Some advice on the CASPool would be appreciated also.

Thanks,
Brandon



RE: revamping the Apache cTAKES website

2014-12-15 Thread Finan, Sean

Anyway, a pretty amazing fresh start, thanks Pei

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] 
Sent: Monday, December 15, 2014 4:33 PM
To: dev@ctakes.apache.org
Subject: RE: revamping the Apache cTAKES website

Check out a mockup of a new website proposal:
http://svn.apache.org/repos/asf/ctakes/site/new/index.html
Based on Bootstrap (idea borrowed from the Spark folks).

Couple of key pieces of info:
- 10% of visitors are on mobile/tablets
- The most currently visited pages are: downloads.cgi, gettingstarted.html.  I 
suggest we focus our attention on those 2 items.  (Putting a Downloads link 
right on the front page, etc.)

svn co http://svn.apache.org/repos/asf/ctakes/site/new if you want to check out 
the code of the site.

--Pei

-Original Message-
From: John Green [mailto:john.travis.gr...@gmail.com]
Sent: Friday, December 05, 2014 6:34 PM
To: dev@ctakes.apache.org
Cc: dev@ctakes.apache.org
Subject: RE: revamping the Apache cTAKES website

I would like to second the bootstrap recommendation, with the additional 
recommendation of django for the backend. It is an amazing platform for rapid 
development and easy updating.


JG

On Fri, Dec 5, 2014 at 12:15 PM, Savova, Guergana 
guergana.sav...@childrens.harvard.edu wrote:

 There are now 4 volunteers:
 Michelle Chen
 Pei Chen
 Sean Finan
 Guergana Savova
 --Guergana
 -Original Message-
 From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu]
 Sent: Friday, December 05, 2014 11:56 AM
 To: dev@ctakes.apache.org
 Subject: RE: revamping the Apache cTAKES website

 Wonderful, thank you, Michelle! There will be a flurry of emails the week 
 of Dec 15 followed by actual work, so book your calendar if possible...
 --Guergana
 -Original Message-
 From: Michelle Chen [mailto:michelle1919c...@gmail.com]
 Sent: Friday, December 05, 2014 11:48 AM
 To: dev@ctakes.apache.org
 Subject: Re: revamping the Apache cTAKES website

 Hello Guergana, I don't know that much about cTakes, but would be 
 interested in contributing to the effort.
 I'm not sure if there is an interest in matching the website design of other 
 Apache projects, but it seems that the two main designs in use, judging 
 from my arbitrary search on http://projects.apache.org/indexes/alpha.html, are 
 1. the current design that cTakes is using and 2. a Bootstrap approach.
 I've done a little bit of work on Bootstrap and would be interested in 
 helping with that. Let me know how I can be helpful.
 Sincerely,
 Michelle Chen :)
 Be strong and of good courage; do not be afraid, nor be dismayed, for 
 the Lord your God is with you wherever you go. ~Joshua 1:9

 On Fri, Dec 5, 2014 at 11:21 AM, Savova, Guergana 
 guergana.sav...@childrens.harvard.edu wrote:
 cTAKES-ers,

 we would like to start working on updating the Apache cTAKES website
 - some of the information there is already stale and needs refreshing.
 Do you have ideas on website design, content, etc.? Would you like to 
 contribute to the effort? We are planning to start working on the 
 website the week of Dec 15.

 Cheers,
 --Guergana

RE: Problem running cTakes-clinical pipeline -- AggregatePlaintextFastUMLSProcessor.xml

2014-12-15 Thread Finan, Sean

Hi Yu,

 Also, do you know if there is any command line I can run to annotate, say, a 
 thousand files automatically rather than copying and pasting?

You could try the CPE GUI: bin/runctakesCPE.sh

Sean

From: Liang, Yu [mailto:yu.li...@nyumc.org]
Sent: Monday, December 15, 2014 4:51 PM
To: dev@ctakes.apache.org
Subject: Problem running cTakes-clinical pipeline -- 
AggregatePlaintextFastUMLSProcessor.xml



Hi Yu,
I think this is a current limitation in cTAKES.  I think it has to do with 
negation detection not recognizing that the line breaks separate the sentences.

Would you mind forwarding the example to dev@ctakes.apache.org?
I think Tim and others may be working on this issue.

--Pei

On Mon, Dec 15, 2014 at 3:54 PM, Liang, Yu yu.li...@nyumc.org wrote:
On Dec 15, 2014, at 2:58 PM, Liang, Yu yu.li...@nyumc.org wrote:

Hi Pei Chen,

Could you please look at the following example I ran? I think the result is not 
accurate. The polarity of illness is -1, but the polarities of fever, vomiting, 
diarrhea, and pain are all +1.

Also, do you know if there is any command line I can run to annotate, say, a 
thousand files automatically rather than copying and pasting?

Yu Liang


Yu Liang

CHIBI

RE: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link

2014-12-16 Thread Finan, Sean

Hi John,

Look for an Upload button in the upper-left corner next to a blue Sign in 
button.

Sean

-Original Message-
From: John Green [mailto:john.travis.gr...@gmail.com] 
Sent: Tuesday, December 16, 2014 11:12 AM
To: dev@ctakes.apache.org
Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes Channel 
Direct Link

That is, how do we upload videos *to the channel*?

On Tue, Dec 16, 2014 at 11:09 AM, John Green john.travis.gr...@gmail.com
wrote:

 How do we upload videos we wish to contribute? I don't have any 
 experience with YouTube other than as a watcher.

 JG

 On Mon, Dec 15, 2014 at 11:43 AM, Finan, Sean  
 sean.fi...@childrens.harvard.edu wrote:

 Hmmm, I can't find it in a search.  However, here is a direct link:

 https://www.youtube.com/channel/UC8hQoOKz3v4PNEf6cqSkjbQ

 Maybe it needs a few videos to register in the search engine ?

 Sean

 -Original Message-
 From: Pei Chen [mailto:chen...@apache.org]
 Sent: Monday, December 15, 2014 11:32 AM
 To: dev@ctakes.apache.org
 Subject: Re: intro video and ctakes youtube

 John,
 I presume you mean this thread:

 http://mail-archives.apache.org/mod_mbox/ctakes-dev/201408.mbox/%3C393252f14c42f946952f1ed75d316cad39158...@chexmbx4a.chboston.org%3E

 Strange, I couldn't find it anymore either... The place holder could 
 have been auto deleted because it was empty?  I think it's worth it 
 if you're willing to create and add to it again...

 ---Pei

 On Fri, Dec 12, 2014 at 11:46 PM, John Green 
 john.travis.gr...@gmail.com
 
 wrote:
 
  I was going to post some basic how-to videos that help with the 
  learning curve I've walked over the last year and a half. I went 
  looking for the ctakes youtube channel mentioned a while back and I did 
  not find it...
 
  Anyone know where it went?
 
  Best,
  JG

RE: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link

2014-12-17 Thread Finan, Sean

Hmmm, well this is a ticker:

http://www.ampercent.com/upload-videos-youtube-channel-without-knowing-username-password/9374/

-Original Message-
From: John Green [mailto:john.travis.gr...@gmail.com] 
Sent: Wednesday, December 17, 2014 2:08 PM
To: dev@ctakes.apache.org
Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes Channel 
Direct Link

Isn't this for uploading to my own account? What about to the channel?

On Tue, Dec 16, 2014 at 12:16 PM, Finan, Sean  
sean.fi...@childrens.harvard.edu wrote:

 Hi John,

 Look for an Upload button in the upper-left corner next to a blue 
 Sign in button.

 Sean

 -Original Message-
 From: John Green [mailto:john.travis.gr...@gmail.com]
 Sent: Tuesday, December 16, 2014 11:12 AM
 To: dev@ctakes.apache.org
 Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes 
 Channel Direct Link

 That is, how do we upload videos *to the channel. *

 On Tue, Dec 16, 2014 at 11:09 AM, John Green 
 john.travis.gr...@gmail.com
 wrote:

  How do we upload videos we wish to contribute? I dont have any 
  experience with youtube other than as a watcher.

  JG

  On Mon, Dec 15, 2014 at 11:43 AM, Finan, Sean  
  sean.fi...@childrens.harvard.edu wrote:

  Hmmm, I can't find it in a search.  However, here is a direct link:

  https://www.youtube.com/channel/UC8hQoOKz3v4PNEf6cqSkjbQ

  Maybe it needs a few videos to register in the search engine ?

  Sean

  -Original Message-
  From: Pei Chen [mailto:chen...@apache.org]
  Sent: Monday, December 15, 2014 11:32 AM
  To: dev@ctakes.apache.org
  Subject: Re: intro video and ctakes youtube

  John,
  I presume you this thread:

  http://mail-archives.apache.org/mod_mbox/ctakes-dev/201408.mbox/%3C
  39 3252f14c42f946952f1ed75d316cad39158...@chexmbx4a.chboston.org%3E

  Strange, I couldn't find it anymore either... The place holder 
  could have been auto deleted because it was empty?  I think it's 
  worth it if you're willing to create and add to it again...

  ---Pei

  On Fri, Dec 12, 2014 at 11:46 PM, John Green 
  john.travis.gr...@gmail.com

  wrote:

   I was going to post some basic how to videos that help with the 
   learning curve I've walked over the last year and a half. I went 
   looking for ctakes youtube channel mentioned awhile back and I 
   did not
  find it...

   Anyone know where it went?

   Best,
   JG

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean

One quick mention:

The cTakes dictionaries are built with UMLS 2011AB.  If the human annotations 
were not done using the same UMLS version then there WILL be differences in CUI 
and semantic group.  I don't have time to go into details, examples, 
etc.; just be aware that every 6 months CUIs are added, removed, deprecated, and 
moved from one TUI to another.

Sean

-Original Message-
From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu] 
Sent: Friday, December 19, 2014 1:28 PM
To: dev@ctakes.apache.org
Subject: RE: cTakes Annotation Comparison

Several thoughts:
1. The ShARe corpus annotates only mentions of type Diseases/Disorders and only 
Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES 
annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures, 
Medications and Anatomical Sites. Therefore you will get MANY more annotations 
with cTAKES. Eventually the ShARe corpus will be expanded to the other types.

2. Keeping (1) in mind, you can approximately estimate the precision/recall/f1 
of cTAKES on the ShARe corpus if you output only mentions of type 
Disease/Disorder. 

3. Could you send us the list of files you use from ShARe to test? We have the 
corpus and would like to run against it as well.

Hope this makes sense...
--Guergana

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 1:16 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Our analysis against the human adjudicated gold standard from this ShARe corpus 
uses a simple check to see if the cTakes output included the annotation 
specified by the gold standard. The initial results I reported were for exact 
matches of CUI and text span.  Only exact matches were counted.

If we also count as matches cTakes annotations with a matching 
CUI and a text span that overlaps the gold standard text span, then the matches 
increase to 224 matching annotations for the FastUMLS pipeline and 2319 for the 
old pipeline.

The question was also asked about annotations in the cTakes output that were 
not in the human adjudicated gold standard. The answer is yes, there were a lot 
of additional annotations made by cTakes that don't appear to be in the gold 
standard. We haven't analyzed that yet, but it looks like the gold standard we 
are using may only have Disease_Disorder annotations.
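
The two counting rules described above (exact CUI plus span match versus matching CUI with an overlapping span) can be sketched as follows. This is only an illustration of the comparison logic, not the analysis tool used in the thread; the Annotation type, the function names, and the sample CUIs and offsets are all placeholders.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    cui: str    # UMLS concept identifier
    begin: int  # start character offset (inclusive)
    end: int    # end character offset (exclusive)


def exact_match(gold: Annotation, pred: Annotation) -> bool:
    """Same CUI and identical text span."""
    return gold == pred


def overlap_match(gold: Annotation, pred: Annotation) -> bool:
    """Same CUI and spans that share at least one character."""
    return (gold.cui == pred.cui
            and pred.begin < gold.end
            and gold.begin < pred.end)


def count_matches(gold: List[Annotation], preds: List[Annotation], matcher) -> int:
    """Count gold annotations matched by at least one system annotation."""
    return sum(1 for g in gold if any(matcher(g, p) for p in preds))


# Made-up data: placeholder CUIs and offsets, not real results.
gold = [Annotation("C0000001", 10, 25), Annotation("C0000002", 40, 52)]
system = [Annotation("C0000001", 10, 25), Annotation("C0000002", 44, 52)]

print(count_matches(gold, system, exact_match))
print(count_matches(gold, system, overlap_match))
```

Because every exact match is also an overlap match, the overlap count can never be lower than the exact count for the same pair of annotation sets.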




On Fri, Dec 19, 2014 at 9:54 AM, Miller, Timothy  
timothy.mil...@childrens.harvard.edu wrote:

 Thanks Kim,
 This sounds interesting though I don't totally understand it. Are you 
 saying that extraction performance for a given note depends on which 
 order the note was in the processing queue? If so that's pretty bad! 
 If you (or anyone else who understands this issue) have a concrete 
 example I think that might help me understand what the problem is/was.

 Even though, as Pei mentioned, we are going to try moving the 
 community to the faster dictionary, I would like to understand better 
 just to help myself avoid issues of this type going forward (and 
 verify the new dictionary doesn't use similar logic).

 Also, when we finish annotating the sample notes, might we use that as 
 a point of comparison for the two dictionaries? That would get around 
 the issue that not everyone has access to the datasets we used for 
 validation and others are likely not able to share theirs either. And 
 maybe we can replicate the notes if we want to simulate the scenario 
 Kim is talking about with thousands or more notes.

 Tim


 On 12/19/2014 10:24 AM, Kim Ebert wrote:
 Guergana,

 I'm curious about the number of records that are in your gold standard 
 sets, or whether your gold standard set was run through a long-running cTAKES 
 process.
 I know at some point we fixed a bug in the old dictionary lookup that 
 caused the permutations to become corrupted over time. Typically this 
 isn't seen in the first few records, but over time as patterns are 
 used the permutations would become corrupted. This caused documents 
 that were fed through cTAKES more than once to have fewer codes 
 returned than the first time.

 For example, if a permutation of 4,2,3,1 was found, the permutation 
 would be corrupted to be 1,2,3,4. It would no longer be possible to 
 detect permutations of 4,2,3,1 until cTAKES was restarted. We got the 
 fix in after the cTAKES 3.2.0 release. 
 https://issues.apache.org/jira/browse/CTAKES-310
 Depending upon the corpus size, I could see the permutation engine 
 eventually having only a single permutation of 1,2,3,4.

 Typically though, this isn't very easily detected in the first 100 or 
 so documents.

 We discovered this issue when we made cTAKES have consistent output of 
 codes in our system.
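
 To make the failure mode described above concrete, here is a hypothetical sketch. It is NOT the cTAKES code and does not claim to reproduce the actual cause fixed under CTAKES-310; it only assumes, for illustration, that a shared permutation list was being sorted in place after use. All names are made up.

```python
def lookup_buggy(shared_permutations, order):
    """Return True if `order` is a known permutation.

    ASSUMED bug for illustration: the matched permutation is sorted in
    place, so the shared list is silently changed for later documents.
    """
    for perm in shared_permutations:
        if perm == order:
            perm.sort()  # mutates shared state: 4,2,3,1 becomes 1,2,3,4
            return True
    return False


def lookup_fixed(shared_permutations, order):
    """Same lookup, but any reordering is done on a copy."""
    for perm in shared_permutations:
        if perm == order:
            _normalized = sorted(perm)  # new list; the shared one is untouched
            return True
    return False


buggy_state = [[1, 2, 3, 4], [4, 2, 3, 1]]
first_buggy = lookup_buggy(buggy_state, [4, 2, 3, 1])
second_buggy = lookup_buggy(buggy_state, [4, 2, 3, 1])

fixed_state = [[1, 2, 3, 4], [4, 2, 3, 1]]
first_fixed = lookup_fixed(fixed_state, [4, 2, 3, 1])
second_fixed = lookup_fixed(fixed_state, [4, 2, 3, 1])

print(first_buggy, second_buggy, first_fixed, second_fixed)
```

 In the buggy variant the second lookup of the same pattern fails until the state is rebuilt, which mirrors the "works until restart" behaviour described above.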


RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean

I’m bringing it up in case the Human Annotations were done using a different 
version.

From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 1:40 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Sean,

I don't think that would be an issue since both the rare word lookup and the 
first word lookup are using UMLS 2011AB. Or is the rare word lookup using a 
different dictionary?

I would expect roughly similar results between the two when it comes to 
differences between UMLS versions.

On 12/19/2014 11:31 AM, Finan, Sean wrote:

One quick mention:

The cTakes dictionaries are built with UMLS 2011AB.  If the Human annotations 
were not done using the same UMLS version then there WILL be differences in CUI 
and Semantic group.  I don't have time to go into it with details, examples, 
etc. just be aware that every 6 months cuis are added, removed, deprecated, and 
moved from one TUI to another.

Sean

-Original Message-
From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu]
Sent: Friday, December 19, 2014 1:28 PM
To: dev@ctakes.apache.org
Subject: RE: cTakes Annotation Comparison

Several thoughts:

1. The ShARE corpus annotates only mentions of type Diseases/Disorders and only 
Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES 
annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures, 
Medications and Anatomical Sites. Therefore you will get MANY more annotations 
with cTAKES. Eventually the ShARe corpus will be expanded to the other types.

2. Keeping (1) in mind, you can approximately estimate the precision/recall/f1 
of cTAKES on the ShARe corpus if you output only mentions of type 
Disease/Disorder.

3. Could you send us the list of files you use from ShARe to test? We have the 
corpus and would like to run against as well.

Hope this makes sense...

--Guergana

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 1:16 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Our analysis against the human adjudicated gold standard from this SHARE corpus 
is using a simple check to see if the cTakes output included the annotation 
specified by the gold standard. The initial results I reported were for exact 
matches of CUI and text span.  Only exact matches were counted.

If we also count as matches cTakes annotations with a matching 
CUI and a text span that overlaps the gold standard text span, then the matches 
increase to 224 matching annotations for the FastUMLS pipeline and 2319 for the 
old pipeline.

The question was also asked about annotations in the cTakes output that were 
not in the human adjudicated gold standard. The answer is yes, there were a lot 
of additional annotations made by cTakes that don't appear to be in the gold 
standard. We haven't analyzed that yet, but it looks like the gold standard we 
are using may only have Disease_Disorder annotations.


On Fri, Dec 19, 2014 at 9:54 AM, Miller, Timothy 
timothy.mil...@childrens.harvard.edu wrote:

Thanks Kim,

This sounds interesting though I don't totally understand it. Are you
saying that extraction performance for a given note depends on which
order the note was in the processing queue? If so that's pretty bad!
If you (or anyone else who understands this issue) have a concrete
example I think that might help me understand what the problem is/was.

Even though, as Pei mentioned, we are going to try moving the
community to the faster dictionary, I would like to understand better
just to help myself avoid issues of this type going forward (and
verify the new dictionary doesn't use similar logic).

Also, when we finish annotating the sample notes, might we use that as
a point of comparison for the two dictionaries? That would get around
the issue that not everyone has access to the datasets we used for
validation and others are likely not able to share theirs either. And
maybe we can replicate the notes if we want to simulate the scenario
Kim is talking about with thousands or more notes.

Tim

On 12/19/2014 10:24 AM, Kim Ebert wrote:

Guergana,

I'm curious to the number of records that are in your gold standard
sets, or if your gold standard set was run through a long running cTAKES 
process.

I know at some point we fixed a bug in the old dictionary lookup that
caused

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean

Hi Bruce,

I'm not sure how there would be fewer matches with the overlap processor.  
There should be all of the matches from the non-overlap processor plus those 
from the overlap.  Decreasing from 215 to 211 is strange.  Have you done any 
manual spot checks on this?  It is really bizarre that you'd only have two 
matches per document (100 docs?).  

Thanks,
Sean

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 3:23 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Sean,

I tried the configuration changes you mentioned in your earlier email.

The results are as follows:

Total Annotations found: 12,161 (default configuration found 8,284)

If counting exact span matches, this run only matched 211 (default 
configuration matched 215).

If counting overlapping spans, this run only matched 220 (default configuration 
matched 224)

Bruce




On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei pei.c...@childrens.harvard.edu
wrote:

  Kim,

 Maintenance, not bugs/issues, is the reason to forge ahead.

 They are 2 components that do the same thing with the same goal (as 
 Sean mentioned, one should be able to configure the new code base to 
 replicate the old algorithm if required; it’s just a simpler and 
 cleaner code base.  If this is not the case or if there are issues, we 
 should fix it and move forward).

 We can keep the old component around for as long as needed, but it’s 
 likely going to have limited support…

 --Pei



 *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
 *Sent:* Friday, December 19, 2014 1:47 PM
 *To:* Chen, Pei; dev@ctakes.apache.org

 *Subject:* Re: cTakes Annotation Comparison



 Pei,

 I don't think bugs/issues should be part of determining if one 
 algorithm vs the other is superior. Obviously, it is worth mentioning 
 the bugs, but if the fast lookup method has worse precision and recall 
 but better performance, vs the slower but more accurate first word 
 lookup algorithm, then time should be invested in fixing those bugs 
 and resolving those weird issues.

 Now I'm not saying which one is superior in this case, as the data 
 will end up speaking for itself one way or the other; but as of right 
 now, I'm not convinced that the old dictionary lookup is obsolete, 
 and I'm not sure the community is convinced yet either.




 On 12/19/2014 08:39 AM, Chen, Pei wrote:

 Also check out stats that Sean ran before releasing the new component on:


 http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx

 From the evaluation and experience, the new lookup algorithm should be 
 a huge improvement in terms of both speed and accuracy.

 This is very different from what Bruce mentioned…  I’m sure Sean will 
 chime in here.

 (The old dictionary lookup is essentially obsolete now- plagued with 
 bugs/issues as you mentioned.)

 --Pei



 *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
 *Sent:* Friday, December 19, 2014 10:25 AM
 *To:* dev@ctakes.apache.org
 *Subject:* Re: cTakes Annotation Comparison



 Guergana,

 I'm curious to the number of records that are in your gold standard 
 sets, or if your gold standard set was run through a long running cTAKES 
 process.
 I know at some point we fixed a bug in the old dictionary lookup that 
 caused the permutations to become corrupted over time. Typically this 
 isn't seen in the first few records, but over time as patterns are 
 used the permutations would become corrupted. This caused documents 
 that were fed through cTAKES more than once to have less codes 
 returned than the first time.

 For example, if a permutation of 4,2,3,1 was found, the permutation 
 would be corrupted to be 1,2,3,4. It would no longer be possible to 
 detect permutations of 4,2,3,1 until cTAKES was restarted. We got the 
 fix in after the cTAKES 3.2.0 release. 
 https://issues.apache.org/jira/browse/CTAKES-310
 Depending upon the corpus size, I could see the permutation engine 
 eventually only have a single permutation of 1,2,3,4.

 Typically though, this isn't very easily detected in the first 100 or 
 so documents.

 We discovered this issue when we made cTAKES have consistent output of 
 codes in our system.



 [image: IMAT Solutions] http://imatsolutions.com

 *Kim Ebert*
 Software Engineer
 [image: Office:]801.669.7342
 kim.eb...@imatsolutions.com greg.hub...@imatsolutions.com

 On 12/19/2014 07:05 AM, Savova, Guergana wrote:

 We are doing a similar kind of evaluation and will report the results.



 Before we released the Fast lookup, we did

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean

Hi Bruce,
 Correction -- So far, I did steps 1 and 2 of Sean's email.

No problem.  Aside from recreating the database, those two steps have the 
greatest impact.  But before you change anything else, please do some manual 
spot checks.  I have never seen a case where the lookup would be so horribly 
inaccurate.

Thanks

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 3:29 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Correction -- So far, I did steps 1 and 2 of Sean's email.



On Fri, Dec 19, 2014 at 1:22 PM, Bruce Tietjen  
bruce.tiet...@perfectsearchcorp.com wrote:

 Sean,

 I tried the configuration changes you mentioned in your earlier email.

 The results are as follows:

 Total Annotations found: 12,161 (default configuration found 8,284)

 If counting exact span matches, this run only matched 211 (default 
 configuration matched 215).

 If counting overlapping spans, this run only matched 220 (default 
 configuration matched 224)

 Bruce



  [image: IMAT Solutions] http://imatsolutions.com  Bruce Tietjen 
 Senior Software Engineer
 [image: Mobile:] 801.634.1547
 bruce.tiet...@imatsolutions.com

 On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei  
 pei.c...@childrens.harvard.edu wrote:

  Kim,

 Maintenance is the factor not bugs/issue to forge ahead.

 They are 2 components that do the same thing with the same goal (As 
 Sean mentioned, one should be able configure the new code base to  
 replicate the old algorithm if required- it’s just a simpler and 
 cleaner code base.  If this is not the case or if there are issues, 
 we should fix it and move forward.).

 We can keep the old component around for as long as needed, but it’s 
 likely going to have limited support…

 --Pei



 *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
 *Sent:* Friday, December 19, 2014 1:47 PM
 *To:* Chen, Pei; dev@ctakes.apache.org

 *Subject:* Re: cTakes Annotation Comparison



 Pei,

 I don't think bugs/issues should be part of determining if one 
 algorithm vs the other is superior. Obviously, it is worth mentioning 
 the bugs, but if the fast lookup method has worse precision and 
 recall but better performance, vs the slower but more accurate first 
 word lookup algorithm, then time should be invested in fixing those 
 bugs and resolving those weird issues.

 Now I'm not saying which one is superior in this case, as the data 
 will end up speaking for itself one way or the other; but as of right 
 now, I'm not convinced that the old dictionary lookup is obsolete, 
 and I'm not sure the community is convinced yet either.



 [image: IMAT Solutions] http://imatsolutions.com

 *Kim Ebert*
 Software Engineer
 [image: Office:]801.669.7342
 kim.eb...@imatsolutions.com greg.hub...@imatsolutions.com

 On 12/19/2014 08:39 AM, Chen, Pei wrote:

 Also check out stats that Sean ran before releasing the new component on:


 http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx

 From the evaluation and experience, the new lookup algorithm should 
 be a huge improvement in terms of both speed and accuracy.

 This is very different than what Bruce mentioned…  I’m sure Sean will 
 chime here.

 (The old dictionary lookup is essentially obsolete now- plagued with 
 bugs/issues as you mentioned.)

 --Pei



 *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com
 kim.eb...@perfectsearchcorp.com]
 *Sent:* Friday, December 19, 2014 10:25 AM
 *To:* dev@ctakes.apache.org
 *Subject:* Re: cTakes Annotation Comparison



 Guergana,

 I'm curious to the number of records that are in your gold standard 
 sets, or if your gold standard set was run through a long running cTAKES 
 process.
 I know at some point we fixed a bug in the old dictionary lookup that 
 caused the permutations to become corrupted over time. Typically this 
 isn't seen in the first few records, but over time as patterns are 
 used the permutations would become corrupted. This caused documents 
 that were fed through cTAKES more than once to have less codes 
 returned than the first time.

 For example, if a permutation of 4,2,3,1 was found, the permutation 
 would be corrupted to be 1,2,3,4. It would no longer be possible to 
 detect permutations of 4,2,3,1 until cTAKES was restarted. We got the 
 fix in after the cTAKES 3.2.0 release.
 https://issues.apache.org/jira/browse/CTAKES-310 Depending upon the 
 corpus size, I could see the permutation engine eventually only have 
 a single permutation of 1,2,3,4.

 Typically though, this isn't very easily detected in the first 100 or 
 so documents.

 We discovered this issue when we made cTAKES have consistent output 
 of codes in our system.



 [image: IMAT Solutions] http://imatsolutions.com

 *Kim Ebert*

RE: cTakes Annotation Comparison --- (^:

2014-12-19 Thread Finan, Sean

Apologies accepted.  I'm really glad that you found the problem.

So what you are saying is (just to be very very clear to everybody reading this 
thread):

FastUMLSProcessor found 2795 matches (2,842 including overlaps)
While
 UMLSProcessor found 2632 matches (2,735 including overlaps)

--- So recall is BETTER in the fast lookup

And...
FastUMLSProcessor found 30,716 annotations
While
UMLSProcessor found 31,598 annotations

--- So precision is also looking BETTER in the fast lookup

Now maybe there will be a little more buy-in for the fast lookup.

Cheers,
Sean
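
As a rough back-of-the-envelope check of the recall and precision claims above, the figures quoted in this thread can be plugged in directly. The gold total of 4591 comes from Bruce's message quoted below; the function and variable names are illustrative. Note the assumption that makes the precision figure only a loose indicator: the gold standard reportedly covers only Disease/Disorder mentions, while the annotation totals include every type cTAKES produces.

```python
GOLD_TOTAL = 4591  # gold standard annotations, as reported in the thread

# processor name -> (exact matches against gold, total annotations produced)
RUNS = {
    "UMLSProcessor": (2632, 31598),
    "FastUMLSProcessor": (2795, 30716),
}


def recall(matches: int, gold_total: int) -> float:
    """Fraction of gold annotations that the system found."""
    return matches / gold_total


def precision(matches: int, produced: int) -> float:
    """Fraction of system annotations that are in the gold standard."""
    return matches / produced


for name, (matches, produced) in RUNS.items():
    print(f"{name}: recall={recall(matches, GOLD_TOTAL):.3f} "
          f"precision={precision(matches, produced):.3f}")
```

With these inputs the fast lookup comes out ahead on both ratios (recall of roughly 0.61 versus 0.57), consistent with the summary above; the absolute precision values are low mainly because of the type mismatch just mentioned.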


-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 5:05 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

My apologies to Sean and everyone,

I am happy to report that I found a bug in our analysis tools that was missing 
the last FSArray entry for any FSArray list.

With the bug fixed, the results look MUCH better.

UMLSProcessor found 31,598 annotations
FastUMLSProcessor found 30,716 annotations

There were 23,522 annotations that were exact matches between the two.

When comparing with the gold standard annotations (4591 annotations):

UMLSProcessor found 2632 matches (2,735 including overlaps)
FastUMLSProcessor found 2795 matches (2,842 including overlaps)
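
The kind of bug described above (dropping the last entry of every array) is typically an off-by-one in a loop bound. The following is a hypothetical sketch of that pattern, not the actual analysis tool; the function names and sample data are made up.

```python
def read_entries_buggy(entries):
    """Off-by-one: the loop bound stops one short, so the last entry is lost."""
    return [entries[i] for i in range(len(entries) - 1)]


def read_entries_fixed(entries):
    """Visit every entry, including the last one."""
    return [entries[i] for i in range(len(entries))]


sample = ["concept-a", "concept-b", "concept-c"]  # placeholder entries
print(read_entries_buggy(sample))
print(read_entries_fixed(sample))
print(read_entries_buggy(["only-concept"]))
```

Under this assumed pattern an array holding a single entry yields nothing at all, which may be why the earlier match counts were so low.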







On Fri, Dec 19, 2014 at 1:49 PM, Bruce Tietjen  
bruce.tiet...@perfectsearchcorp.com wrote:

 I'll do that -- there is always a possibility of bugs in the analysis 
 tool.



 On Fri, Dec 19, 2014 at 1:39 PM, Finan, Sean  
 sean.fi...@childrens.harvard.edu wrote:

  Sorry, I meant “Do some spot checks on the validity”.  In other 
 words, when your script reports that a cui and/or span is missing, 
 manually look at the data and see if it really is.  Just open up one 
 .xmi in the CVD and see what it looks like.



 Thanks,

 Sean



 *From:* Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
 *Sent:* Friday, December 19, 2014 3:37 PM
 *To:* dev@ctakes.apache.org
 *Subject:* Re: cTakes Annotation Comparison



 My original results were using a newly downloaded cTakes 3.2.1 with 
 the separately downloaded resources copied in. There were no changes 
 to any of the configuration files.

 As far as this last run, I modified the UMLSLookupAnnotator.xml and 
 AggregatePlaintextFastUMLSProcessor.xml.  I've attached the modified 
 ones I used (but they may not get through the mailing list).









 On Fri, Dec 19, 2014 at 1:27 PM, Finan, Sean  
 sean.fi...@childrens.harvard.edu wrote:

 Hi Bruce,

 I'm not sure how there would be fewer matches with the overlap 
 processor.  There should be all of the matches from the non-overlap 
 processor plus those from the overlap.  Decreasing from 215 to 211 is 
 strange.  Have you done any manual spot checks on this?  It is really 
 bizarre that you'd only have two matches per document (100 docs?).

 Thanks,
 Sean

 -Original Message-
 From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
 Sent: Friday, December 19, 2014 3:23 PM
 To: dev@ctakes.apache.org
 Subject: Re: cTakes Annotation Comparison

 Sean,

 I tried the configuration changes you mentioned in your earlier email.

 The results are as follows:

 Total Annotations found: 12,161 (default configuration found 8,284)

 If counting exact span matches, this run only matched 211 (default 
 configuration matched 215).

 If counting overlapping spans, this run only matched 220 (default 
 configuration matched 224)
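
 The difference between the two counting modes above can be sketched as follows. This is a minimal illustration, not the comparison script used in this thread: spans are assumed to be [begin, end) character offsets, and all names are hypothetical.

```java
public class SpanMatch {

    // True when both spans cover exactly the same character offsets.
    public static boolean isExactMatch(int begin1, int end1, int begin2, int end2) {
        return begin1 == begin2 && end1 == end2;
    }

    // True when the spans share at least one character (end offsets are exclusive).
    public static boolean isOverlap(int begin1, int end1, int begin2, int end2) {
        return begin1 < end2 && begin2 < end1;
    }

    // Counts gold spans matched by at least one system span.
    // Each span is an int[]{begin, end}; each gold span is counted at most once.
    public static int countMatches(int[][] gold, int[][] system, boolean allowOverlap) {
        int count = 0;
        for (int[] g : gold) {
            for (int[] s : system) {
                final boolean exact = isExactMatch(g[0], g[1], s[0], s[1]);
                if (exact || (allowOverlap && isOverlap(g[0], g[1], s[0], s[1]))) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        final int[][] gold = { { 0, 5 }, { 10, 20 } };
        final int[][] system = { { 0, 5 }, { 12, 18 } };
        // prints exact=1 overlap=2
        System.out.println("exact=" + countMatches(gold, system, false)
                + " overlap=" + countMatches(gold, system, true));
    }
}
```

 Because every exact match also passes the overlap test here, the overlap count can never be lower than the exact count for the same pair of annotation sets.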

 Bruce




 On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei  
 pei.c...@childrens.harvard.edu
 wrote:
 
   Kim,
 
  Maintenance, not bugs or issues, is the reason to forge ahead.
 
  They are 2 components that do the same thing with the same goal.  As 
  Sean mentioned, one should be able to configure the new code base to 
  replicate the old algorithm if required; it’s just a simpler and 
  cleaner code base.  If this is not the case or if there are issues, 
  we should fix them and move forward.
 
  We can keep the old component around for as long as needed, but 
  it’s likely going to have limited support…
 
  --Pei
 
 
 
  *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
  *Sent:* Friday, December 19, 2014 1:47 PM
  *To:* Chen, Pei; dev@ctakes.apache.org
 
  *Subject:* Re: cTakes Annotation Comparison

RE: Using cTakes programmatically

2014-12-29 Thread Finan, Sean

Hi Maite Meseure,

Check the cTakes User guide on UMLS setup:

https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide#cTAKES3.2UserInstallGuide-(Recommended)AddUMLSaccessrights

which (in part) points you towards obtaining a license to use the NIH UMLS 
dictionary:

https://uts.nlm.nih.gov/license.html



Sean


From: Maite Meseure Hugues [meseure.ma...@gmail.com]
Sent: Monday, December 29, 2014 4:17 PM
To: dev@ctakes.apache.org
Subject: Using cTakes programmatically

Dear all,

I am writing to ask how I can add the cTAKES packages to my Java code so that
I get the same output as the XML output from the CPE (using
clinical-pipeline/test_plaintext.xml as descriptor).
I've explored and tested the cTakes example (using
ClinicalPipelineFactory.getDefaultPipeline()) but I got this error
message:

[...] https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser: maitemeseure

Exception in thread "main"
org.apache.uima.resource.ResourceInitializationException: Initialization of
annotator class
org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator
failed.  (Descriptor: unknown)
Thanks a lot for your time.

Best regards

--
--
 Maïté Meseure Hugues

RE: Question about the pipeline

2015-02-02 Thread Finan, Sean

Hi Tol (and Maite),

I'm not entirely certain that I understand the question, but here is an attempt 
to help.  If I'm oversimplifying then I apologize.

I think that ExampleAggregatePipeline is intended to represent a very simple 
single-note pipeline and that custom code could be produced by using it as an 
example.

If you want to process texts in a directory, you can find with a web search 
plenty of ways to list files in a directory and read text from files.  
org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader might be what you 
used in the CPE, and you can certainly peruse the code and take what you need.  
Or, if you decide to write a simple diy,  here is one possibility:

// needs imports: java.io.File, java.io.IOException, java.nio.file.Files,
//                java.nio.file.Path, java.util.ArrayList, java.util.Collection

static public Collection<File> getFilesInDir( final File directory ) {
   final Collection<File> fileList = new ArrayList<>();
   final File[] files = directory.listFiles();
   // listFiles() returns null if the path is not a readable directory
   if ( files == null ) {
      System.err.println( "please check the directory " + directory.getAbsolutePath() );
      System.exit( 1 );
   }
   for ( final File file : files ) {
      if ( file.canRead() ) {
         fileList.add( file );
      }
   }
   return fileList;
}

static public String getTextInFile( final File file ) throws IOException {   // or handle ioE herein
   final Path nioPath = file.toPath();
   return new String( Files.readAllBytes( nioPath ) );
}

static public void main( String ... args ) throws IOException {
   if ( args.length == 0 || args[0].isEmpty() ) {
      System.out.println( "Enter a directory path" );
      System.exit( 0 );
   }
   final Collection<File> files = getFilesInDir( new File( args[0] ) );
   for ( File file : files ) {
      final String note = getTextInFile( file );
      // ---  Insert here code a' la ExampleAggregatePipeline  ---
      // ---  swap out the writer in ExampleAggregatePipeline with CasIOUtil method (below)  ---
   }
}

I must admit that I have never directly used it, but there is an xmi file 
writing method in org.apache.uima.fit.util.CasIOUtil named writeXmi( JCas jCas, 
File file ).  You could give this a try and see if it produces the type of 
output that you want.  The same utility class has a writeXCas(..) method.


If the above has absolutely nothing to do with your needs then please send me a 
bulleted list of items, example workflow, etc. and I'll see if I can be of 
service.

Oh, and I wrote the above code freehand, so MS Outlook is adding capital 
letters, etc.  If you cut and paste you'll need to change that - plus I haven't 
run/compiled, so there might be a typo or missed exception or something.  Or it 
may not work (in which case I'll throw in a little more effort).

Sean


-Original Message-
From: Tol O. [mailto:tol...@gmail.com] 
Sent: Monday, February 02, 2015 6:56 PM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

Maite Meseure Hugues meseure.maite@... writes:

 
 Hello all,
 
 Thank you for your preceding answers.
 I have a few questions regarding the pipeline example to run cTakes 
 programmatically.
 I am running ExampleAggregatePipeline.java with 
 ExampleHelloWorldAnnotator but I would like to know how I can change 
 it to run my data, as the CPE where we can choose the directory of our data.
 My second question is about the xml output generated with the CPE, can 
 I get the same xml output in using the example pipeline? and How?
 Thanks for your time.


I would like to ask the same question. After successfully setting up CTAKES 
following the Developers Guide I would also like to use a modified 
ExampleAggregatePipeline to output a CAS file identical to the output obtained 
by the CPE or the CVD when following the Users Guide.

This would be a great help for developers as a starting class to be able to 
programmatically obtain an annotated file based on a plaintext or XML input, 
same as through the two GUIs.

Right now I am reading through the Component Use Guide to replicate the CPE or 
the CVD tutorial with the test input, but it is a bit overwhelming.

Any pointers or suggestions would be really appreciated.

Tol O.

RE: Question about the pipeline

2015-02-03 Thread Finan, Sean

Hi Maite,

RunCPE is a good find, and if it fits your bill then you should use it.  But it 
(if you mean the yTex class) doesn't take input and output directories from the 
command line.  It does take the path to a CPE.xml file.  There is a cTakes 
(non-yTex) equivalent named CmdLineCpeRunner.  Either one of them should print 
a usage if you run it without arguments.  As the CmdLineCpeRunner indicates, 
you can create a cpe .xml file with the cpe gui.  Basically, start the cpe gui, 
select your input (reader), output (writer) and pipeline (ae) in the gui and 
then save the cpe descriptor (via the menubar).  You can exit the gui and run 
either one of the cmd line utilities with the path to that cpe .xml descriptor 
as the argument.  Please note: sometimes you have to explicitly type .xml in 
the filename when saving with the cpe gui.  If you run with the cpe gui and 
then exit it should automatically ask you if you want to save the cpe .xml 
descriptor.  Anyway, once you have the .xml file you can always edit the input 
and output paths in that file to change your run parameters.  

Sean

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Tuesday, February 03, 2015 9:01 AM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

Thanks a lot Sean for your detailed reply. I've also found RunCPE.java, which 
allows the input and output directories to be passed as arguments in the 
environment and does the same job as the CPE GUI, at least in Eclipse; I 
haven't managed to run it via the command line yet.


RE: Question about the pipeline

2015-02-05 Thread Finan, Sean

Hi Maite,

Without more information I can't venture a guess as to a cause of the error.  
If RunCPE works then why not use that?  They are practically identical.

Sean

From: Maite Meseure Hugues [meseure.ma...@gmail.com]
Sent: Thursday, February 05, 2015 8:51 AM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

I see. In my case, I am using the CPE descriptor saved from the GUI for
CmdLineCpeRunner, as Sean said. I've selected
AggregatePlaintextProcessor.xml as the AE but I get this error:

Couldn't initialize processing engine.

  Initialization of CAS Processor with name AggregatePlaintextProcessor
failed. 

Meanwhile, RunCPE.java works properly with the same descriptor in Eclipse.
Does anyone have an idea?

On Wed, Feb 4, 2015 at 12:56 PM, Lingren, Todd todd.ling...@cchmc.org
wrote:

 Hi Maite,
 For each patient in my list, I create a new FilesToFiles CPE xml using
 some sed commands on the template original.

 Specifically, here's the command line argument (I'm on linux).

 CTAKES_HOME=...
 java -cp $CTAKES_HOME/lib/*:$CTAKES_HOME/desc/:$CTAKES_HOME/resources/
 -Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms512M -Xmx2048M
 CmdLineCpeRunner FilesToFiles_patient_cui.xml  outputfile.txt

 I don't think it matters, but I'm using the cTAKES 3.1.0 version.
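
 The same template-substitution idea can be sketched in Java instead of sed. This is a rough sketch, not part of cTAKES: it assumes a CPE descriptor saved from the GUI has been hand-edited so that its input and output paths read @INPUT_DIR@ and @OUTPUT_DIR@ (both tokens are hypothetical), and it writes one filled-in descriptor per subfolder.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CpeTemplateFiller {

    // Escapes the characters that would break an XML text or attribute value.
    static String escapeXml(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Replaces the two hypothetical placeholder tokens in the template text.
    public static String fill(String template, String inputDir, String outputDir) {
        return template.replace("@INPUT_DIR@", escapeXml(inputDir))
                       .replace("@OUTPUT_DIR@", escapeXml(outputDir));
    }

    // Writes one descriptor per subfolder of rootDir into descDir and returns the files written.
    public static List<Path> writeDescriptors(Path templateFile, Path rootDir,
                                              Path outRoot, Path descDir) throws IOException {
        final String template = new String(Files.readAllBytes(templateFile), StandardCharsets.UTF_8);
        final List<Path> subFolders;
        try (Stream<Path> children = Files.list(rootDir)) {
            subFolders = children.filter(Files::isDirectory).sorted().collect(Collectors.toList());
        }
        Files.createDirectories(descDir);
        final List<Path> written = new ArrayList<>();
        for (Path sub : subFolders) {
            final String name = sub.getFileName().toString();
            final String text = fill(template,
                    sub.toAbsolutePath().toString(),
                    outRoot.resolve(name).toAbsolutePath().toString());
            final Path descriptor = descDir.resolve("cpe_" + name + ".xml");
            Files.write(descriptor, text.getBytes(StandardCharsets.UTF_8));
            written.add(descriptor);
        }
        return written;
    }
}
```

 Each generated file could then be handed to CmdLineCpeRunner as its single argument, one run (or one cluster job) per subfolder.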

 Todd Lingren
 Biomedical Informatics
 Cincinnati Children’s Hospital
 todd.ling...@cchmc.org
 513-803-9032

 -Original Message-
 From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com]
 Sent: Wednesday, February 04, 2015 12:59 PM
 To: dev@ctakes.apache.org
 Subject: Re: Question about the pipeline

 Interesting, thank you Todd. How do you use CmdLineCpeRunner, basically?
 I tested on the command line with:

 java org.apache.ctakes.core.cpe.CmdLineCpeRunner [path-to-my-cpe.xml]

 but here is what I got:

 Exception in thread "main" java.lang.NoClassDefFoundError:
 org/apache/uima/util/InvalidXMLException

 at java.lang.Class.getDeclaredMethods0(Native Method)

 at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)

 at java.lang.Class.privateGetMethodRecursive(Class.java:3040)

 at java.lang.Class.getMethod0(Class.java:3010)

 at java.lang.Class.getMethod(Class.java:1776)

 at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)

 at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)

 ...
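
 A NoClassDefFoundError for org/apache/uima/util/InvalidXMLException usually means that the UIMA core jar is not on the classpath that java was started with. A small check like the one below, run with the same -cp setting as the failing command, can confirm that. It is only a diagnostic sketch; ClasspathCheck is a made-up name, and the two default class names are the ones that appear in this thread.

```java
public class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    // The class is looked up without being initialized.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        final String[] names = args.length > 0 ? args : new String[] {
                "org.apache.uima.util.InvalidXMLException",
                "org.apache.ctakes.core.cpe.CmdLineCpeRunner" };
        for (String name : names) {
            System.out.println(name + " : " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
        // Show what the JVM was actually given, to compare against the intended -cp value.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

 If the UIMA class is reported as missing, the -cp argument most likely needs to include the cTAKES lib directory as well as the desc and resources directories.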

 On Wed, Feb 4, 2015 at 8:32 AM, Lingren, Todd todd.ling...@cchmc.org
 wrote:

  Sean and Maite,
  FWIW, I use CmdLineCpeRunner frequently. I employ it with a bash
  script to automatically create a new xml file based on the subfolder
  names contained in the target directory. So in our HPC, it spawns a
  new job for each subfolder (which may have between 5 and 2500 notes).



RE: git mirrors out of sync?

2015-02-03 Thread Finan, Sean

Hi Steve,

You are right (confirming your finding) - it looks like the first is a no-show 
and the second is somebody's personal upload to github (not git.apache.org) 
from 3 years ago.  The jira claims that the item was closed (fixed), but if you 
go to http://git.apache.org/ cTakes is not listed.  Was it there previous to 
6 days ago but removed? 

If nobody responds with a "here's yer problem" by end of week then I ( or you, 
if you like) will ping infra.  I know that at least one contributor (not me) 
prefers to use git.

Sean

-Original Message-
From: Steven Bethard [mailto:steven.beth...@gmail.com] 
Sent: Tuesday, February 03, 2015 3:38 PM
To: dev@ctakes.apache.org
Subject: git mirrors out of sync?

The git mirrors for cTAKES seem to be either broken 
(http://git.apache.org/ctakes.git) or embarrassingly out of sync 
(https://github.com/apache/ctakes). Is this a known issue? I looked at the 
INFRA ticket [1], but didn't see anything that suggested that there should be 
a problem.

Steve

[1] https://issues.apache.org/jira/browse/INFRA-8553

RE: Question about the pipeline

2015-02-03 Thread Finan, Sean

Hi Tol,

 Essentially, I want to know how to set up the cTAKES objects correctly into a 
 pipeline in a Java programs, so that medical texts are annotated, like the 
 GUI is doing. I would really appreciate any hints or how to accomplish this.

Looking at your embedded code I think that you've got the general idea of how 
to do everything.  Perhaps you are wondering how to create custom pipelines by 
programmatically adding chosen processors?

Tim Miller made a great addition (imo) to the cTakes code with the 
org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory class.  Perhaps you 
can take a look at that and see if it helps?

Sean

-Original Message-
From: Tol O. [mailto:tol...@gmail.com] 
Sent: Tuesday, February 03, 2015 7:35 PM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline



Sean,

Thank you for the detailed reply.

As you mentioned, I had to revert the capital letters from your Outlook, and 
also, if somebody else wants to use the code and cannot get it to run: the 
getFilesInDir method needs to return the populated Collection<File> fileList, 
the variable final File[] fileList and its usage should be renamed to something 
else (as the variable name already exists) and the main method needs to throw 
an IOException.

I think these were all the changes I made so that the txt files from a folder 
are added to the collection, many thanks again.

What I am looking to do is also what the description in 
ExampleAggregatePipeline says: running a pipeline programmatically without 
UIMA XML descriptor files. As I understand it, this is accomplished by the 
uimaFIT classes, so that AEs can be defined in Java, added to a pipeline and 
run directly.

The uimaFIT page gives a nice Java snippet that uses uimaFIT in a similar way 
to the cTAKES example; I pasted the few Java lines below at [1]. 
http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#ugr.tools.uimafit.introduction
 

I would like to use cTAKES in my own Java programs such that, just like the 
ExampleAggregatePipeline, uimaFIT can be used to create and run a cTAKES 
pipeline to annotate medical texts. Then, I could also output the result in 
CAS files, just like the CVD GUI is doing. This would allow me to directly add 
or modify my own AnalysisEngines.

Essentially, I want to know how to set up the cTAKES objects correctly into a 
pipeline in a Java program, so that medical texts are annotated, like the GUI 
is doing. I would really appreciate any hints on how to accomplish this. 

Following your code example to read the files the outlined idea is:

for ( File file : files ) {
  final String note = getTextInFile( file );
  JCas jCas = JCasFactory.createJCas();
  jCas.setDocumentText(note);

  // 1. create the AnalysisEngines for tokenizer, tagger and other cTAKES 
components etc. to annotate medical texts
  // 2. runPipeline(jCas, ...);
}

[1]
The code snippet from uimaFIT:

JCas jCas = JCasFactory.createJCas();

jCas.setDocumentText("some text");

AnalysisEngine tokenizer = createEngine(MyTokenizer.class);

AnalysisEngine tagger = createEngine(MyTagger.class);

runPipeline(jCas, tokenizer, tagger);

for(Token token : iterate(jCas, Token.class)){
System.out.println(token.getTag());
}

Tol O.



RE: BagOfCuisGenerator.java, same idea for getConceptText()

2015-02-12 Thread Finan, Sean

Try something like the following for output:

   private int extractFeatures( final IdentifiedAnnotation annotation )  {
      // Extract the IdentifiedAnnotation itself
      final Collection<String> umlsInfos = getUmlsInfos( annotation );
      if ( umlsInfos == null || umlsInfos.isEmpty() ) {
         return 0;
      }
      final int begin = annotation.getBegin();
      final int end = annotation.getEnd();
      final String annotationText = annotation.getCoveredText();
      final int polarity = annotation.getPolarity();
      int count = 0;
      // One output line per distinct concept attached to this annotation
      for ( String umlsInfo : umlsInfos ) {
         saveAnnotation( annotationText, umlsInfo, polarity, begin, end );
         count++;
      }
      return count;
   }

   static private Collection<String> getUmlsInfos( final IdentifiedAnnotation identifiedAnnotation ) {
      final FSArray fsArray = identifiedAnnotation.getOntologyConceptArr();
      if ( fsArray == null ) {
         return Collections.emptySet();
      }
      final FeatureStructure[] featureStructures = fsArray.toArray();
      final Set<String> umlsInfos = new HashSet<String>( featureStructures.length );
      for ( FeatureStructure featureStructure : featureStructures ) {
         final OntologyConcept ontologyConcept = (OntologyConcept) featureStructure;
         String info = null;
         if ( ontologyConcept instanceof UmlsConcept ) {
            final UmlsConcept umlsConcept = (UmlsConcept) ontologyConcept;
            // Build "CUI_TUI = "preferred text"", skipping the parts that are absent
            info = umlsConcept.getCui();
            final String tui = umlsConcept.getTui();
            if ( tui != null && !tui.isEmpty() ) {
               info += "_" + tui;
            }
            final String preferredText = umlsConcept.getPreferredText();
            if ( preferredText != null && !preferredText.isEmpty() ) {
               info += " = \"" + preferredText + "\"";
            }
            umlsInfos.add( info );
         }
      }
      return umlsInfos;
   }

   public void saveAnnotation( final String spannedText, final String umlsInfo, final int polarity,
                               final int begin, final int end  )  {
      // A leading "-" marks a negated (negative polarity) annotation
      final String text = begin + "," + end + " " + (polarity < 0 ? "-" : "") 
            + umlsInfo + " " + spannedText;
      if ( _writer == null ) {
         System.out.println( text );
         return;
      }
      try {
         _writer.write( text );
         _writer.newLine();
      } catch ( IOException ioE ) {
         logger.error( ioE.getMessage() );
      }
   }
-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Thursday, February 12, 2015 2:46 PM
To: dev@ctakes.apache.org
Subject: BagOfCuisGenerator.java, same idea for getConceptText()

Hi everyone,

I am currently working on BagOfCuisGenerator, and I would like to add the 
concept text to the output.
I've seen some discussions about getting the original text and UMLS preferred 
text in addition to the cui. Can someone give me pointers on how to do that?
Thanks in advance for your time.

Maite

--
--
 Maïté Meseure Hugues

RE: BagOfCuisGenerator.java, same idea for getConceptText()

2015-02-12 Thread Finan, Sean

Oh yeah - use the -fast dictionary to get preferred text.  The fastest way to 
get cuis only is with CuisOnlyPlaintextUMLSProcessor.  If you want polarity 
make sure you uncomment the section with PolarityCleartkAnalysisEngine.

Sean

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Thursday, February 12, 2015 2:46 PM
To: dev@ctakes.apache.org
Subject: BagOfCuisGenerator.java, same idea for getConceptText()

Hi everyone,

I am currently working on BagOfCuisGenerator, and I would like to add the 
concept text to the output.
I've seen some discussions about getting the original text and UMLS preferred 
text in addition to the cui. Can someone give me pointers on how to do that?
Thanks in advance for your time.

Maite

--
--
 Maïté Meseure Hugues

RE: BagOfCuisGenerator.java, same idea for getConceptText()

2015-02-17 Thread Finan, Sean

Hi Maite,

I just checked the log and it looks like you'll need to use a copy of cTakes 
built after 12/08/2014 to get Snomed codes.

Sean

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Monday, February 16, 2015 12:19 PM
To: dev@ctakes.apache.org
Subject: Re: BagOfCuisGenerator.java, same idea for getConceptText()

Sean, I have a question: is it because I am using the fast dictionary that I 
don't get snomed-oid or snomed-code? Instead, I get snomed_oid: null#CTAKES.
Thank you.

Maite

On Fri, Feb 13, 2015 at 1:32 PM, Maite Meseure Hugues  
meseure.ma...@gmail.com wrote:

 Thank you for your replies, It's helpful. I was working on 3.2.0 
 version, so it looks like 3.2.1 allows to get the UMLS preferred text.

 Maite


RE: CTAKES mirroring on github.

2015-02-17 Thread Finan, Sean

Our request is for a read-only mirror.  However, if it ever becomes read/write, 
I don't know if this will have what you want, but http://git.apache.org/
links to documentation (mostly server setup) http://www.apache.org/dev/git.html 
and a wiki (check toward middle and bottom for committer info) 
https://wiki.apache.org/general/GitAtApache



-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Tuesday, February 17, 2015 12:31 PM
To: dev@ctakes.apache.org
Subject: Re: CTAKES mirroring on github.

Is there any existing resource to help people who want to use git understand 
the right workflow to contribute to ctakes? (i.e. how this interacts with svn 
repos).
Tim


On 02/17/2015 12:23 PM, jay vyas wrote:
 Hi CTakes.  Looks like infra finally got onto the JIRA I made for 
 this a while back.  They are currently working on fixing a couple of 
 minor glitches w/ the mirroring (not showing all commits)... but there 
 is now a mirror for CTakes on github.


 https://github.com/apache/ctakes

RE: Question about fast pipeline

2015-01-12 Thread Finan, Sean

Hi Michelle,

Did your error have only "Could not find ... as absolute", or did it also have 
"or in ... or in ..."?  If you see "... or in ..." then this is a new issue.  
If you don't, then you should update your source.  If you need to run the 
release binary then let me know and I can work out sending you a patch.

Sean

-Original Message-
From: michelle1919c...@gmail.com [mailto:michelle1919c...@gmail.com] On Behalf 
Of Michelle Chen
Sent: Monday, January 12, 2015 4:30 PM
To: dev@ctakes.apache.org
Subject: Question about fast pipeline

I'm fairly new to using cTAKES and was trying to figure out how to use the fast 
pipeline in my Java code.

I was able to run the code in Clinical Pipeline Factory with both the default 
Pipeline and the fast Pipeline. However, when I tried incorporating 
getDefaultPipeline, I get these errors:

ERROR JdbcConnectionFactory - Could not find 
resources/org/apache/ctakes/dictionary/lookup/fast/ctakessnorx/ctakessnorx.script
as absolute.

ERROR JdbcRareWordDictionary - Could not Connect to Dictionary UmlsHsqlRareWord

Has anyone else encountered this before? Is there something that I should be 
linking that I forgot to reference? Or do I just need to update the resources 
folder again?

Thank you.

---
Michelle Chen

RE: dictionary lookup config for best F1 measure [was RE: cTakes Annotation Comparison

2015-01-09 Thread Finan, Sean

Hi James,
Great question.  In truth, you may need to run a few times to find out.  Doing 
that with a full pipeline would be tedious, but there is a descriptor in 
clinical-pipeline named CuisOnlyPlaintextUMLSProcessor.xml that will only 
obtain Umls cuis.  It runs ~50,000 notes per hour on my laptop as-is, so I 
suggest that you test with that ae.  It has lvg commented out by default (for 
speed).  Adding lvg will increase the runtime, but it also will (as you know) 
find a few additional terms.   You can try a few configurations without it and 
then the best option with it.  If you want to test the default dictionary 
lookup then you can certainly swap the referenced lookup xmls.

Changes to the fast dictionary configuration are made in two places:
1.  The main descriptor ...-fast/desc/analysis_engine/UmlsLookupAnnotator.xml
2.  The resource (dictionary) configuration file 
resources/.../fast/cTakesHsql..xml

A few suggestions, in order of impact:
1.  I am guessing that the annotations in clef are human annotated with 
longest-length spans only.  In other words, "colon cancer" instead of "colon 
cancer" and "cancer".  To best approximate this style of annotation, edit the 
cTakesHsql.xml in the section rareWordConsumer and change the selected 
implementation.  By default it is DefaultTermConsumer (go figure), but you will 
want to use the commented-out PrecisionTermConsumer.  As the comment above it in 
cTakesHsql.xml indicates: DefaultTermConsumer will persist all spans.
PrecisionTermConsumer will persist only the longest overlapping span of 
any semantic group.  Doing this should increase precision, and depending upon 
how good the annotations are it should not greatly change recall.

2. Just for kicks, try using SemanticCleanupTermConsumer.  It may slightly 
increase precision, but it also may decrease recall.  Hopefully it doesn't do 
much at all (PrecisionTermConsumer and proper semantic typing in the dictionary 
should suffice without this term consumer).

3. Especially for task 2 (acronyms & abbreviations), you should try a run with 
<name>minimumSpan</name> in UmlsLookupAnnotator.xml set to 2.   This changes 
the minimum allowable span of a term.  The default is 3 to increase precision 
on acronyms & abbreviations, but decreasing to 2 may improve recall on the 
same.   The dictionary is not built with anything below 2 characters.

4.  On that note (character length), if task 1 does not include acronyms & 
abbreviations, then you can try increasing the minimum span length above 3 and 
see if there is a good increase in precision without a significant decrease in 
recall.

5.  Try a few runs with overlapping spans in addition to exact matches.  To do 
this use the OverlapJCasTermAnnotator instead of the DefaultJCasTermAnnotator 
annotator implementation.  DefaultJCasTermAnnotator is specified in 
UmlsLookupAnnotator.xml, but I will check in a descriptor for overlap matching. 
 There are additional parameters for that option, but I'll email them after I 
check in.

6.  By default the new lookup uses Sentence as the lookup window.  I did this 
for three reasons: 1. Not all terms are within Noun Phrases, 2. Some Noun Phrases 
overlapped, causing repeated lookups (in my 3.0 candidate trials), and 3. Not 
all cTakes Noun Phrases are accurate.  Because the lookup is fast, using a full 
Sentence for lookup doesn't seem to hurt much.  However, you can always switch 
it back to see if precision is increased enough to warrant the decrease in 
recall.  This is changed in UmlsLookupAnnotator.xml
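
To make suggestion 1 concrete, here is a minimal Java sketch of the "keep only the longest overlapping span" idea. It is not the PrecisionTermConsumer source: the class LongestSpanFilter, the Span type and the keepLongest method are hypothetical names invented for illustration, and the sketch ignores semantic groups entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical names, not the cTAKES PrecisionTermConsumer source.
final class LongestSpanFilter {

    /** A matched term: character offsets [begin, end) plus the covered text. */
    static final class Span {
        final int begin;
        final int end;
        final String text;

        Span(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }

        int length() {
            return end - begin;
        }

        boolean overlaps(Span other) {
            return begin < other.end && other.begin < end;
        }
    }

    /** Keeps a span only if no strictly longer span overlaps it. */
    static List<Span> keepLongest(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (Span candidate : spans) {
            boolean shadowed = false;
            for (Span other : spans) {
                if (other != candidate && other.overlaps(candidate)
                        && other.length() > candidate.length()) {
                    shadowed = true;
                    break;
                }
            }
            if (!shadowed) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> all = Arrays.asList(
                new Span(0, 12, "colon cancer"),
                new Span(6, 12, "cancer"),
                new Span(20, 25, "fever"));
        for (Span span : keepLongest(all)) {
            System.out.println(span.text);
        }
    }
}
```

With the three sample spans, "cancer" is dropped because the longer "colon cancer" overlaps it, while "fever" survives because nothing overlaps it.  Keeping every span instead would correspond to the DefaultTermConsumer behavior described above.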

I have run my own tests with the various setups, but I don't want to adversely 
influence what you run just in case the trends with the share/clef annotations 
differ.

Sean

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Friday, January 09, 2015 3:57 PM
To: 'dev@ctakes.apache.org'
Subject: dictionary lookup config for best F1 measure [was RE: cTakes 
Annotation Comparison

Sean (or others), 

Of the various configuration options described below, which values/choices 
would you recommend for best F1 measure for something like the shared clef 2013 
task?
https://sites.google.com/site/shareclefehealth/

I'm looking for something that doesn't have to be the best speed-wise, but that 
is the recommended for optimizing F1 measure.

Regards,
James 

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Friday, December 19, 2014 11:55 AM
To: dev@ctakes.apache.org; kim.eb...@imatsolutions.com
Subject: RE: cTakes Annotation Comparison

Well, I guess that it is time for me to speak up …

I must say that I’m happy that people are showing interest in the fast lookup.  
I am also happy (sort of) that some concerns are being raised – and that there 
is now community participation in my little toy.  I  have some concerns about 
what people are reporting.  This does not coincide with what I have seen at 
all.  Yesterday I started (without knowing this thread existed

RE: Question about the pipeline

2015-02-05 Thread Finan, Sean

Hi Maite,

If you can run the cpe gui using the script in bin/ , try specifying the 
descriptor for that:

runctakesCPE -desc pathToXml

If that runs then try copying the runctakesCPE to something like runctakesCLI 
and change the last line of the file to call CmdLineCpeRunner instead of 
CpmFrame.

Sean

p.s. check the last line of runctakesCPE script that you are using and make 
sure that it passes arguments: %* for Windows or $@ for unix/linux

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Thursday, February 05, 2015 9:42 AM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

Yes, it does, but only in Eclipse, not on the command line, even though I am in 
the right directory. I probably have to look at the classpath in more detail.
Thanks for your replies.
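
A classpath suspicion like this can be checked directly.  The NoClassDefFoundError quoted further down this thread names org/apache/uima/util/InvalidXMLException, so one rough test is to ask the JVM whether that class is visible when launched with the same -cp as the failing command.  The sketch below is a generic JVM check, not a cTAKES utility; ClasspathCheck and isLoadable are hypothetical names.

```java
// Illustrative only: a generic JVM check, not a cTAKES utility.
final class ClasspathCheck {

    /** Returns true if the named class can be located without initializing it. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies are missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace from this thread.
        String name = args.length > 0
                ? args[0]
                : "org.apache.uima.util.InvalidXMLException";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

If this reports false under the command-line classpath but true inside Eclipse, the jars that Eclipse adds automatically are probably missing from the -cp argument.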

On Thu, Feb 5, 2015 at 8:08 AM, Finan, Sean  sean.fi...@childrens.harvard.edu 
wrote:

 Hi Maite,

 Without more information I can't venture a guess as to a cause of the 
 error.  If RunCPE works then why not use that?  They are practically 
 identical.

 Sean
 
 From: Maite Meseure Hugues [meseure.ma...@gmail.com]
 Sent: Thursday, February 05, 2015 8:51 AM
 To: dev@ctakes.apache.org
 Subject: Re: Question about the pipeline

 I see. In my case, I am using the CPE descriptor saved from the GUI 
 for CmdLineCpeRunner, as Sean said. I've selected 
 AggregatePlaintextProcessor.xml as AE but I have this error:

 Couldn't initialize processing engine.

   Initialization of CAS Processor with name AggregatePlaintextProcessor
 failed. 

 Meanwhile, RunCPE.java works properly with the same descriptor in Eclipse.
 Does anyone have an idea?

 On Wed, Feb 4, 2015 at 12:56 PM, Lingren, Todd 
 todd.ling...@cchmc.org
 wrote:

  Hi Maite,
  For each patient in my list, I create a new FilesToFiles CPE xml 
  using some sed commands on the template original.
 
  Specifically, here's the command line argument (I'm on linux).
 
  CTAKES_HOME=...
  java -cp 
  $CTAKES_HOME/lib/*:$CTAKES_HOME/desc/:$CTAKES_HOME/resources/
  -Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms512M
 -Xmx2048M
  CmdLineCpeRunner FilesToFiles_patient_cui.xml  outputfile.txt
 
  I don't think it matters, but I'm using the cTAKES 3.1.0 version.
 
 
  Todd Lingren
  Biomedical Informatics
  Cincinnati Children’s Hospital
  todd.ling...@cchmc.org
  513-803-9032
 
 
  -Original Message-
  From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com]
  Sent: Wednesday, February 04, 2015 12:59 PM
  To: dev@ctakes.apache.org
  Subject: Re: Question about the pipeline
 
  Interesting, thank you Todd. How do you use CmdLineCpeRunner, basically?
  Because I tested on the command line with:
 
  java org.apache.ctakes.core.cpe.CmdLineCpeRunner [path-to-my-cpe.xml]
 
  but here is what I got:
 
  Exception in thread "main" java.lang.NoClassDefFoundError:
  org/apache/uima/util/InvalidXMLException
  at java.lang.Class.getDeclaredMethods0(Native Method)
  at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
  at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
  at java.lang.Class.getMethod0(Class.java:3010)
  at java.lang.Class.getMethod(Class.java:1776)
  at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
  at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
  ...
 
  On Wed, Feb 4, 2015 at 8:32 AM, Lingren, Todd 
  todd.ling...@cchmc.org
  wrote:
 
   Sean and Maite,
   FWIW, I use CmdLineCpeRunner frequently. I employ it with a bash 
   script to automatically create a new xml file based on the 
   subfolder names contained in the target directory. So in our HPC, 
   it spawns a new job for each subfolder (which may have between 5 and 2500 
   notes).
  
   Todd Lingren
   Biomedical Informatics
   Cincinnati Children’s Hospital
   todd.ling...@cchmc.org
   513-803-9032
  
  
   -Original Message-
   From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
   Sent: Tuesday, February 03, 2015 2:47 PM
   To: dev@ctakes.apache.org
   Subject: RE: Question about the pipeline
  
   Hi Maite,
  
   RunCPE is a good find, and if it fits your bill then you should use it.
   But it (if you mean the yTex class) doesn't take input and output 
   directories from the command line.  It does take the path to a 
   CPE.xml file.  There is a cTakes (non-yTex) equivalent named 
   CmdLineCpeRunner.
   Either one of them should print a usage if you run it without arguments.
   As the CmdLineCpeRunner indicates, you can create a cpe .xml file 
   with the cpe gui.  Basically, start the cpe gui, select your input 
   (reader), output (writer) and pipeline (ae) in the gui and then save the 
   cpe descriptor (via the menubar).  You can exit the gui and run either 
   one of the cmd line utilities with the path to that cpe .xml 
   descriptor as the argument.
   Please note: sometimes you have

RE: Negex

2015-01-05 Thread Finan, Sean

I don't know.  I'm comparing what I think is the 2009 negex trigger set 
https://code.google.com/p/negex/source/browse/trunk/GeneralNegEx.Java.v.1.2.05092009/negex_triggers.txt

with the cTakes trigger set in 
org.apache.ctakes.core.fsm.machine.NegationFSM.java and it looks like the 
cTakes set is missing some 2009 negex trigger words, such as "exhibit".

Anyway, you can read 
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+-+NE+Contexts for 
info on adding triggers to the cTakes version.

Sean

-Original Message-
From: John Green [mailto:john.travis.gr...@gmail.com] 
Sent: Monday, January 05, 2015 2:03 PM
To: dev@ctakes.apache.org
Cc: dev@ctakes.apache.org
Subject: Re: Negex

Thanks Ma'am for the input!


So to clarify: ctakes added additional trigger words to the list published 
originally? (This is an unrelated question to the negex vs ml thread last 
month).




Best,

John
—
Sent from Mailbox

On Mon, Jan 5, 2015 at 12:58 PM, Green, John john.gr...@usuhs.edu wrote:

 Hi all - Does anyone know off the top of their head if the negex 
 trigger rules included in the original 2009 python script were added 
 to when it was implemented in ctakes?
 Thanks,
 John

RE: Question about CPE/ descriptor and xml file.

2015-01-05 Thread Finan, Sean

Go through the error that you got, and look for a message like:

Failed to initilize.  Invalid UMLS License

and

Error: Invalid UMLS License.  A UMLS License is required to use the UMLS 
dictionary lookup. 
Error: You may request one at: https://uts.nlm.nih.gov/license.html 
Please verify your UMLS license settings in the 
DictionaryLookupAnnotatorUMLS.xml configuration.

If you see that message, you see a possible solution.  If you have a umls 
username and password, make sure that they are set correctly for the cTakes run.

If you don't see that message, check 
resources/org/apache/ctakes/dictionary/lookup/umls2011ab/umls and see if it 
contains a rather large .data file.  If not, then go through the process 
detailed at http://ctakes.apache.org/downloads.cgi in the section entitled 
Resources.

If you have the .data file, then let us know and we'll try to push forward.

Sean


-Original Message-
From: Maite Meseure Hugues [mailto:mmhug...@medmergent.com] 
Sent: Monday, January 05, 2015 9:33 AM
To: dev@ctakes.apache.org
Subject: Question about CPE/ descriptor and xml file.

Hello everyone,


I am a new user of cTakes and I would like to integrate it in my code to run it 
programmatically.

I followed the example in the cTakes package but I have an error message 
regarding the descriptor:


[...] 03 Jan 2015 13:39:33  INFO UmlsDictionaryLookupAnnotator - Using 
ctakes.umlsaddr: https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser: 
maitemeseure

Exception in thread main 
org.apache.uima.resource.ResourceInitializationException: Initialization of 
annotator class 
org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator failed.  
(Descriptor: unknown)

Do you know how I can fix that? My goal is to get the same XML output file 
as the CPE produces.
Thanks a lot for your time.

Best regards,

Maite Meseure

RE: Is it necessary to put UMLS login into files when passing them with -D to the JVM?

2015-03-06 Thread Finan, Sean

Hi Tom,

 I am passing my UMLS login and password on startup as arguments ... 
 -Dctakes.umlsuser=myusername -Dctakes.umlspw=mypassword
That is fine.  If I understand correctly you are already running this way 
without problem.  The comments in the .xml files should probably be extended to 
include mention of the cmd parameters.
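
For readers unfamiliar with the mechanism: anything passed to the JVM as -Dkey=value becomes a Java system property that any code in the process can read.  The sketch below shows that mechanism only.  UmlsCredentials and resolve are hypothetical names, the fallback argument is an assumption added for illustration, and this thread does not show how cTAKES itself resolves the two properties.

```java
// Illustrative only: a hypothetical helper, not code taken from cTAKES.
final class UmlsCredentials {

    // Property names as used on the command line in this thread.
    static final String USER_KEY = "ctakes.umlsuser";
    static final String PASS_KEY = "ctakes.umlspw";

    /** Returns the value passed as -D<key>=..., or the fallback if it was not set. */
    static String resolve(String key, String fallback) {
        String value = System.getProperty(key);
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    public static void main(String[] args) {
        String user = resolve(USER_KEY, null);
        String pass = resolve(PASS_KEY, null);
        // Never print the password itself; only report whether it was supplied.
        System.out.println("user set: " + (user != null)
                + ", password set: " + (pass != null));
    }
}
```

Running it as java -Dctakes.umlsuser=myusername -Dctakes.umlspw=mypassword UmlsCredentials should report both values as set.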


 [I] downloaded [AggregatePlaintextFastUmlsProcessor.xml] from the svn and 
 replaced the old cTAKES 3.2.1 ...
I think that this should be fine.  Java code for each annotator may have 
changed, but I don't think that any class names (by which annotators are 
called) have changed.  The best way to know for certain is to run it, and if 
you haven't seen any problems then I think that you are in good shape.

Sean

-Original Message-
From: Tom Devel [mailto:deve...@gmail.com] 
Sent: Friday, March 06, 2015 3:20 PM
To: dev@ctakes.apache.org
Subject: Is it necessary to put UMLS login into files when passing them with -D 
to the JVM?

Hi,

in AggregatePlaintextFastUMLSProcessor.xml of cTAKES it states that:

[...] Please update DictionaryLookupAnnotatorUMLS.xml file with your UMLS 
username and password.

Similarly, in AggregatePlaintextFastUMLSProcessor.xml from 
https://issues.apache.org/jira/browse/CTAKES-344
 

[...] Please update
resources/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml file with 
your UMLS username and password

I am passing my UMLS login and password on startup as arguments when starting 
either the CVD/CPE or org.apache.uima.examples.cpe.SimpleRunCPE, with 
arguments such as:

-Dctakes.umlsuser=myusername -Dctakes.umlspw=mypassword

In such a case, is it still necessary to modify the file(s) above?

Additional question: It seems that the
AggregatePlaintextFastUMLSProcessor.xml from 
https://issues.apache.org/jira/browse/CTAKES-344
  has some nice improvements (using DrugNER and default fast pipeline). I just 
downloaded it from the svn and replaced the old cTAKES 3.2.1 file with this 
one, and it seems to run just fine and cTAKES does annotations. Can somebody 
from the devs or users tell me if this manual replacement step is OK and does 
not break anything that I am not aware of?

Many thanks for answers on any of my questions, Tom

RE: Hello cTAKES Mailing List

2015-02-23 Thread Finan, Sean

The CHV is a good resource for some things, but before going through the 
motions of porting it to a ctakes format, take a look inside.  

-Original Message-
From: Pei Chen [mailto:chen...@apache.org] 
Sent: Monday, February 23, 2015 1:52 PM
To: dev@ctakes.apache.org
Subject: Re: Hello cTAKES Mailing List

Raymond,
Probably a combination of UMLS Consumer Health Vocabulary + Custom Dictionary 
(as Sean described) may work for the use case: OAC CHV connects informal, 
common words and phrases about health to technical terms used by health care 
professionals. It includes jargon, slang, ambiguous, and misspelled words as 
used by consumers and health care professionals. Due to its nature, OAC CHV 
includes concepts that are not represented by other source vocabularies within 
the Metathesaurus.

[1] http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CHV/

On Sun, Feb 22, 2015 at 10:37 AM, Finan, Sean  
sean.fi...@childrens.harvard.edu wrote:

 Hi Raymond,

 If you use the dictionary-fast module there exists an entry "feeling bad"
 with cui 557911 and cui 231218.  There is also "feel bad" and "feeling 
 bad emotionally".

 You will find "horrible present pain" but no other entry with "horrible".
  You will not find any terms with "awful" and probably many other 
 desired words.  If you are really interested in slang ("crappy", 
 "lousy", etc.) then they are definitely not present.

 What you can do is create a second dictionary.  There are example 
 custom dictionaries in 
 -dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dicti
 onary/lookup/fast/example/bsv/ You should look at custom_cui_bsv.bsv 
 if you want to specify term unique id codes and term text alone.  If 
 you want to add tui/group codes then look at custom_cui_tui_bsv.bsv  - 
 you will probably want to model your dictionary after this so that you 
 can tag your terms with tuis for symptoms.

 You will want to imitate sections from the corresponding .xml file in that
 directory.   Make a copy of cTakesHsql.xml (two dirs up) and add lines:
   <dictionary>
      <name>CustomCuiRareWord</name>
      <implementationName>org.apache.ctakes.dictionary.lookup2.BsvRareWordDictionary</implementationName>
      <properties>
         <property key="bsvPath"
                   value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
      </properties>
   </dictionary>

 And

   <conceptFactory>
      <name>CustomCuiConcept</name>
      <implementationName>org.apache.ctakes.dictionary.lookup2.concept.BsvConceptFactory</implementationName>
      <properties>
         <property key="bsvPath"
                   value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
      </properties>
   </conceptFactory>

 And

   <dictionaryConceptPair>
      <name>CustomPair</name>
      <dictionaryName>CustomCuiRareWord</dictionaryName>
      <conceptFactoryName>CustomCuiConcept</conceptFactoryName>
   </dictionaryConceptPair>

 Then make sure that you point to your custom cTakesHsql.xml in 
 dictionary-fast/desc/analysis_engine/UmlsLookupAnnotator.xml (or 
 Overlap depending upon your use):

 <name>DictionaryDescriptorFile</name>
 <description/>
 <fileResourceSpecifier>
    <fileUrl>file:org/apache/ctakes/dictionary/lookup/fast/cTakesHsqlYourCopy.xml</fileUrl>
 </fileResourceSpecifier>

 You can also skip the UMLS dictionary altogether and just use your 
 custom dictionary.

 If you do give this a try then let me know  how it goes.  If you need 
 additional assistance let me know and I will help the best I can.

 Sean

 -Original Message-
 From: Raymond Li [mailto:ray...@bu.edu]
 Sent: Saturday, February 21, 2015 1:26 PM
 To: dev@ctakes.apache.org
 Subject: Hello cTAKES Mailing List

 Hello, my name is Raymond Li and I am currently working on a team 
 project involving cTAKES. The goal of our project is to use 
 cTAKES to analyze posts on social media (such as tweets, forum posts, 
 and publicly available data) in order to catch in real time any adverse 
 effects of prescribed drugs and perform a public service by protecting 
 people from harmful drugs.

 Aside from this introduction, I have only one question to ask before 
 proceeding with this project: is cTAKES capable of understanding slang 
 words as symptoms? For example, if I were to say "I took Crestor and 
 feeling bad",
 is there a way for cTAKES to recognize that Crestor had a negative effect?
 My team has not been able to isolate 'bad' as a negative effect, as it 
 is not a defined medical symptom, but it would be nice to figure out 
 whether such a solution exists, or whether we would need to develop our 
 own solution and how we could go about doing it.

 My team and I would appreciate any comments or assistance regarding

URGENT! RE: New Website

2015-02-25 Thread Finan, Sean

Hi all,

It looks like a few people (myself included) are interested in having 
information on people, projects, papers, and applications that use cTAKES on 
the web page.  I have created a form on google that might help us collect this 
and other information.  Please visit 
https://docs.google.com/forms/d/10ryw42aqkIf2ygjNTa_To1OgGDZzDqHizVg__Jxyuws/viewform?usp=send_form
Most of the form is multiple choice, so it only takes a minute or two to 
complete it.  The more information we have the better we can develop and 
promote cTAKES, so this is very important.

Thank you,
Sean

-Original Message-
From: Mohammad Alodadi [mailto:mso1...@gmail.com] 
Sent: Wednesday, February 25, 2015 2:09 AM
To: dev@ctakes.apache.org
Subject: Re: New Website

I like the look of the new website. 
I was thinking, if someone could collect references of all the research papers, 
that use cTakes in their methodology, in a page and include the link in the use 
cases page, that would be a very great idea to see the different uses of cTakes.

Sincerely,

Mohammad Alodadi


 On Feb 24, 2015, at 8:46 PM, taposh.d@kp.org wrote:
 
 Hi Michelle -
 
 The site looks nice. 
 Would it be possible to add link to source via svn or github. 
 Also, case studies would help potential people. 
 
 Regards,
 
 Taposh D. Roy
 Health Data Lead
 Decision Support Team
 
 Kaiser Permanente
 Program Office
 1950 Franklin Street, 17th Floor
 Oakland, California 94588
 510-987-4121 (Office)
 510-206-1633 (cell)
 
 
 
 
 
 
 
 
 From:   Michelle Chen miche...@apache.org
 To: dev@ctakes.apache.org
 Date:   02/24/2015 04:30 PM
 Subject:New Website
 
 
 
 Hello everyone,
 
 We are planning on publishing the new website on March 2, 2015. Here 
 is the link to the proposed site:
 http://svn.apache.org/repos/asf/ctakes/site/new/index.html
 (Note: Not all of the pages are fully functional yet, but we figured 
 that this new look is exciting news and wanted feedback.)
 
 Some ideas for feedback:
 1. Succinct quotations from users and devs about "How has cTakes 
 helped you?" so that we can populate the "Why cTAKES?" page (with 
 permission to use information of your name, position, employer, and/or 
 product/project).
 2. Use cases (with potential screenshots) of cTAKES 
 to populate the Examples page of GUI or other use cases. The 
 examples page is in the process of being revamped.
 3. Mobile feedback: This has not been tested on devices, but what 
 would be needed/useful?
 4. What is missing from the web page? E.g. FAQs, useful tips. Where 
 are there broken links?
 5. Anything!
 
 We welcome any suggestions or code contributions directly to the website 
 itself. We look forward to hearing from everyone. Have a great day.
 
 
 Sincerely,
 Michelle Chen

RE: Hello cTAKES Mailing List

2015-02-22 Thread Finan, Sean

Hi Raymond,

If you use the dictionary-fast module there exists an entry "feeling bad" with 
cui 557911 and cui 231218.  There is also "feel bad" and "feeling bad 
emotionally".

You will find "horrible present pain" but no other entry with "horrible".   You 
will not find any terms with "awful" and probably many other desired words.  If 
you are really interested in slang ("crappy", "lousy", etc.) then they are 
definitely not present.

What you can do is create a second dictionary.  There are example custom 
dictionaries in 
-dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dictionary/lookup/fast/example/bsv/
You should look at custom_cui_bsv.bsv if you want to specify term unique id 
codes and term text alone.  If you want to add tui/group codes then look at 
custom_cui_tui_bsv.bsv  - you will probably want to model your dictionary after 
this so that you can tag your terms with tuis for symptoms.
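
As a rough illustration of what one line of such a file might carry, the Java sketch below parses a single bar-separated entry.  The CUI|TUI|text column order is an assumption, not something confirmed in this thread, so check the example custom_cui_tui_bsv.bsv before relying on it.  BsvLineParser and Entry are hypothetical names, the CUI C9000001 is made up, and T184 is used here as the UMLS type for Sign or Symptom.

```java
import java.util.regex.Pattern;

// Illustrative only: the CUI|TUI|text column order is an assumption, so check the
// example custom_cui_tui_bsv.bsv before relying on it.
final class BsvLineParser {

    private static final Pattern BAR = Pattern.compile("\\|");

    /** One dictionary entry: concept id, semantic type id, and term text. */
    static final class Entry {
        final String cui;
        final String tui;
        final String text;

        Entry(String cui, String tui, String text) {
            this.cui = cui;
            this.tui = tui;
            this.text = text;
        }
    }

    /** Parses one bar-separated line; returns null for blank or malformed lines. */
    static Entry parse(String line) {
        if (line == null || line.trim().isEmpty()) {
            return null;
        }
        String[] columns = BAR.split(line.trim(), -1);
        if (columns.length != 3) {
            return null;
        }
        String cui = columns[0].trim();
        String tui = columns[1].trim();
        String text = columns[2].trim();
        if (cui.isEmpty() || tui.isEmpty() || text.isEmpty()) {
            return null;
        }
        return new Entry(cui, tui, text);
    }

    public static void main(String[] args) {
        // Made-up CUI for a slang term, tagged with the Sign or Symptom type.
        Entry entry = parse("C9000001|T184|feeling crappy");
        System.out.println(entry.cui + " / " + entry.tui + " / " + entry.text);
    }
}
```

A small checker like this can be run over a hand-written dictionary to catch lines with a missing column before cTAKES ever loads the file.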

You will want to imitate sections from the corresponding .xml file in that 
directory.   Make a copy of cTakesHsql.xml (two dirs up) and add lines: 
  <dictionary>
     <name>CustomCuiRareWord</name>
     <implementationName>org.apache.ctakes.dictionary.lookup2.BsvRareWordDictionary</implementationName>
     <properties>
        <property key="bsvPath"
                  value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
     </properties>
  </dictionary>

And

  <conceptFactory>
     <name>CustomCuiConcept</name>
     <implementationName>org.apache.ctakes.dictionary.lookup2.concept.BsvConceptFactory</implementationName>
     <properties>
        <property key="bsvPath"
                  value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
     </properties>
  </conceptFactory>

And

  <dictionaryConceptPair>
     <name>CustomPair</name>
     <dictionaryName>CustomCuiRareWord</dictionaryName>
     <conceptFactoryName>CustomCuiConcept</conceptFactoryName>
  </dictionaryConceptPair>

Then make sure that you point to your custom cTakesHsql.xml in 
dictionary-fast/desc/analysis_engine/UmlsLookupAnnotator.xml (or Overlap 
depending upon your use):

<name>DictionaryDescriptorFile</name>
<description/>
<fileResourceSpecifier>
   <fileUrl>file:org/apache/ctakes/dictionary/lookup/fast/cTakesHsqlYourCopy.xml</fileUrl>
</fileResourceSpecifier>

You can also skip the UMLS dictionary altogether and just use your custom 
dictionary.

If you do give this a try then let me know  how it goes.  If you need 
additional assistance let me know and I will help the best I can.

Sean


-Original Message-
From: Raymond Li [mailto:ray...@bu.edu] 
Sent: Saturday, February 21, 2015 1:26 PM
To: dev@ctakes.apache.org
Subject: Hello cTAKES Mailing List

Hello, my name is Raymond Li and I am currently working on a team project 
involving cTAKES. The goal of our project is to use cTAKES to analyze 
posts on social media (such as tweets, forum posts, and publicly available 
data) in order to catch in real time any adverse effects of prescribed drugs 
and perform a public service by protecting people from harmful drugs.

Aside from this introduction, I have only one question to ask before proceeding 
with this project: is cTAKES capable of understanding slang words as symptoms? 
For example, if I were to say "I took Crestor and feeling bad",
is there a way for cTAKES to recognize that Crestor had a negative effect?
My team has not been able to isolate 'bad' as a negative effect, as it is not a 
defined medical symptom, but it would be nice to figure out whether such a 
solution exists, or whether we would need to develop our own solution and how 
we could go about doing it.

My team and I would appreciate any comments or assistance regarding our project 
and this current issue. Thank you and have a nice day!

--
Sincerely,

Raymond Li

RE: ctakessorx for AggregatePlaintextFastUMLSProcessor.xml

2015-03-27 Thread Finan, Sean

Maite,

You already have a thread going with me offline.  If you have a question please 
ask it on that thread to refrain from spamming the devlist.  Until I have a 
chance to create decent documentation you are stuck with me.

Sean

From: Maite Meseure Hugues [meseure.ma...@gmail.com]
Sent: Friday, March 27, 2015 3:59 PM
To: dev@ctakes.apache.org
Subject: ctakessorx for AggregatePlaintextFastUMLSProcessor.xml

Hi everyone,

I am currently using AggregatePlaintextFastUMLSProcessor.xml and trying to
use my own dictionary. I would like to understand ctakessnorx script file,
how it's made etc, I didn't find any info. Thank you.

--
--
 Maïté Meseure Hugues

RE: Prep for upcoming cTAKES 3.2.2 Patch Release

2015-04-30 Thread Finan, Sean

+1 for pushing forward

I may have been one of the voices commenting on memory bloat, but I agree with 
Pei re: improving the new.  The more use, the more attention and more 
improvement (hopefully).  I can't speak of the accuracy old v. new as I haven't 
actually comparatively tested them.  And there is always the option of manually 
selecting another component.

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] 
Sent: Thursday, April 30, 2015 10:25 AM
To: dev@ctakes.apache.org
Subject: RE: Prep for upcoming cTAKES 3.2.2 Patch Release

My vote would be to push forward.
The old assertion module also had its share of bugs/issues, and pushing forward 
gives an incentive to improve the new models.  
And there's currently always the option for a user to easily revert back to the 
old since it's not removed yet...
--Pei

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Thursday, April 30, 2015 9:14 AM
To: dev@ctakes.apache.org
Subject: Re: Prep for upcoming cTAKES 3.2.2 Patch Release

A question about the default pipelines. There has been some concern about the 
new assertion modules (the machine learning ones that I worked on), partially 
due to some less intuitive error modes than negex and partially due to its 
reliance on the dependency parser which increases the memory footprint 
substantially. Should we consider reverting to the rule-based negation for the 
default pipeline (thus also removing the dependency parser from the default 
pipeline)? I'm not sure what that would mean for the other assertion modules 
(uncertainty, generic, subject, hypothetical) -- but I think it means they 
would not exist.

I can see arguments both ways. I also think if we revert we would want to have 
some way for people to access all the machine learning assertion modules if 
they want them.

Tim


On 04/29/2015 06:04 PM, Chen, Pei wrote:
 FYI- I will plan to create a 3.2.2 branch from trunk this week in prep for 
 the 3.2.2 release so others can continue their work in trunk.
 Feel free to put any changes in trunk now if you want to have it included in 
 the 3.2.2 patch release.
 The main changes are:

 1)  Improved temporal models

 2)  Minor bug fixes reported in Jira

 From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
 Sent: Thursday, March 12, 2015 12:55 PM
 To: dev@ctakes.apache.org
 Subject: Prep for upcoming cTAKES 3.2.2 Patch Release

 I was thinking of creating a 3.2.2 release for Mar (it's long past the 
 original Jan date?)  I can volunteer to be the RM again.
 There are still plenty of unresolved items... If you plan to have anything 
 you would like included in the upcoming release, please mark it in Jira and 
 plan the commits accordingly...

 Jira Items:
 https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%203.2.2%20AND%20project%20%3D%20CTAKES
 1-25 of 25

 Key:        CTAKES-349 (https://issues.apache.org/jira/browse/CTAKES-349)
 Type:       Bug
 Summary:    JdbcWriterTemplate does not store rows if there are fewer than 100 per note
 Assignee:   Unassigned
 Reporter:   Sean Finan
 Priority:   Major
 Status:     OPEN
 Resolution: Unresolved
 Created:    12/Mar/15
 Updated:    12/Mar/15

 Key:        CTAKES-347 (https://issues.apache.org/jira/browse/CTAKES-347)
 Type:       Bug

RE: build tool suggestion

2015-05-06 Thread Finan, Sean

Your IDE should have settings that allow custom warnings.  Also check out 
findbugs -- http://en.wikipedia.org/wiki/FindBugs

There might be a configurable maven plugin.  

It is a process ...

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Tuesday, May 05, 2015 8:01 PM
To: dev@ctakes.apache.org
Subject: build tool suggestion


Do you know offhand, would it be easy to have something run at build time that 
flags uses of FileReader?

Related - do we have anything at build time that produces warnings that are 
looked at?  When I check in a change, I just check whether the next build is 
successful or not.  I don't look for warnings other than what I see when I try 
a compile of my own on my own system.  Ideally I think it would be good to have 
the use of FileReader cause a meaningful warning.  But if there's no relatively 
easy way to do that, might we consider having it cause a build failure?  I 
think the benefits would outweigh the drawbacks.

-- James
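
One way to approach the question above is a small standalone check that a build step could run. The following is only a sketch under stated assumptions: the class name FileReaderCheck and its methods are hypothetical and not part of cTAKES, and the check is a plain text match, so it will also flag FileReader mentions inside comments and will miss fully qualified uses such as new java.io.FileReader(...).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical build-time check (not part of cTAKES): flags "new FileReader(" in Java sources. */
public final class FileReaderCheck {

   // Matches "new FileReader(" with arbitrary whitespace between the tokens.
   private static final Pattern FILE_READER = Pattern.compile( "\\bnew\\s+FileReader\\s*\\(" );

   /** Returns one "fileName:lineNumber" entry per line that constructs a FileReader. */
   public static List<String> findInLines( final String fileName, final List<String> lines ) {
      final List<String> hits = new ArrayList<>();
      for ( int i = 0; i < lines.size(); i++ ) {
         if ( FILE_READER.matcher( lines.get( i ) ).find() ) {
            hits.add( fileName + ":" + ( i + 1 ) );
         }
      }
      return hits;
   }

   /** Walks a source tree and collects hits from every .java file (files are assumed to be UTF-8). */
   public static List<String> scan( final Path root ) throws IOException {
      final List<Path> javaFiles;
      try ( Stream<Path> paths = Files.walk( root ) ) {
         javaFiles = paths.filter( p -> p.toString().endsWith( ".java" ) )
                          .collect( Collectors.toList() );
      }
      final List<String> hits = new ArrayList<>();
      for ( Path file : javaFiles ) {
         hits.addAll( findInLines( file.toString(), Files.readAllLines( file, StandardCharsets.UTF_8 ) ) );
      }
      return hits;
   }

   public static void main( final String[] args ) throws IOException {
      final List<String> hits = scan( Paths.get( args.length > 0 ? args[ 0 ] : "." ) );
      hits.forEach( hit -> System.err.println( "FileReader used at " + hit ) );
      if ( !hits.isEmpty() ) {
         System.exit( 1 );  // a non-zero exit code lets the calling build step fail
      }
   }
}
```

Whether a hit should only warn or should fail the build is the policy question raised above; the sketch fails the build, and removing the System.exit call would turn it into a warning-only report.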


From: Chen, Pei [pei.c...@childrens.harvard.edu]
Sent: Tuesday, May 05, 2015 5:55 PM
To: dev@ctakes.apache.org
Subject: RE: svn commit: r1677903 - in 
/ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2:
 concept/BsvConceptFactory.java dictionary/BsvRareWordDictionary.java 
util/JdbcConnectionFactory.java

Can we use InputStreamReader instead of FileReader?
That way the resource can also be read from within a jar (potentially from 
maven central, etc.) and doesn't have to be fixed to a physical file...

i.e., instead of
new BufferedReader(new FileReader(path))
use
new BufferedReader(new InputStreamReader(FileLocator.getAsStream(path)))

--Pei
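
The difference between the two forms can be sketched with plain JDK calls. This is a minimal illustration, not the cTAKES implementation: ResourceReaderSketch and its methods are hypothetical stand-ins for what FileLocator.getAsStream is described as doing above, and the lookup order (file system first, then classpath) is an assumption.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical stand-in showing why a stream-based reader also works for resources inside a jar. */
public final class ResourceReaderSketch {

   /** Opens the location as a regular file if one exists, otherwise as a classpath resource. */
   public static InputStream openStream( final String location ) throws IOException {
      final Path path = Paths.get( location );
      if ( Files.isRegularFile( path ) ) {
         return Files.newInputStream( path );
      }
      // A classpath lookup also finds entries packed inside a jar, which FileReader cannot open.
      final InputStream stream
            = ResourceReaderSketch.class.getClassLoader().getResourceAsStream( location );
      if ( stream == null ) {
         throw new FileNotFoundException( "Not found as file or classpath resource: " + location );
      }
      return stream;
   }

   /** Wraps the stream in a reader with an explicit charset (UTF-8 is assumed here). */
   public static BufferedReader openReader( final String location ) throws IOException {
      return new BufferedReader(
            new InputStreamReader( openStream( location ), StandardCharsets.UTF_8 ) );
   }
}
```

A side benefit of InputStreamReader is the explicit charset; FileReader (before Java 11) always uses the platform default encoding.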

-Original Message-
From: seanfi...@apache.org [mailto:seanfi...@apache.org]
Sent: Tuesday, May 05, 2015 6:42 PM
To: comm...@ctakes.apache.org
Subject: svn commit: r1677903 - in 
/ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2:
 concept/BsvConceptFactory.java dictionary/BsvRareWordDictionary.java 
util/JdbcConnectionFactory.java

Author: seanfinan
Date: Tue May  5 22:41:26 2015
New Revision: 1677903

URL: 
http://svn.apache.org/r1677903
Log:
Use FileLocator to find BSV dictionaries

Modified:

ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java

ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/dictionary/BsvRareWordDictionary.java

ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/util/JdbcConnectionFactory.java

Modified: 
ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java
URL: 
http://svn.apache.org/viewvc/ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java?rev=1677903&r1=1677902&r2=1677903&view=diff
==
--- ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java (original)
+++ ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java Tue May  5 22:41:26 2015
@@ -1,5 +1,6 @@
 package org.apache.ctakes.dictionary.lookup2.concept;

+import org.apache.ctakes.core.resource.FileLocator;
 import org.apache.ctakes.dictionary.lookup2.util.CuiCodeUtil;
 import org.apache.ctakes.dictionary.lookup2.util.LookupUtil;
 import org.apache.ctakes.dictionary.lookup2.util.TuiCodeUtil;
@@ -34,11 +35,12 @@ final public class BsvConceptFactory imp
}

public BsvConceptFactory( final String name, final String bsvFilePath ) {
-  this( name, new File( bsvFilePath ) );
-   }
-
-   public BsvConceptFactory( final String name, final File bsvFile ) {
-  final Collection<CuiTuiTerm> cuiTuiTerms = parseBsvFile( bsvFile );
+//  this( name, new File( bsvFilePath ) );
+//   }
+//
+//   public BsvConceptFactory( final String name, final File bsvFile ) {
+//  final Collection<CuiTuiTerm> cuiTuiTerms = parseBsvFile( bsvFile );
+  final Collection<CuiTuiTerm> cuiTuiTerms = parseBsvFile( bsvFilePath );
   final Map<Long, Concept> conceptMap = new HashMap( cuiTuiTerms.size() );
   for ( CuiTuiTerm cuiTuiTerm : cuiTuiTerms ) {

RE: UMLS Authentication failing despite correct username and password

2015-05-11 Thread Finan, Sean

Hi Pedro,

Check the cTakesHsql.xml and make sure that the line matches:

<property key="umlsUrl" value="https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser"/>

In an older version of cTAKES, one that prints the same log message you have:
11 May 2015 15:59:47  INFO AbstractJCasTermAnnotator - Default - Loading 
dictionary into memory.  Initial run may take few mins to load. Please be 
patient...
that line got corrupted.

Sean

-Original Message-
From: Pedro Teixeira [mailto:teixeir...@gmail.com] 
Sent: Monday, May 11, 2015 5:30 PM
To: dev@ctakes.apache.org
Subject: UMLS Authentication failing despite correct username and password

So I've checked the Dictionary lookup XML file and that password works to log 
in via the website. This was also working last week but stopped at some point 
over the last week. I've got cTAKES running on a linux system so I can index 
batches of documents via a script. The exact error is as follows (with the 
username/password blocked out).

11 May 2015 15:59:26  INFO LvgCmdApiResourceImpl - cwd =
/home/PT/cTAKES/apache-ctakes-3.2.1
11 May 2015 15:59:26  INFO LvgCmdApiResourceImpl - cd 
/home/PT/cTAKES/apache-ctakes-3.2.1/resources/org/apache/ctakes/lvg/
11 May 2015 15:59:27  INFO LvgCmdApiResourceImpl - cd
/home/PT/cTAKES/apache-ctakes-3.2.1
11 May 2015 15:59:27  INFO ClearNLPDependencyParserAE - using Morphy analysis? 
true Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:

11 May 2015 15:59:42  INFO Chunker - Chunker model file:
org/apache/ctakes/chunker/models/chunker-model.zip
11 May 2015 15:59:44  INFO ContextDependentTokenizerAnnotator - Finite state 
machines loaded.
11 May 2015 15:59:44  INFO ConstituencyParser - Initializing parser...
11 May 2015 15:59:46  INFO ContextAnnotator - SCOPE ORDER: [1, 3]
11 May 2015 15:59:46  INFO NegationContextAnalyzer - initBoundaryData() called 
for ContextInitializer
11 May 2015 15:59:47  INFO POSTagger - POS tagger model file:
org/apache/ctakes/postagger/models/mayo-pos.zip
11 May 2015 15:59:47  INFO AbstractJCasTermAnnotator - Default - Loading 
dictionary into memory.  Initial run may take few mins to load. Please be 
patient...
11 May 2015 15:59:47  INFO AbstractJCasTermAnnotator - Using dictionary lookup 
window type: org.apache.ctakes.typesystem.type.textspan.Sentence
11 May 2015 15:59:47  INFO AbstractJCasTermAnnotator - Exclusion tagset
loaded: CC CD DT EX IN LS MD PDT POS PP PP$ PRP PRP$ RP TO VB VBD VBG VBN VBP 
VBZ WDT WP WPS WRB
11 May 2015 15:59:47  INFO AbstractJCasTermAnnotator - Using minimum term text 
span: 3
11 May 2015 15:59:47  INFO DictionaryDescriptorParser - Parsing dictionary
specifications:
/home/PT/cTAKES/apache-ctakes-3.2.1/resources/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml
11 May 2015 15:59:48 ERROR UmlsUserApprover - UMLS Account at 
https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser
  is not valid for user # with ## Couldn't 
initialize processing engine.
  Initialization of CAS Processor with name 
AggregatePlaintextFastUMLSProcessor failed.


I also have a test implementation on a local Windows 8 laptop that now fails 
with the same error, so it seems like a UMLS-related issue. I haven't heard 
back from them yet and was hoping that someone working with cTAKES has 
previously experienced and resolved the issue.

Thanks!

RE: UMLS Authentication failing despite correct username and password

2015-05-11 Thread Finan, Sean

Argh.  Our email server may have mucked with the url that I pasted:

H t t p s : / / uts - ws . nlm . nih . gov / restful / isValidUMLSUser

<property key="umlsUrl" value="INSERT URL HERE, NO SPACES"/>


RE: build tool suggestion

2015-05-06 Thread Finan, Sean

I understood that.  I check warnings before checkin.  You can do a search for 
something like https://wiki.jenkins-ci.org/display/JENKINS/Warnings+Plugin

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Wednesday, May 06, 2015 10:58 AM
To: 'dev@ctakes.apache.org'
Subject: RE: build tool suggestion

Sorry, I wasn't clear, when I said at build time, I meant the Jenkins 
automated build. 


RE: UMLS Authentication failing despite correct username and password

2015-05-14 Thread Finan, Sean

Hi Pedro,

 B). If the user has already downloaded the UMLS isn't that already indicative 
 that they had a valid account?
As I understand it (I wasn't around at the time), this per-user licensing with a 
just-in-time (jit) check was the deal that was worked out with the NLM.  I think 
that repackaging and redistributing any form of the UMLS was not (legally) done 
before ctakes worked out the current arrangement.  
I think I have heard that ytex had an initial check upon installation, and we 
have talked about (and would like to) use this model.  The only drawback is the 
possibility of a single download being distributed to multiple install sites, 
which NLM didn't like.
My information could be woefully outdated or just plain wrong.  If anybody out 
there knows better then please chip in.

Sean

P.S.  If anybody would like to try to advocate a different arrangement with the 
NLM then that would be great.

-Original Message-
From: Pedro [mailto:teixeir...@gmail.com] 
Sent: Thursday, May 14, 2015 9:43 AM
To: dev@ctakes.apache.org
Subject: Re: UMLS Authentication failing despite correct username and password

Agreed. Doing a direct string comparison seems like it will just break at the 
very next update.

A). A check to parse the XML result looking for a result tag and that the 
contents are True seems better

B). I'm not familiar with the history of that particular check but it seems 
overly restrictive to require a valid UMLS account check for every single run. 
If the user has already downloaded the UMLS isn't that already indicative that 
they had a valid account? I realize there are more ways around it in that case 
but requiring an internet connection just to run one of the UMLS analysis 
engines seems... suboptimal.

Thanks for all the help sorting this out!
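
Option A above could look roughly like the following. This is a sketch under assumptions, not the fix that was checked in: UmlsResultParserSketch and isTrueResult are hypothetical names, and the only response shapes assumed are the two quoted in this thread (a Result element containing true, with or without an XML declaration in front).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: parses the validation response instead of comparing raw strings. */
public final class UmlsResultParserSketch {

   /** Returns true only if the body is XML whose root element is Result with text "true". */
   public static boolean isTrueResult( final String responseBody ) {
      if ( responseBody == null || responseBody.trim().isEmpty() ) {
         return false;
      }
      try {
         final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
         // The response needs no DTD, so refuse DOCTYPE declarations outright.
         factory.setFeature( "http://apache.org/xml/features/disallow-doctype-decl", true );
         final DocumentBuilder builder = factory.newDocumentBuilder();
         final Document document = builder.parse(
               new ByteArrayInputStream( responseBody.trim().getBytes( StandardCharsets.UTF_8 ) ) );
         final Element root = document.getDocumentElement();
         return "Result".equalsIgnoreCase( root.getTagName() )
                && "true".equalsIgnoreCase( root.getTextContent().trim() );
      } catch ( Exception e ) {
         // Anything that is not well-formed XML counts as "not valid".
         return false;
      }
   }
}
```

Because the XML declaration is handled by the parser, the check would keep working whether or not the service sends it, which is the fragility pointed out above for the direct string comparison.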

RE: UMLS Authentication failing despite correct username and password

2015-05-12 Thread Finan, Sean

Hi Michal,

Thank you very much for pinpointing the problem.  Pei created Jira CTAKES-359.  
I checked in a fix for both the -old- and -fast- dictionary lookups.  I also 
reported the problem to the UMLS people and forwarded your discovery to their 
mailing list.  

Unfortunately, all ctakes users need to upgrade to today's trunk version - or 
at least incorporate the required changes.  Pei is making sure that it gets 
moved into the release candidate.

Cheers,
Sean

-Original Message-
From: michal.iglew...@uqo.ca [mailto:michal.iglew...@uqo.ca] 
Sent: Monday, May 11, 2015 11:27 PM
To: dev@ctakes.apache.org
Subject: RE: UMLS Authentication failing despite correct username and password

Hi Pedro and Sean,



It seems to me that the service 
https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser
now returns <?xml version='1.0' encoding='UTF-8'?><Result>true</Result> 
instead of <Result>true</Result>. It means that the line

result = line.trim().equalsIgnoreCase("<Result>true</Result>");

in isValidUMLSUser() should be replaced with

result = line.trim().equalsIgnoreCase("<?xml version='1.0' encoding='UTF-8'?><Result>true</Result>");



Michal




RE: DB DictionaryLookupAnnotator sqlserver exception

2015-04-15 Thread Finan, Sean

Hi Alex,

This is some pretty odd behavior.  Obviously, it is indicating that the 
resource type loaded or specified is not the correct class.  Specification is 
(for the standard UMLS pipeline) in 
ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml 
lines #226 and #289.  Both should be

<implementationName>org.apache.ctakes.core.resource.JdbcConnectionResourceImpl</implementationName>

There is an identical specification on line #352, but that is for Orangebook 
which (I'm pretty sure) is no longer used and I think that this is one of a 
couple sections that was missed during refactoring, so you can ignore it.

If you are running from source then you could try editing 
org.apache.ctakes.dictionary.lookup.ae.LookupParseUtilities.java lines #140, 
#141 and add to the exception message something like
+ " instead of " + ( extResrc == null ? "NULL" : extResrc.getClass().getName() )
to find out what it thinks that it has underfoot.

Sean
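
The suggested change to the exception message can be tried in isolation. This is a standalone sketch, not the LookupParseUtilities code: ResourceTypeMessageSketch and describeMismatch are hypothetical names, and only the message wording is modeled on the stack trace quoted below.

```java
/** Hypothetical helper showing the extended diagnostic message. */
public final class ResourceTypeMessageSketch {

   /** Builds a message naming both the expected type and the class that was actually found. */
   public static String describeMismatch( final Class<?> expected, final Object actual ) {
      return "Expected external resource to be: " + expected
             + " instead of " + ( actual == null ? "NULL" : actual.getClass().getName() );
   }

   public static void main( final String[] args ) {
      // Runnable stands in for the expected resource interface in this demo.
      System.out.println( describeMismatch( Runnable.class, "some other object" ) );
      System.out.println( describeMismatch( Runnable.class, null ) );
   }
}
```

Seeing the actual class name (or NULL) distinguishes a resource that failed to load from one that loaded with the wrong implementation class.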


From: Milinovich, Alex [mailto:mili...@ccf.org]
Sent: Wednesday, April 15, 2015 12:50 PM
To: dev@ctakes.apache.org
Subject: DB DictionaryLookupAnnotator sqlserver exception

Attempting to use the sqlserver jdbc connection for the 
DictionaryLookupAnnotator.  When loading the aggregate engine, the connection 
is established fine, but then it gives the error -

java.lang.Exception: Expected external resource to be:interface 
org.apache.ctakes.core.resource.JdbcConnectionResource
at 
org.apache.ctakes.dictionary.lookup.ae.LookupParseUtilities.parseDictionaryXml(LookupParseUtilities.java:140)
at 
org.apache.ctakes.dictionary.lookup.ae.LookupParseUtilities.parseDictionaries(LookupParseUtilities.java:94)
at 
org.apache.ctakes.dictionary.lookup.ae.LookupParseUtilities.parseDescriptor(LookupParseUtilities.java:80)
at 
org.apache.ctakes.dictionary.lookup.ae.DictionaryLookupAnnotator.configInit(DictionaryLookupAnnotator.java:88)
... 26 more


Any ideas as to why this isn't working?


Alex Milinovich  |  System Analyst III  |  Quantitative Health Sciences
9500 Euclid Ave. - JJN3 | Cleveland, OH 44195 | p: (216) 444-9931 | m: (216) 
245-7655





RE: TimeLanes

2015-06-22 Thread Finan, Sean

Hi Maashu,

TimeLanes is currently a prototype GUI under development, and there is probably 
no information about it on the web.  It is in sandbox because it isn't part of 
the ctakes release and is missing much-needed functionality.  For instance, it 
should display basic information about the patient and note (name, birth date, 
note date), but such things are often in structured data or some custom header 
of the note.  Right now TimeLanes does not fetch them at all (it will require 
custom readers) and just displays "Dan Testing".

If you want to run it, the main class is 
org.chboston.cnlp.timeline.gui.main.TimelineMain .  Upon startup it will 
display "open a note".  You can use the "Open" button or drag a file into the 
box.  Unfortunately, it does not yet run ctakes (coming soon), so you need to 
give it an annotated (Protégé or Anafora) note or .xmi .  Using an .xmi would 
probably be easiest as you can create it with ctakes.  You can watch an 
outdated video here:  
https://www.youtube.com/watch?v=Kp9YE0o3urU&feature=youtu.be

Sean

-Original Message-
From: maa...@gmail.com [mailto:maa...@gmail.com] 
Sent: Friday, June 12, 2015 1:18 PM
To: dev@ctakes.apache.org
Subject: TimeLanes

Hi All,

I've just started working with cTAKES and was curious about TimeLanes.  I found 
it in the sandbox here:

https://svn.apache.org/repos/asf/ctakes/sandbox/timelanes/
 

But I'm lost on how to actually use it.  I've googled around but there seems to 
be very little information on it.

Can anyone point me in the right direction?

Thanks in advance!

Cheers,

-Maashu

--
If you are immune to boredom, there is literally nothing you cannot 
accomplish.

-David Foster Wallace

RE: RareWord term

2015-06-22 Thread Finan, Sean

Hi Maite,
I hope to have a paper out on this soon, so I am keeping things kind of quiet 
about it - though one can always look at the database and code to get an idea 
of what it means.
For anything else in the module, you can look at the wiki page:

https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+-+Fast+Dictionary+Lookup
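
For readers who only want the general idea behind the quoted comment ("lookup terms by the most rare word within them"), here is a minimal sketch.  It is an assumption about the general technique, not the actual cTAKES code: the class name RareWordIndex, the use of in-dictionary word counts as the rarity measure, and the whitespace tokenization are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: index dictionary terms by their least frequent word. */
public class RareWordIndex {
    private final Map<String, List<String>> termsByRareWord = new HashMap<>();

    public RareWordIndex(Collection<String> terms) {
        // Count in how many terms each word appears.
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            for (String word : new HashSet<>(Arrays.asList(tokenize(term)))) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        // File each term under its rarest word (on a tie, the earliest word wins).
        for (String term : terms) {
            String rare = null;
            for (String word : tokenize(term)) {
                if (rare == null || counts.get(word) < counts.get(rare)) {
                    rare = word;
                }
            }
            if (rare != null) {
                termsByRareWord.computeIfAbsent(rare, k -> new ArrayList<>()).add(term);
            }
        }
    }

    /** Lowercases and splits on whitespace; an empty string yields no tokens. */
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Returns the terms whose rare word occurs in the text and whose full word sequence is present. */
    public List<String> lookup(String text) {
        List<String> words = Arrays.asList(tokenize(text));
        String padded = " " + String.join(" ", words) + " ";
        List<String> hits = new ArrayList<>();
        for (String word : new LinkedHashSet<>(words)) {
            for (String term : termsByRareWord.getOrDefault(word, Collections.emptyList())) {
                String needle = " " + String.join(" ", tokenize(term)) + " ";
                if (padded.contains(needle)) {
                    hits.add(term);
                }
            }
        }
        return hits;
    }
}
```

The point of such an index is that only the few terms filed under a word actually seen in the note need a full comparison, instead of every term that shares a common word.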

Sean

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Thursday, June 18, 2015 12:02 PM
To: dev@ctakes.apache.org
Subject: RareWord term

Hi everyone,

I am currently using UmlsJdbcRareWordDictionary and I would like to better 
understand how the rare word term is chosen. I found this comment: 
'Dictionary used to lookup terms by the most rare word within them', but no 
further explanation. Does anyone have any pointers?
Thank you in advance.

Maite

RE: TimeLanes

2015-06-22 Thread Finan, Sean

Just for clarification, TimeLanes does consume ctakes output (.xmi), but it 
does not produce it.  In other words, you cannot hand it a plain text file and 
expect automatic processing.  Yet.

-Original Message-
From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu] 
Sent: Monday, June 22, 2015 3:02 PM
To: dev@ctakes.apache.org
Subject: RE: TimeLanes

The cTAKES temporal component is in the main release. You can get the system 
output, but as Sean said TimeLanes does not consume it yet.

A demo of the cTAKES temporal component can be found in Getting Started - 
Demos. Pei just put it up there, thank you very much, Pei!
--Guergana


RE: cTakes - hsqldb connection problem

2015-06-02 Thread Finan, Sean

Hi Pankaj,

I haven't seen this exact error before.  I guess that my first steps toward a 
possible remedy would be:
- check for existence of 
/org/apache/ctakes/dictionary/lookup/umls2011ab/umls.properties
- make sure that it (resources/) is in your classpath
- see if it looks like any of the umls2011ab/ files were not fully downloaded 
(ls -l : 99069136, 410610240, 1295, 705)
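
The first and third checks above can be sketched in a few lines of Java.  This is only a hypothetical helper (the class ResourceCheck and its method names are not part of cTAKES); which of the listed sizes belongs to which umls2011ab/ file is not stated here, so the size check takes the expected value as a parameter.

```java
import java.io.IOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical troubleshooting helper; not part of cTAKES. */
public class ResourceCheck {

    /** Returns the URL of a classpath resource, or null if it cannot be found. */
    public static URL findOnClasspath(String resourcePath) {
        // A leading '/' makes Class.getResource resolve from the classpath root.
        return ResourceCheck.class.getResource(resourcePath);
    }

    /** True if the file exists and has exactly the expected size in bytes. */
    public static boolean hasExpectedSize(Path file, long expectedBytes) throws IOException {
        return Files.isRegularFile(file) && Files.size(file) == expectedBytes;
    }

    public static void main(String[] args) {
        String path = "/org/apache/ctakes/dictionary/lookup/umls2011ab/umls.properties";
        URL url = findOnClasspath(path);
        System.out.println(url == null ? "NOT on classpath: " + path : "Found: " + url);
    }
}
```

Run it with the same classpath as the failing application; a "NOT on classpath" result would point at the second check (resources/ missing from the classpath).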

I looked at the hsql source a little bit and can't really make heads or tails 
of why you'd get a 452 error (file input/output) associated with a null pointer 
exception (NPE) with the file path actually listed.  I didn't look too far into 
the tree but it doesn't look like it is thrown by any of the main entry points.

Do you have more than one version of hsql installed?  I only ask because the 
single report of a similar error message (452: NPE) that I found on the web 
said it was solved once all the versions were made the same.  It doesn't make 
sense to me, but it is something to check.

Sean


-Original Message-
From: Pankaj Shinde [mailto:pankaj.shi...@krixi.com] 
Sent: Tuesday, June 02, 2015 2:46 AM
To: dev@ctakes.apache.org
Subject: cTakes - hsqldb connection problem

Hi,

I have done the following to get cTakes working.

1. Created a java project
2. Created a java class
3. Instantiated the BagOfCUIsGenerator class with two arguments, input folder 
and output folder.
4. Added all required files to this java project.

When I try to run the application I get the following error.
I ran the application in 'Debug' mode and traced the exception.
I found out that the exception is raised in JdbcConnectionResourceImpl.java at 
line number 109, where iv_conn is null.
It seems that the application is not properly connecting to the hsqldb database.

Error is as follows

Loading model:
.
Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:

Loading model:
.
Exception in thread "main" org.apache.uima.resource.ResourceInitializationException
    at org.apache.ctakes.core.resource.JdbcConnectionResourceImpl.load(JdbcConnectionResourceImpl.java:130)
    at org.apache.uima.resource.impl.ResourceManager_impl.registerResource(ResourceManager_impl.java:603)
    at org.apache.uima.resource.impl.ResourceManager_impl.initializeExternalResources(ResourceManager_impl.java:442)
    at org.apache.uima.resource.Resource_ImplBase.initialize(Resource_ImplBase.java:153)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.initialize(AnalysisEngineImplBase.java:157)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:123)
    at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
    at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
    at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
    at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:387)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:254)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:431)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:375)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:185)
    at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
    at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
    at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
    at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:314)
    at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:425)
    at org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineFromPath(AnalysisEngineFactory.java:773)
    at org.apache.ctakes.clinicalpipeline.runtime.BagOfAnnotationsGenerator.<init>(BagOfAnnotationsGenerator.java:60)
    at org.apache.ctakes.clinicalpipeline.runtime.BagOfAnnotationsGenerator.<init>(BagOfAnnotationsGenerator.java:54)
    at org.apache.ctakes.clinicalpipeline.runtime.BagOfCUIsGenerator.<init>(BagOfCUIsGenerator.java:34)
    at com.krixi.cTakesDemo.main(cTakesDemo.java:12)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.sql.SQLException: File input/output error /org/apache/ctakes/dictionary/lookup/umls2011ab/umls.properties java.lang.NullPointerException
    at

RE: The fast dictionary pipeline vs. the regular one

2015-06-29 Thread Finan, Sean

Hi Oranit,

 Each is the Preferred Term in at least one of the 150 sources in the 
Metathesaurus. Neither is from a WHO vocabulary source. The terms are related 
in that Glioblastoma is the Broader term (RB) of the 2 and Glioblastoma 
Multiforme is the Narrower term (RN).

Hmmm, I'm not sure why they assigned narrower and broader ... The two are from 
different source dictionaries and not related in such a manner.  Again, the WHO 
term is from the Mesh and NCI sources, while the full GBM spell-out is from 
CSP.  None are from the source named WHO (for adverse drugs).  See 
http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html

The WHO classification scheme does not have glioblastoma multiforme at all, just 
glioblastoma.  Hence there cannot be a hierarchical relationship in that 
ontology.  Check the paper on the latest WHO classification of brain tumours: 
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1929165/ 
Or check the definition from the National Brain Tumor Society's  Tumor Types 
page: http://www.abta.org/brain-tumor-information/types-of-tumors/
 Astrocytoma Grade IV (also called Glioblastoma, previously named 
“Glioblastoma Multiforme,” “Grade IV Glioblastoma,” and “GBM”)— There are two 
types of astrocytoma grade IV—primary, or de novo, and secondary. Primary 
tumors are very aggressive and the most common form of astrocytoma grade IV. 
The secondary tumors are those which originate as a lower-grade tumor and 
evolve into a grade IV tumor.

Keep in mind that the umls is a living document and corrections are made all 
the time - it is not flawless and this might be a case that should be reported.


 In the regular pipeline, the  concept array of gbm contains the CUI of 
 Glioblastoma only, while in the fast pipeline, the concept array of GBM 
 contains the CUIs of both Glioblastoma and glioblastoma Multiforme.

Another thing to keep in mind is that the regular pipeline does not always 
provide the best discoveries.  In this case, if it is not giving you 
glioblastoma multiforme for GBM then it is providing incomplete information - as 
glioblastoma multiforme is exactly what GBM stands for and that cui should be 
provided when gbm is discovered.  Otherwise, if a researcher (possibly more 
inclined to use ...multiforme than a clinician) is searching for the 
...multiforme cui then they will not find what they are looking for and may 
think that a gbm does not exist.


I hope that this clears the air,
Sean


-Original Message-
From: Oranit Dror [mailto:ora...@algotec.co.il] 
Sent: Monday, June 29, 2015 4:44 AM
To: dev@ctakes.apache.org
Subject: RE: The fast dictionary pipeline vs. the regular one

Hi,

Thank you all for the detailed replies.

Per the Glioblastoma and Glioblastoma Multiforme terms, I have contacted 
NLM with my question and their answer was as follows:

 Each is the Preferred Term in at least one of the 150 sources in the 
Metathesaurus. Neither is from a WHO vocabulary source. The terms are related 
in that Glioblastoma is the Broader term (RB) of the 2 and Glioblastoma 
Multiforme is the Narrower term (RN).

In the regular pipeline, the concept array of gbm contains the CUI of 
Glioblastoma only, while in the fast pipeline, the concept array of GBM 
contains the CUIs of both Glioblastoma and glioblastoma Multiforme.

Best,
Oranit.

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] 
Sent: Monday, June 22, 2015 5:13 PM
To: dev@ctakes.apache.org
Subject: RE: The fast dictionary pipeline vs. the regular one

Hi all,

I’m glad that there continues to be interest in the fast alternative to the 
dictionary lookup and I welcome all testing.

GBM actually is Glioblastoma Multiforme – hence the “M”.   The WHO name is the 
abbreviated “Glioblastoma”, but they are actually not (as far as I can discern) 
different things.  If you check the metathesaurus 2011ab, GBM brings up both 
Glioblastoma C0017636 and Glioblastoma Multiforme C1621958.  The first comes 
from Mesh and NCI, the second from CSP.  If you look at the definitions they 
are synonymous: “malignant form of astrocytoma histologically characterized by 
pleomorphism of cells, nuclear atypia, microhemorrhage and necrosis; may arise 
in any region of the central nervous system, with a predilection for the 
cerebral hemispheres, basal ganglia, and commissural pathways.”  Mapping to a 
different CUI in the UMLS does not always mean that they are truly different 
concepts.  It often means that they came from 2 different source dictionaries 
(such as in this case).  Also check 
https://en.wikipedia.org/wiki/Glioblastoma_multiforme
   But I am a little confused: are you saying that you got

RE: how to run i2b2 data

2015-08-07 Thread Finan, Sean

Hi Justin,

If you check out the source code, you should be able to find that class in the 
ctakes-core component.

Sean

-Original Message-
From: Justin Zhang [mailto:justinzhang...@gmail.com] 
Sent: Friday, August 07, 2015 10:45 AM
To: dev@ctakes.apache.org
Subject: Re: how to run i2b2 data

Thanks Sean for your understanding, and I am in hope now.

Where is the best place to start looking in order to create a collection 
reader that works similarly to 
org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader?

Justin

On Wed, Aug 5, 2015 at 7:24 PM, Finan, Sean  sean.fi...@childrens.harvard.edu 
wrote:

 Hi Justin,

 A shot in the dark:
 You could create a collection reader that works similarly to 
 org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader , but 
 instead of grabbing all of the files in a directory it grabs all the 
 records parsed from a single .xml and runs a pipeline per record.  
 Basically, swap a directory for an .xml, a text file for an xml element 
 containing a record.
 Somebody out there might have something that already does as much.

 Sean

 -Original Message-
 From: Justin Zhang [mailto:justinzhang...@gmail.com]
 Sent: Wednesday, August 05, 2015 6:40 PM
 To: u...@ctakes.apache.org; dev@ctakes.apache.org
 Subject: how to run i2b2 data

 Hello everyone,

 I am running ctakes with i2b2 data

 https://www.i2b2.org/NLP/DataSets/Main.php

 In each xml file, there are multiple patient records. I am able to 
 separate each patient into single files and process them with runCPE.sh

 Is there a way to convert this single xml file into the format ctakes
 accepted, and process as a single input file, and generate a single 
 output file (results labelled by patient id). For example, each 
 patient id has a smoking status.

 Thanks,

 --
 Justin




--
Justin

RE: how to run i2b2 data

2015-08-05 Thread Finan, Sean

Hi Justin,

A shot in the dark:
You could create a collection reader that works similarly to 
org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader , but instead of 
grabbing all of the files in a directory it grabs all the records parsed from a 
single .xml and runs a pipeline per record.  Basically, swap a directory for an 
.xml, a text file for an xml element containing a record.
Somebody out there might have something that already does as much.
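
To make the "swap a directory for an .xml" idea a bit more concrete, here is a minimal sketch of the parsing half only, using the JDK's DOM parser.  It is not a UIMA collection reader; the class name XmlRecordSplitter is hypothetical, and the record element and id attribute names are parameters because the actual layout of the i2b2 file is an assumption on my part.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: split one multi-record xml document into per-record texts. */
public class XmlRecordSplitter {

    /**
     * Returns record text keyed by record id, in document order.
     * recordTag and idAttribute are assumptions about the file layout.
     */
    public static Map<String, String> split(String xml, String recordTag, String idAttribute)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList records = doc.getElementsByTagName(recordTag);
        Map<String, String> textById = new LinkedHashMap<>();
        for (int i = 0; i < records.getLength(); i++) {
            // getElementsByTagName only returns elements, so the cast is safe.
            Element record = (Element) records.item(i);
            textById.put(record.getAttribute(idAttribute), record.getTextContent().trim());
        }
        return textById;
    }
}
```

A collection reader built on this would hand out one map entry per call where the directory reader hands out one file, and could use the record id as the document id so that results stay labelled by patient.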

Sean

-Original Message-
From: Justin Zhang [mailto:justinzhang...@gmail.com] 
Sent: Wednesday, August 05, 2015 6:40 PM
To: u...@ctakes.apache.org; dev@ctakes.apache.org
Subject: how to run i2b2 data

Hello everyone,

I am running ctakes with i2b2 data
https://www.i2b2.org/NLP/DataSets/Main.php
 

In each xml file, there are multiple patient records. I am able to separate 
each patient into single files and process them with runCPE.sh

Is there a way to convert this single xml file into the format ctakes
accepted, and process as a single input file, and generate a single output file 
(results labelled by patient id). For example, each patient id has a smoking 
status.

Thanks,

--
Justin

RE: Cannot resolve lookup descriptor files for UmlsDictionaryLookupAnnotator

2015-07-22 Thread Finan, Sean

Hi Jakob,

The LookupDesc.xml file is supposed to be editable by the user in order to 
enter umls username and password information.  If the file was in a resource 
.jar that would be pretty difficult.  Umls user information can also be 
specified on the command line, so perhaps the whole .xml scenario should be 
rethought.  It could easily be changed as long as users all agree to stick to 
the command-line umls user specification only.
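
As a rough illustration of the command-line-only approach, the credentials could be read from JVM system properties along these lines.  The property ctakes.umlspw appears in user logs on this list; ctakes.umlsuser is assumed here by analogy, and the class UmlsCredentials is purely hypothetical, not existing cTAKES code.

```java
/** Hypothetical sketch: read UMLS credentials from -D system properties only. */
public class UmlsCredentials {
    // Property names: "ctakes.umlspw" is seen in logs; "ctakes.umlsuser" is an assumption.
    static final String USER_PROPERTY = "ctakes.umlsuser";
    static final String PASSWORD_PROPERTY = "ctakes.umlspw";

    private final String user;
    private final String password;

    private UmlsCredentials(String user, String password) {
        this.user = user;
        this.password = password;
    }

    /** Reads both properties and fails fast if either one is missing or empty. */
    public static UmlsCredentials fromSystemProperties() {
        String user = System.getProperty(USER_PROPERTY);
        String password = System.getProperty(PASSWORD_PROPERTY);
        if (user == null || user.trim().isEmpty()) {
            throw new IllegalStateException("Missing -D" + USER_PROPERTY + "=...");
        }
        if (password == null || password.isEmpty()) {
            throw new IllegalStateException("Missing -D" + PASSWORD_PROPERTY + "=...");
        }
        return new UmlsCredentials(user.trim(), password);
    }

    public String getUser() {
        return user;
    }

    public String getPassword() {
        return password;
    }
}
```

With something like this, nothing in LookupDesc.xml would need to be user-editable, so the descriptor could live inside a resource .jar.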

Do you feel like submitting a JIRA item?

Sean


-Original Message-
From: Jakob Rogstadius [mailto:jakob.rogstad...@who-umc.org] 
Sent: Monday, July 20, 2015 4:41 AM
To: dev@ctakes.apache.org
Subject: RE: Cannot resolve lookup descriptor files for 
UmlsDictionaryLookupAnnotator

Hi Sean,

Thanks for your response. I had to work on something else for a couple of days, 
but now I'm back at it.

As you say, I get UmlsDictionaryLookupAnnotator to work when I manually copy 
the files from the subversion repository to my local project. What I have now 
looks like this:
project-name
project-name/src/main/java/...
project-name/data/...
project-name/resources/...
project-name/org/apache/ctakes/dictionary/lookup/... (this folder was 
copied from cTakes svn and is where LookupDesc.xml and the others files are 
located)

However, this doesn't seem like the right approach at all. The other cTakes 
components that I have tried using have all imported neatly as jars from Maven 
central, together with their -res jars which contain the descriptor files and 
other resources that they reference. At no point have I previously downloaded 
the source project from the SVN server, and everything except the UMLS 
dictionary lookups have worked this way.

I am confused. You say that the -res jars are not supposed to contain these 
files, but then what are they supposed to contain? As I mentioned below, the 
current -res jar for UmlsDictionaryLookupAnnotator has no content, except for 
META-INF. And is this really the only way I can get the components to work? 
What am I missing?

In case it matters, I instantiate the annotator as follows using uimaFit:
AggregateBuilder aggregate = new AggregateBuilder();
...

aggregate.add(UmlsDictionaryLookupAnnotator.createAnnotatorDescription());
...
AnalysisEngine aggregateEngine = aggregate.createAggregate();
...
SimplePipeline.runPipeline(reader, aggregateEngine, writer, evaluator);

Best regards,
Jakob

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] 
Sent: den 10 juli 2015 18:29
To: dev@ctakes.apache.org
Subject: RE: Cannot resolve lookup descriptor files for 
UmlsDictionaryLookupAnnotator

Hi Jakob,

The -res jars aren't supposed to contain those files.  The files should be 
placed in the resources/ directory under the ctakes root parallel to lib/.

Can you take me through your checkout / installation and build / run steps?  A 
list of your svn and maven commands might help me figure out what step is 
failing you.

Sean

-Original Message-
From: Jakob Rogstadius [mailto:jakob.rogstad...@who-umc.org] 
Sent: Friday, July 10, 2015 3:04 AM
To: dev@ctakes.apache.org
Subject: RE: Cannot resolve lookup descriptor files for 
UmlsDictionaryLookupAnnotator

Hi Sean,

Many thanks for your reply. Like you say, I see both the lookup descriptors and 
all other resources in the projects on the svn server 
(https://svn.apache.org/repos/asf/ctakes/trunk/). However, the -res jars that I 
get through maven are completely empty, 
except for their META-INF folders. For other components, their -res jars do 
contain their resources as expected. Could something have gone wrong while 
publishing recent versions of these two?

These are my relevant maven imports:

<dependency>
  <groupId>org.apache.ctakes</groupId>
  <artifactId>ctakes-dictionary-lookup</artifactId>
  <version>3.2.2</version>
</dependency>
<dependency>
  <groupId>org.apache.ctakes</groupId>
  <artifactId>ctakes-dictionary-lookup-res</artifactId>
  <version>3.2.2</version>
</dependency>
<dependency>
  <groupId>org.apache.ctakes</groupId>
  <artifactId>ctakes-dictionary-lookup-fast</artifactId>
  <version>3.2.2</version>
</dependency>
<dependency>
  <groupId>org.apache.ctakes</groupId>
  <artifactId>ctakes-dictionary-lookup-fast-res</artifactId>
  <version>3.2.2</version>
</dependency>

Jar content:
http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.ctakes/ctakes-dictionary-lookup-res/3.2.1/

RE: Invalid UMLS License

2015-07-27 Thread Finan, Sean

Hi Justin,

The UMLS licensing issue has been resolved: 
https://issues.apache.org/jira/browse/CTAKES-359

Any version built after May 12th 2015 should have the fix.

Sean


-Original Message-
From: Justin Zhang [mailto:justinzhang...@gmail.com] 
Sent: Sunday, July 26, 2015 9:21 AM
To: u...@ctakes.apache.org; dev@ctakes.apache.org
Subject: Invalid UMLS License

Hello Everyone and Sir Miller, Timothy

Has the UMLS license issue discussed in the following link been resolved?

http://mail-archives.apache.org/mod_mbox/ctakes-user/201505.mbox/%3CE084D8EFE2B03A408B324458C5212E945305DD21@CHEXMBX3B.CHBOSTON.ORG%3E
 

-- 

Thanks,

Justin

RE: Invalid UMLS License

2015-07-27 Thread Finan, Sean

 VBG VBN VBP 
VBZ WDT WP WPS WRB

27 Jul 2015 10:39:20  INFO AbstractJCasTermAnnotator - Using minimum term text 
span: 3

27 Jul 2015 10:39:21  INFO DictionaryDescriptorParser - Parsing dictionary
specifications:
/Users/justin/App/eclipse_mars/workspace_eclipse_mars/ctakes/ctakes-dictionary-lookup-fast-res/target/classes/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml

27 Jul 2015 10:39:21  INFO UmlsUserApprover - Checking UMLS Account at 
https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser
  for user zhangjustin -Dctakes.umlspw=20aug10! 
-Djava.util.logging.config.file=/Logger.properties:

.

27 Jul 2015 10:39:21 ERROR UmlsUserApprover -   UMLS Account at
https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser
  is not valid for user myuseraccount -Dctakes.umlspw= 
-Djava.util.logging.config.file=/Logger.properties with CHANGEME


On Mon, Jul 27, 2015 at 8:32 AM, Finan, Sean  
sean.fi...@childrens.harvard.edu wrote:

 Hi Justin,

 The UMLS licensing issue has been resolved:
 https://issues.apache.org/jira/browse/CTAKES-359

 Any version built after May 12th 2015 should have the fix.

 Sean


 -Original Message-
 From: Justin Zhang [mailto:justinzhang...@gmail.com]
 Sent: Sunday, July 26, 2015 9:21 AM
 To: u...@ctakes.apache.org; dev@ctakes.apache.org
 Subject: Invalid UMLS License

 Hello Everyone and Sir Miller, Timothy

 Has the UMLS license issue discussed in the following link been resolved?

 http://mail-archives.apache.org/mod_mbox/ctakes-user/201505.mbox/%3CE084D8EFE2B03A408B324458C5212E945305DD21@CHEXMBX3B.CHBOSTON.ORG%3E

 --

 Thanks,

 Justin




--
Justin
