RE: to involve in your development group
Hi Sandeep,

I just took a peek at the JavaOcr code, and it looks like they perform image filtering in the PixelImage class. This would probably cause a problem with dot matrix images, as every corner of every dot would be removed as noise, so dots that participate in curves on characters such as P would be removed to form something more like |'. In fact, depending upon the spacing between matrix dots and the resolution of the scan, the filter could decrease the size of each dot, making it very difficult for the OCR to work at all.

Assuming that you have already tried to train the software using your dot matrix printings, you could change JavaOcr to use Java Advanced Imaging (JAI). You would then use the JAI Raster class instead of the JavaOcr PixelImage class for image manipulation. There are a lot of things that you could do from that point forward.

Just giving you my initial thought,
Sean

-Original Message-
From: sandeep rg [mailto:sandeep.f...@gmail.com]
Sent: Monday, July 22, 2013 10:06 AM
To: dev@ctakes.apache.org
Subject: Re: to involve in your development group

Sir,

I have gone through some medical records, such as bills and patient details. Most of them are printed on a dot matrix printer, which makes it very hard to extract text from the scanned images. I have tested some professional software, such as ABBYY FineReader, which also gave poor output. But sir, I have the confidence to do it. I need more knowledge about image processing capabilities, though, so can you suggest anyone on your team who is good at image processing programming?
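
Sean's concern about a noise filter shrinking or erasing dot matrix dots can be illustrated with a small sketch. This is not the JavaOcr PixelImage filter, which is not reproduced here. It assumes a simple 3x3 erosion-style rule in which a dark pixel survives only if all eight of its neighbours are dark, and every class and method name below is hypothetical.

```java
// Hypothetical illustration only: NOT the JavaOcr PixelImage filter.
// Assumed rule: a dark pixel survives only if all 8 neighbours are dark.
final class DotErosionSketch {

    /** Applies one pass of the assumed 3x3 erosion rule to a binary image (true = dark). */
    static boolean[][] erode(boolean[][] image) {
        int rows = image.length;
        int cols = image[0].length;
        boolean[][] result = new boolean[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                result[r][c] = image[r][c] && allNeighboursDark(image, r, c);
            }
        }
        return result;
    }

    /** Pixels outside the image count as light, so border pixels never survive. */
    private static boolean allNeighboursDark(boolean[][] image, int row, int col) {
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                int r = row + dr;
                int c = col + dc;
                if (r < 0 || c < 0 || r >= image.length || c >= image[0].length || !image[r][c]) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Counts the dark pixels in the image. */
    static int countDark(boolean[][] image) {
        int count = 0;
        for (boolean[] row : image) {
            for (boolean pixel : row) {
                if (pixel) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Builds a size x size light image containing one dark square dot. */
    static boolean[][] squareDot(int size, int top, int left, int dotSize) {
        boolean[][] image = new boolean[size][size];
        for (int r = top; r < top + dotSize; r++) {
            for (int c = left; c < left + dotSize; c++) {
                image[r][c] = true;
            }
        }
        return image;
    }

    public static void main(String[] args) {
        boolean[][] largeDot = squareDot(7, 2, 2, 3); // 3x3 dot: only its centre pixel survives
        boolean[][] smallDot = squareDot(7, 2, 2, 2); // 2x2 dot: removed completely
        System.out.println("3x3 dot: " + countDark(largeDot) + " -> " + countDark(erode(largeDot)));
        System.out.println("2x2 dot: " + countDark(smallDot) + " -> " + countDark(erode(smallDot)));
    }
}
```

Under this assumed rule a dot only a few pixels wide loses most or all of its area in one pass, which is the failure mode described above for low-resolution scans of dot matrix print.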
On Thu, Jul 18, 2013 at 1:22 AM, sandeep rg sandeep.f...@gmail.com wrote:
I have done the sequence diagram and made some small changes. Please go through it and tell me if anything more should be included.

On Wed, Jul 17, 2013 at 9:37 PM, sandeep rg sandeep.f...@gmail.com wrote:
It is just a skeleton of the original proposal.

On Wed, Jul 17, 2013 at 9:31 PM, sandeep rg sandeep.f...@gmail.com wrote:
The sample work is shared with you both. If any more details should be included, please tell me. The GUI design, schedule and implementation flow chart are still to be added; they are under construction and will be uploaded within a few hours.

On Wed, Jul 17, 2013 at 7:56 PM, Chen, Pei pei.c...@childrens.harvard.edu wrote:
pei.stat...@gmail.com

-Original Message-
From: Mattmann, Chris A (398J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Wednesday, July 17, 2013 10:22 AM
To: dev@ctakes.apache.org
Subject: Re: to involve in your development group

chris.mattm...@gmail.com

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: sandeep rg sandeep.f...@gmail.com
Reply-To: dev@ctakes.apache.org dev@ctakes.apache.org
Date: Wednesday, July 17, 2013 6:53 AM
To: dev@ctakes.apache.org dev@ctakes.apache.org
Subject: Re: to involve in your development group

Can you provide your Gmail ID so that I can share the proposal document with you?

On Tue, Jul 16, 2013 at 11:33 PM, sandeep rg sandeep.f...@gmail.com wrote:
Sir, I will provide the proposal within two days. For now I am mainly going through the ASF-ICFOSS gateway, because if I go their way and my proposal is selected, ICFOSS will provide some support to us, such as certificates and small financial support.
But the main thing is that I like programming; I like to explore new technologies in coding and to interact with code. So if my proposal is rejected, I would still like to work on your project as a volunteer, if you allow me. I am now preparing a proposal and will submit it within 2 days. Chris Mattmann helped me to learn more about the format of the proposal.

On Tue, Jul 16, 2013 at 8:12 PM, Chen, Pei pei.c...@childrens.harvard.edu wrote:
Chris/Sandeep,
According to ASF-ICFOSS, I believe the deadline for submitting proposals is this coming Friday (July 19), after which point mentors will have 2 weeks to review and score/accept. Just curious, are we planning to follow the same process here? Or, since it's all volunteer work, technically Sandeep can still contribute code to the community and participate in the dev group here. Looking forward to it.
--Pei

-Original Message-
From: sandeep rg [mailto:sandeep.f...@gmail.com]
Sent: Monday, July 15, 2013 1:05 PM
To:
RE: cTAKES user interface
"Sean Finan (I think is on this group) already wrote a command line CPE runner like Pei described."

I am in this group, and I have written a very simple CLI CPE runner. As Pei mentioned: "The problem is that most of us who are already familiar with the nitty gritty are probably doing this with some sort of custom scripts or solution." The class that I have is probably not doing anything that others are not. In fact, I'm sure that I used somebody else's code as a template, as I am not that familiar with Senior Nitty Gritty.

I committed (Trunk, 1537124) a class named CmdLineCpeRunner.java to ctakes-utils in package ...utils.cpe. It was so quick 'n dirty that there isn't any documentation, no logging, etc., but it gets the job done. It takes a path to a cpe.xml file as an argument and simply runs the pipeline specified therein.

I suggest that James has the correct startup approach: However you need to have a classpath set properly. To accomplish that, you could try copying runctakesCPE.bat or runctakesCPE.sh and, within the script file, replacing org.apache.uima.tools.cpm.CpmFrame with [CLASS TO CALL].

To the best of my knowledge, the easiest way to create the cpe.xml file is probably to run through the GUI once, setting up the pipeline and saving the XML, but run through at least once to make certain that the pipeline works.

Enjoy,
Sean

-Original Message-
From: Lingren, Todd [mailto:todd.ling...@cchmc.org]
Sent: Wednesday, October 30, 2013 9:52 AM
To: dev@ctakes.apache.org
Cc: Finan, Sean
Subject: RE: cTAKES user interface

Hi all,
Sean Finan (I think is on this group) already wrote a command line CPE runner like Pei described. I've been using it and would be happy to provide some user guides if he provides the class, etc.
Todd Lingren Biomedical Informatics Cincinnati Children's Hospital todd.ling...@cchmc.org 513-803-9032 -Original Message- From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] Sent: Tuesday, October 29, 2013 9:56 PM To: dev@ctakes.apache.org Subject: Re: cTAKES user interface Thanks William and Richard, those are both really excellent pointers. Tim On 10/29/2013 07:58 PM, William Karl Thompson wrote: Nice! +1 for Groovy. It's like being able to program in Python again. -Original Message- From: Richard Eckart de Castilho [mailto:r...@apache.org] Sent: Tuesday, October 29, 2013 5:49 PM To: dev@ctakes.apache.org Subject: Re: cTAKES user interface Maven allows to do marvelous things on the CLI, provided you throw in an additional component: Groovy. We did some amazing self-contained Groovy scripts with uimaFIT and DKPro Core which you might find interesting http://code.google.com/p/dkpro-core-asl/wiki/DKProGroovyCookbook -- Richard On 29.10.2013, at 23:09, Miller, Timothy timothy.mil...@childrens.harvard.edu wrote: I think this is also an area where Maven integration was a small step backwards (I greatly appreciate the steps forward it allowed). I used to run stuff from the command line and in scripts more often but it's slightly less straightforward setting up the classpath with maven -- before you could put a simple java -cp lib/*.jar class name in a script, now I'm not sure how to go about it using maven. I'm sure there's a way, but I am afraid of falling down the maven rabbit hole. Tim On Oct 29, 2013, at 5:53 PM, Chen, Pei wrote: +1 Pan, the short answer is yes- it can be done in CLI. The problem is that most of us who are already familiar with the nitty gritty are probably doing this with some sort of custom scripts or solution. Cc' the dev group to get a fresh perspective; not sure what the easiest would be-- run the CPE via command line with default input/output directories or running a Driver Main Class as part of examples. --Pei
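
The classpath question raised in this thread (a script that puts every jar in a lib directory on the classpath and then calls a runner class with a cpe.xml path) can be sketched as below. This is only an illustration and not part of cTAKES: the class CpeLaunchSketch and its methods are hypothetical, the main class passed in stands for the "[CLASS TO CALL]" placeholder above, and the command is printed rather than executed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of cTAKES: assembles the command a launcher script would run.
final class CpeLaunchSketch {

    /** Keeps only *.jar entries and joins them with the platform path separator, sorted for stable output. */
    static String classpathOf(List<String> files) {
        return files.stream()
                .filter(name -> name.endsWith(".jar"))
                .sorted()
                .collect(Collectors.joining(File.pathSeparator));
    }

    /** Builds the argument list: java -cp <classpath> <mainClass> <cpeXml>. */
    static List<String> command(List<String> files, String mainClass, String cpeXml) {
        return Arrays.asList("java", "-cp", classpathOf(files), mainClass, cpeXml);
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("usage: CpeLaunchSketch <libDir> <mainClass> <cpe.xml>");
            return;
        }
        File[] entries = new File(args[0]).listFiles();
        List<String> files = new ArrayList<>();
        if (entries != null) {
            for (File entry : entries) {
                files.add(entry.getPath());
            }
        }
        // Printed rather than executed, so the sketch has no side effects.
        System.out.println(String.join(" ", command(files, args[1], args[2])));
    }
}
```

As a side note, the java launcher also accepts a directory wildcard of the form lib/* (quoted so the shell does not expand it) as a classpath entry, which may be simpler in a script than listing each jar.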
RE: cTAKES user interface
Well, thanks to my not checking the utils pom (or building trunk since I'm currently still in incubator), I made Jenkins angry. Instead of adding uima as a dependency to ctakes-utils, I moved the cpe cli to ctakes-core. I hope that works. My apologies to anybody that checked out in the last hour.
RE: Sundry; Problem Lists
Hi John,

I hope that you didn't think that I was belittling your ideas or saying that anything has been done (and done). I was just throwing in two resources for further thought. You have brought forward some great applications for cTakes and nlp!

Sean

From: John Green [john.travis.gr...@gmail.com]
Sent: Thursday, October 31, 2013 7:26 PM
To: dev@ctakes.apache.org
Subject: RE: Sundry; Problem Lists

Last point: I seem to be interested in a current encounter (the now) and diagnosis; the article seems to be interested in an arguably just as useful tool, the longitudinal problem list (the ever), though very different, I would think, in approach. Thoughts?

Jg
Sent from Mailbox for iPhone

On Thu, Oct 31, 2013 at 7:22 PM, John Green john.travis.gr...@gmail.com wrote:

Sean - quick note: after looking at the above two resources, a couple of points. The first resource confirms what I expected: that the vocabulary exists in ctakes. The second confirms what I suspected: that novel approaches to ordering and identification of top members of a problem list are needed. Namely, that the vocabulary may be there, but that's only a tenth of the battle. The second great resource you sent me acknowledges this: that prioritization, e.g. enumeration from most important to least, as well as clumping, are the true battle.

A point of clarification on my end: it would be interesting to see what could be added on top of existing ctakes in order to facilitate a solution to the second problem, clumping and prioritizing. (For instance, from the second article, an acute process may have nothing to do with the past medical history, and if an algorithm were concerned with all members as equals, it would miss the issue at hand.) Just as a thought: working back from the known natural history of diseases would possibly be a route to a solution.

This is probably well known stuff, so please forgive my ignorance if it's all been done/thought of before. Again, the two links were very helpful, thank you.
Jg — Sent from Mailbox for iPhone On Thu, Oct 31, 2013 at 2:04 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: I don't know if what I write below truly applies to the discussion, but here it is. much of a problem list definition may already be contained to varying degrees in existing cTakes databases. The UMLS does provide a problem list, but I haven't looked at it. http://www.nlm.nih.gov/research/umls/Snomed/core_subset.html This might be a paper of interest to you: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2655994/ It discusses the use of nlp to create something like a problem list. Sean From: John Green [john.travis.gr...@gmail.com] Sent: Thursday, October 31, 2013 12:02 PM To: dev@ctakes.apache.org Subject: Re: Sundry Pei and Tim - Good questions. The bottom line is that OPQRST is the algorithm that every clinician uses to characterize the history of a sign, symptom or constellation of symptoms. Each letter has multiple meanings, but generally they're grouped. O for onset, was it quick or slow in onset, P for palliative or provoking phenomenon, that is, does tylenol make it better? Does it feel better when you lean forward? Is it worse with standing? Q is the quality, generally, though I could give more examples of each Ill keep it brief from here, R is generally region or radiation of the pain and or sign, S is the severity, and T is the time course, is it intermittent? When it happens, how long does it last for? I could send documents used to teach new clinicians to better comprehend for anyone interested. OPQRST, while most residents would assume it is only for teaching new clinicians, as Tim said, is a useful tool at all levels. Great clinicians, and I work with some great senior folks, use this everyday. 
The idea that it is only for teaching is founded on two things: one, that it doubles as a structured mnemonic for characterizing signs and symptoms and two, that everyone so far ingrains this into their clinical skill set, unless they are geared toward teaching, they, after the basic level, never think about it again! Caveat: many good clinicians will tell you to keep it algorithmic so that you're systematic and do not overlook details. What is it's application to ML? Obviously the furthest desired end-state for NLP like cTakes would be understanding a clinical encounter to such a nuanced level that detailed diagnoses could be considered along with treatment plans. While I only know what I've read in Artificial Intelligence: A Modern Approach and picked up from friends over the years who were good knowledgeable in this field, I feel that OPQRST would be a huge benefit toward beginning to outline the problem of more rigorous ML characterization of the clinical narrative. The utility of OPQRST may not still be entirely clear to those who have
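
The OPQRST structure described in this message can be written down as a small data type. The sketch below is only an illustration of the mnemonic as John explains it; the class, enum and method names are hypothetical and nothing like this is claimed to exist in cTAKES. The missing() helper reflects the point about staying systematic so that no element is overlooked.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the OPQRST mnemonic as a data type; not part of cTAKES.
final class OpqrstSketch {

    /** The six OPQRST elements, in mnemonic order, with a short prompt for each. */
    enum Element {
        ONSET('O', "Was the onset quick or slow?"),
        PALLIATIVE_PROVOKING('P', "What makes it better or worse?"),
        QUALITY('Q', "What is the quality of the sign or symptom?"),
        REGION_RADIATION('R', "Where is it, and does it radiate?"),
        SEVERITY('S', "How severe is it?"),
        TIME_COURSE('T', "Is it intermittent, and how long does it last?");

        final char letter;
        final String prompt;

        Element(char letter, String prompt) {
            this.letter = letter;
            this.prompt = prompt;
        }
    }

    /** Spells the mnemonic from the element letters in declaration order. */
    static String mnemonic() {
        StringBuilder letters = new StringBuilder();
        for (Element element : Element.values()) {
            letters.append(element.letter);
        }
        return letters.toString();
    }

    /** Lists the elements that have no recorded finding yet. */
    static List<Element> missing(Map<Element, String> findings) {
        List<Element> result = new ArrayList<>();
        for (Element element : Element.values()) {
            String finding = findings.get(element);
            if (finding == null || finding.trim().isEmpty()) {
                result.add(element);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Element, String> findings = new EnumMap<>(Element.class);
        findings.put(Element.ONSET, "sudden");
        findings.put(Element.SEVERITY, "8/10");
        // Print the prompts for every element that still lacks a finding.
        for (Element element : missing(findings)) {
            System.out.println(element.letter + ": " + element.prompt);
        }
    }
}
```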
RE: specificity in selecting EntityMentions when using AggregatePlaintextUMLSProcessor
Hi Ted,

In addition to performing searches, the hyperSql ( http://hsqldb.org/ ) database tool should allow you to perform inserts into the umls dictionary database used by cTakes.

You can also create your own customized dictionary and run cTakes using only that dictionary or with umls plus that dictionary. There are several ways to create a custom dictionary, and I think that you can start by looking in the resources/ ... /dictionary/lookup/ directory for examples. It can be a little overwhelming if you just want to add one or two terms, and I am in the process of trying to make this a little easier for any user. It may be a while before I can add my work to the trunk. Until then, if you decide to go with the csv approach you can probably make it through with the examples in cTakes resources. If you want to create a new hsql database then I can send you my (old) instructions on that process - but it might be overkill.

If you really want to know what lies behind the mask of the cTakes umls dictionary then I highly recommend that you just interface with it directly using the hsql tool.

Sean

From: Assur, Ted [theodore.as...@providence.org]
Sent: Friday, November 01, 2013 5:36 PM
To: dev@ctakes.apache.org
Subject: RE: specificity in selecting EntityMentions when using AggregatePlaintextUMLSProcessor

OK, kind of resurfacing the original topic on this one, after I redirected it towards ICD codes last month: I have several examples, like the one below, where it would be very helpful to be able to include UMLS terms that are in the UMLS 2011AB release, e.g. CIN 1 (CUI = C0349458). So if I have particular UMLS concepts I want to make sure and include, is there a way for me to *add* them to the umls dictionary used by cTAKES?
Ted

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, September 04, 2013 9:37 AM
To: dev@ctakes.apache.org
Subject: RE: specificity in selecting EntityMentions when using AggregatePlaintextUMLSProcessor

I don't know if this is exactly what you want, but you can use the hyperSql ( http://hsqldb.org/ ) database tool to perform searches on the umls dictionary used by cTakes. For instance

select * from UMLS_MS_2011AB where FWORD = 'CIN'

will provide all the available terms starting with CIN. In the result you'll see that there is no term CIN I, and you'll also see that the only listing from ICD9 is for CIN III [C0851140, T191, MTHICD9 233.1]

If you want an icd9 code that isn't in the cTakes umls dictionary then you can find it online ... but that won't do you much good wrt cTakes.

Sean

-Original Message-
From: Assur, Ted [mailto:theodore.as...@providence.org]
Sent: Wednesday, September 04, 2013 11:56 AM
To: dev@ctakes.apache.org
Subject: RE: specificity in selecting EntityMentions when using AggregatePlaintextUMLSProcessor

Thanks for looking into this, it's been puzzling me. On another note, I know the cTAKES dictionary uses ICD9, but I'm not familiar with how to access that information: In the example I've described below, where would I locate the ICD9 for a specific entity?

Thank you
Ted

-Original Message-
From: Pei Chen [mailto:chen...@apache.org]
Sent: Tuesday, September 03, 2013 7:13 PM
To: dev@ctakes.apache.org
Subject: Re: specificity in selecting EntityMentions when using AggregatePlaintextUMLSProcessor

You're right, it should have gotten CIN I- that's a strange one, probably needs to be debugged/looked into further...

On Tue, Sep 3, 2013 at 10:05 PM, Miller, Timothy timothy.mil...@childrens.harvard.edu wrote:
Ah. So it will get
CIN 2 (in SNOMED)
CIN III (in SNOMED)
CIN 3 (in SNOMED)
but the rest are not in SNOMED? I wonder why it doesn't get CIN I?
It looks like that exists in SNOMED (though I don't fully understand what all the symbols mean in the umls browser).
CIN I - Cervical intraepithelial neoplasia 1 [A3002690/SNOMEDCT/SY/285836003]

On 09/03/2013 09:55 PM, Pei Chen wrote:
It has the correct parse (POS, chunks, and lookupwindow), but some of the terms do not exist in SNOMED. CIN 2 - Cervical intraepithelial neoplasia 2 [A3002688/SNOMEDCT/SY/285838002] exists but not CIN II. CIN III [A965/SNOMEDCT/SY/20365006] also exists; that's why it was able to perform the lookup successfully. Note that CIN II synonyms do exist in other umls thesauri such as MEDCIN and CCPSS, though. However, the bundled cTAKES dictionaries only contain (MeSH, SNOMEDCT, RxNORM, NCI, ICD9) IIRC.
--Pei

On Tue, Sep 3, 2013 at 9:44 PM, Miller, Timothy timothy.mil...@childrens.harvard.edu wrote:
That is a good question, Ted! I tried it with a simple context: The patient has a CIN III. I'm not sure if that is a correct context but I was able to duplicate your findings. (Finds a CUI for CIN III but not if you change it to CIN II) My first thought was that it is the chunker. But the chunker seems to get
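
The behaviour discussed in this thread (a dictionary queried by the first word of a term, where "CIN III" is found but "CIN II" is not) can be mimicked with a small in-memory sketch. It is not the cTAKES lookup code: the class name, the case-insensitive exact match and the three sample terms are assumptions made for illustration, with the terms taken from the messages above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a first-word-indexed term dictionary; not the cTAKES implementation.
final class FirstWordLookupSketch {

    // first word (lower case) -> full terms (lower case) that start with that word
    private final Map<String, List<String>> termsByFirstWord = new HashMap<>();

    private static String normalize(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    private static String firstWordOf(String normalizedTerm) {
        return normalizedTerm.split("\\s+")[0];
    }

    /** Adds a term, indexed by its first word (comparable in spirit to the FWORD column). */
    void addTerm(String term) {
        String normalized = normalize(term);
        termsByFirstWord.computeIfAbsent(firstWordOf(normalized), key -> new ArrayList<>()).add(normalized);
    }

    /** Returns every stored term whose first word matches, like the FWORD query in the thread. */
    List<String> termsStartingWith(String firstWord) {
        return termsByFirstWord.getOrDefault(normalize(firstWord), new ArrayList<>());
    }

    /** A term is found only if the whole term was stored; a shared first word is not enough. */
    boolean contains(String term) {
        String normalized = normalize(term);
        return termsStartingWith(firstWordOf(normalized)).contains(normalized);
    }

    /** Sample dictionary holding the three CIN terms the thread reports as found. */
    static FirstWordLookupSketch sample() {
        FirstWordLookupSketch dictionary = new FirstWordLookupSketch();
        dictionary.addTerm("CIN 2");
        dictionary.addTerm("CIN III");
        dictionary.addTerm("CIN 3");
        return dictionary;
    }

    public static void main(String[] args) {
        FirstWordLookupSketch dictionary = sample();
        System.out.println("CIN III found: " + dictionary.contains("CIN III"));
        System.out.println("CIN II found: " + dictionary.contains("CIN II"));
    }
}
```

Adding a missing synonym in this sketch is a single addTerm call, which corresponds loosely to the insert into the dictionary database that Sean suggests.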
RE: Sundry; Problem Lists
Excellent!

By the by, I know next to nothing about nlp. I'm just a software developer that (for some reason) jumped down this (nlp) particular rabbit hole. When it comes to nlp background, research, state and direction, I'm hoping that somebody much more knowledgeable than I will jump in.

"after a thorough pubmed search, no one seems to have tried to build problem lists for ACUTE encounters, only as extensions to a past medical history"

I'm really glad that we have a truly novel road on which to travel.

"I seem to be interested in a current encounter (the now) [as opposed to] the longitudinal problem list (the ever)."

I think that is great as both a challenge and a possible tool, as well as your thought on "prioritization, eg enumeration from most important to least, as well as clumping".

I briefly discussed the first idea (acute vs. historical) with another physician (after you brought it up) and there was concurrence that such a feature would be extremely useful, if not completely necessary for any real clinical use of nlp. I think that if temporal parsing ever becomes finite enough with respect to the time of an event relative to the time of the note (DocTimeRel) or with proper narrative containers, then this becomes a possible use case. I mention this in a weak attempt to pull the nlpers into the discussion ...

"This is probably well known stuff"

Bad assumption ... insert emoticon here ...

"working back from the known natural history of diseases would possibly be a route to a solution."

Now that is a challenge!

Cheers for the inspiration and enthusiasm,
Sean

From: John Green [john.travis.gr...@gmail.com]
Sent: Monday, November 04, 2013 10:45 AM
To: Finan, Sean
Subject: RE: Sundry; Problem Lists

Oh goodness no, I didn't think that at all! I'm so new to the field of NLP, anything and everything helps and is appreciated. Heck, I'm just now learning to understand Markov chains.
An additional thought: after a thorough pubmed search, no one seems to have tried to build problem lists for ACUTE encounters, only as extensions to a past medical history. I think this would be a very fruitful avenue. It could easily be scored against a gold standard medical resident list for a few hundred patients across depth and acuity. Just thinkin out loud, bouncing ideas off those who know more than I!

Jg
Sent from Mailbox for iPhone
RE: cTAKES Groovy...
Good stuff - Thanks Richard

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Friday, December 06, 2013 3:30 PM
To: 'dev@ctakes.apache.org'
Subject: RE: cTAKES Groovy...

Thanks Richard! That did the trick.

I'll create a JIRA and update the script, including adding a comment that the @GrabResolver is only needed for pre-OpenNLP 1.5.3 and should be removed when we upgrade to 1.5.3+. I'll also update CTAKES-191 "Update Apache OpenNLP dependency to 1.5.3" with a reminder to update the script.

Trunk of cTAKES still uses 1.5.2-incubating.

-Original Message-
From: dev-return-2297-Masanz.James=mayo@ctakes.apache.org [mailto:dev-return-2297-Masanz.James=mayo@ctakes.apache.org] On Behalf Of Richard Eckart de Castilho
Sent: Friday, December 06, 2013 2:12 PM
To: dev@ctakes.apache.org
Subject: Re: cTAKES Groovy...

On 06.12.2013, at 18:01, Masanz, James J. masanz.ja...@mayo.edu wrote:
I have not solved my issues on my ubuntu server yet where
Error grabbing Grapes -- [unresolved dependency: jwnl#jwnl;1.3.3: not found]

This has also already been fixed in OpenNLP 1.5.3, so there must be some dependency on OpenNLP 1.5.(1|2)-incubating. Anyway, you should be able to fix it by adding this to the beginning of your Groovy script, in front of the Grapes:

@GrabResolver(name='opennlp.sf.net', root='http://opennlp.sourceforge.net/maven2')

-- Richard
RE: UMLS Env variables suggestion
+1

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
Sent: Monday, January 06, 2014 10:57 AM
To: dev@ctakes.apache.org
Subject: RE: UMLS Env variables suggestion

Sounds like a good idea; we can just update all of the documentation/scripts to use underscore (_), and leave the dot (.) in the code to be deprecated for now?
--Pei

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Saturday, January 04, 2014 10:10 PM
To: dev@ctakes.apache.org
Subject: RE: UMLS Env variables suggestion

This went in to 3.1: https://issues.apache.org/jira/browse/CTAKES-164

I agree - the docs need to be updated if there is consensus on the use of this method. Personally I think that there should be one supported method, not both dot and underscore. I would prefer that we remove the dot functionality since it is not operational across all environments, but it isn't up to me alone to remove functionality.

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
Sent: Saturday, January 04, 2014 4:08 PM
To: dev@ctakes.apache.org
Cc: dev@ctakes.apache.org
Subject: Re: UMLS Env variables suggestion

I believe Sean updated the code to also support underscore (_) as well. But the docs just need to be updated...
On Jan 4, 2014, at 4:04 PM, Dewful dew...@gmail.com wrote:

In the documentation, in the .sh files to run ctakes:

# If you plan to use the UMLS Resources, set/export env variables
# export ctakes.umlsuser=[username], ctakes.umlspw=[password]

However, simply trying to export ctakes.umlsuser=myusername, ctakes.umlspw=mypassword doesn't work, because bash3 doesn't allow dots in the key name and will throw an error:

bin/runctakesCVD.sh: line 42: export: `ctakes.umlsuser=username,': not a valid identifier

http://stackoverflow.com/questions/15016403/how-to-export-dot-separated-environment-variables explains some solutions.

It may be helpful to show how the user can set these easily if they want to set the env variables this way, possibly using one of the suggestions in SO.
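
A lookup that accepts both spellings discussed in this thread might look like the sketch below. It is not the code committed for CTAKES-164, which is not shown in these messages. The class name and the lookup order (underscore name in the environment, then the dotted name in the environment, then the dotted name as a JVM system property) are assumptions made for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

// Hypothetical sketch; not the actual cTAKES credential handling.
final class UmlsCredentialLookupSketch {

    /**
     * Looks up a setting by its dotted name, e.g. "ctakes.umlsuser".
     * Assumed order: underscore form in the environment, dotted form in the
     * environment, then the dotted form as a JVM system property.
     */
    static Optional<String> lookup(String dottedKey, Map<String, String> env, Properties props) {
        String underscoreKey = dottedKey.replace('.', '_');
        String value = env.get(underscoreKey);
        if (value == null) {
            value = env.get(dottedKey);
        }
        if (value == null) {
            value = props.getProperty(dottedKey);
        }
        return Optional.ofNullable(value);
    }

    public static void main(String[] args) {
        Optional<String> user = lookup("ctakes.umlsuser", System.getenv(), System.getProperties());
        // Report only whether the value is set, never the value itself.
        System.out.println(user.isPresent() ? "UMLS user is set" : "UMLS user is not set");
    }
}
```

With the underscore form, a plain export ctakes_umlsuser=[username] is a valid bash identifier, and the dotted form can still be passed to the JVM as -Dctakes.umlsuser=[username].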
RE: sentence detector newline behavior
On my end it looks like my email was reformatted and some of my -newline- removed in those last examples ...

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 3:42 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Thanks James

"but then no typical sentence ending punctuation at the end of the line"

Gotcha.

"So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line"

I was actually thinking about something like a Line using -sentence breaks- in addition to -newline-. In other words, a Sentence being what cTakes detects by ignoring CR/LF, and Lines being those Sentences subdivided by -newline-. Perhaps Line is a horrible moniker. Regardless, it doesn't solve the problem of inappropriately missing punctuation. I was focused a little more on the difference between persistent auto-line wrapping and structured information like lists, where the first benefits from Sentence and the second from Line.

The Patient has been prescribed two medications. Prescriptions: Advil Tylenol No Aspirin

However, when it comes to the problem that you mention, there is no benefit to a Line.

The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed
-- 2 sentences, 3 lines

The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed
-- 2 sentences, 3 lines

The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed
-- 2 sentences, 5 lines

Nothing can really be done for the last bit where punctuation is missing.

-Original Message-
From: Masanz, James J.
[mailto:masanz.ja...@mayo.edu] Sent: Wednesday, January 22, 2014 3:07 PM To: 'dev@ctakes.apache.org' Subject: RE: sentence detector newline behavior I know there are notes where there are multiple sentences on a line, but then no typical sentence ending punctuation at the end of the line (or no punctuation at all at the end of the line). And in those sections, negation can be important. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line. And using sentences alone (as found by OpenNLP 1.5) would not suffice because it would run together sentences from different lines. -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Wednesday, January 22, 2014 1:33 PM To: dev@ctakes.apache.org Subject: RE: sentence detector newline behavior Just whistling in the wind here ... Perhaps before any changes are made to universally toggle cTakes in one direction or the other, we can take a poll of when where cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a Line (CR/LF delimited PLUS -sentence-) If some capabilities like negation detection require -lines- then would it make more sense to have Sentence ignore -newline- and negation detection itself split the Sentence into line items? If an annotator is interested in list items, each of which may be on a distinct -line-, then it can split up the Sentence as needed. I think that James hints that cTakes code already does this in some places. If a good deal of functionality requires -newline- delimited types, would it make sense to introduce a type Line? If something uses a structured list it could iterate through Line types, while something using pure text could iterate through Sentence types. This facilitates section-by-section different behavior, does not require any decision on global defaults, and makes data selection for training Sentence a nonesuch wrt line breaks. 
However, it adds to the system and would require a per-use decision by developers OR a toggle by users (back to the default decision). Perhaps this has already been tried?

Sean

-Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Wednesday, January 22, 2014 1:06 PM To: 'dev@ctakes.apache.org' Subject: RE: sentence detector newline behavior

The only rule I know of is that cTAKES (prior to ytex integration) always forces a sentence break at a newline. This was because the clinical notes cTAKES originally processed never had newlines in the middle of a sentence, but did need sentence breaks to occur at end of sentence for good negation detection on those notes. I think Guergana earlier mentioned other EMRs also have this need, but it seems to not be ubiquitous.

From others' posts, it seems that we could use an option in cTAKES to turn off this forcing of sentence breaks at newlines (or, depending on how you look at it, an option to turn on the forcing of sentence breaks if we change the default behavior). I think we
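The Sentence/Line idea discussed in this thread (a Sentence detected while ignoring CR/LF, and Lines being that Sentence subdivided at newlines) can be sketched in a few lines of plain Java. The class SentenceLineSplitter is hypothetical and works on strings only; it is not a UIMA annotator and does not reflect any existing cTAKES type.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of subdividing a detected sentence into lines. */
final class SentenceLineSplitter {

    private SentenceLineSplitter() {
    }

    /** Splits one sentence's text at CR/LF boundaries, dropping blank pieces. */
    static List<String> toLines(final String sentenceText) {
        final List<String> lines = new ArrayList<>();
        for (final String piece : sentenceText.split("\\r?\\n|\\r")) {
            final String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(final String[] args) {
        // A list-like "sentence" with no punctuation at the line ends.
        final String sentence = "Prescriptions:\nAdvil\nTylenol\nNo Aspirin";
        for (final String line : toLines(sentence)) {
            System.out.println(line);
        }
    }
}
```

A component that cares about list items (negation on "No Aspirin", for instance) could iterate over the lines, while one that wants wrapped prose could keep using the whole sentence. As noted in the thread, this does nothing for two sentences that share a line without punctuation between them.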
RE: YTEX cTAKES 3.1.1 ready
Hi Vijay, I have yet to run across clinical text from a real EMR where newlines represent the end of a sentence Since James pointed out this possibility a couple weeks ago, I have kept my eyes open. The problem is pretty ubiquitous in a corpus that I'm working with right now. I just opened the first note and gave it a count ... 95 lines total, 9 are sentence/phrase (lacking punctuation) endings. This is not including lists, which comprise about half of the note. One possible conjoinment was Will consider [...] biopsy\nGiven [...]. Depending upon how cTakes deals with it, the meaning could change drastically. I believe cTAKES absolutely has to support sentences with newlines within them Yes, cTakes should do so, but I hope that you aren't suggesting that it only support such a structure. Where is that easy button? -Original Message- From: vijay garla [mailto:vnga...@gmail.com] Sent: Thursday, February 06, 2014 10:31 AM To: dev@ctakes.apache.org Cc: ytex-us...@googlegroups.com; ctakes-...@incubator.apache.org; vlad.valtchi...@gmail.com Subject: Re: YTEX cTAKES 3.1.1 ready I believe it is worth migrating to trunk. Note that the sentence detector is also complementary - the existing ctakes sentence detector is unchanged - users can choose which sentence detector to use. There are changes to assertion dependency parsing to support sentences without newlines, and that works with both sentence detectors. I believe cTAKES absolutely has to support sentences with newlines within them - I have yet to run across clinical text from a real EMR where newlines represent the end of a sentence - the changes to assertion dependency parsing will have to be done at some point. -vj On Thu, Feb 6, 2014 at 10:19 AM, Chen, Pei pei.c...@childrens.harvard.eduwrote: VJ, Aside from the changes to the existing cTAKES code (sentence detector, etc.) [which we could leave out if it's still being debated], Do you think it's worth migrating the ytex code to trunk at this point? 
As you mentioned earlier, it's largely complementary. [I was just thinking of saving effort to maintain the separate branch and for simplicity for dev...] --Pei -Original Message- From: vijay garla [mailto:vnga...@gmail.com] Sent: Wednesday, February 05, 2014 9:30 PM To: ytex-us...@googlegroups.com; ctakes-...@incubator.apache.org; vlad.valtchi...@gmail.com Subject: Re: YTEX cTAKES 3.1.1 ready Hi Vlad, I Updated the umls install guide; see https://code.google.com/p/ytex/wiki/UMLS_SQL_SERVER_3_1 I would prefer to add the docs in the ctakes confluence, but as far as I can tell, I don't have write access there - can somebody give me write privileges on the ctakes confluence site? There was a bug in the umls install; copy https://svn.apache.org/repos/asf/ctakes/branches/ytex/ctakes- ytex/scripts/data/build.xmlover the corresponding file in your ctakes-3.1.2 install (CTAKES_HOME\bin\ctakes-ytex\scripts\data) and you should be set. The import is currently running on the UMLS 2013AA (I assume this will complete without issues as long as the umls schema hasn't changed from 2012). what trial and error did you have to go through to build the distro? -vj On Wed, Feb 5, 2014 at 5:33 PM, vijay garla vnga...@gmail.com wrote: Hi Vlad, sorry that the instructions aren't clear. re 1) What I am trying to say is install apache-ctakes-3.2.0-snapshot as usual (this is unchanged from 3.1.1). After that you still have to apply the lib and resources (these are things that cannot be distributed via apache). re 2) Yes, I need to update those docs. Hopefully will get to that at some point. However, I assume you already have a UMLS DB (also assume SQL Server). If you can't/don't want to use your existing umls DB, please tell me. The I'll priortize upgrading the doc on importing the umls tables (the scripts are there). 
best, VJ

On Wed, Feb 5, 2014 at 4:44 PM, vlad.valtchi...@gmail.com wrote:

Hi VJ- so, with trial and error we were able to make the distribution and now have the apache-ctakes-3.1.2-SNAPSHOT-bin.zip archive. Here's what's unclear.

1. Is this now the only (combined) thing that you need for ctakes 3.1.1 + Ytex? The current documentation (https://code.google.com/p/ytex/wiki/Installation_cTAKES_3_1?ts=1388793998&updated=Installation_cTAKES_3_1), which most probably is outdated, talks about installing cTakes 3.1.1 first and then applying 2 SNAPSHOT archives (downloadable), lib and resources. This is a point of confusion.

2. The directions to import the UMLS subset are then outdated as well. Maybe one should use the old version (ctakes 2.5 and ytex 0.8) to import the RRF files for the UMLS subset and then just use the resulting db.

Thoughts?

Thanks, Vlad Valtchinov Brigham Rad

On Thursday,
RE: YTEX cTAKES 3.1.1 ready
Right, got it. I just wanted to let you know that some EMR notes -do- require sentence splitting at newline characters. -Original Message- From: vijay garla [mailto:vnga...@gmail.com] Sent: Thursday, February 06, 2014 1:06 PM To: dev@ctakes.apache.org Cc: ytex-us...@googlegroups.com; ctakes-...@incubator.apache.org; vlad.valtchi...@gmail.com Subject: Re: YTEX cTAKES 3.1.1 ready The cTAKES sentence detector is not changed in the YTEX branch. The YTEX branch has an *additional* sentence detector that does not automatically split sentences on newlines - users can use this if they like. -vj On Thu, Feb 6, 2014 at 1:01 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Vijay, I have yet to run across clinical text from a real EMR where newlines represent the end of a sentence Since James pointed out this possibility a couple weeks ago, I have kept my eyes open. The problem is pretty ubiquitous in a corpus that I'm working with right now. I just opened the first note and gave it a count ... 95 lines total, 9 are sentence/phrase (lacking punctuation) endings. This is not including lists, which comprise about half of the note. One possible conjoinment was Will consider [...] biopsy\nGiven [...]. Depending upon how cTakes deals with it, the meaning could change drastically. I believe cTAKES absolutely has to support sentences with newlines within them Yes, cTakes should do so, but I hope that you aren't suggesting that it only support such a structure. Where is that easy button? -Original Message- From: vijay garla [mailto:vnga...@gmail.com] Sent: Thursday, February 06, 2014 10:31 AM To: dev@ctakes.apache.org Cc: ytex-us...@googlegroups.com; ctakes-...@incubator.apache.org; vlad.valtchi...@gmail.com Subject: Re: YTEX cTAKES 3.1.1 ready I believe it is worth migrating to trunk. Note that the sentence detector is also complementary - the existing ctakes sentence detector is unchanged - users can choose which sentence detector to use. 
There are changes to assertion dependency parsing to support sentences without newlines, and that works with both sentence detectors. I believe cTAKES absolutely has to support sentences with newlines within them - I have yet to run across clinical text from a real EMR where newlines represent the end of a sentence - the changes to assertion dependency parsing will have to be done at some point. -vj On Thu, Feb 6, 2014 at 10:19 AM, Chen, Pei pei.c...@childrens.harvard.eduwrote: VJ, Aside from the changes to the existing cTAKES code (sentence detector, etc.) [which we could leave out if it's still being debated], Do you think it's worth migrating the ytex code to trunk at this point? As you mentioned earlier, it's largely complementary. [I was just thinking of saving effort to maintain the separate branch and for simplicity for dev...] --Pei -Original Message- From: vijay garla [mailto:vnga...@gmail.com] Sent: Wednesday, February 05, 2014 9:30 PM To: ytex-us...@googlegroups.com; ctakes-...@incubator.apache.org; vlad.valtchi...@gmail.com Subject: Re: YTEX cTAKES 3.1.1 ready Hi Vlad, I Updated the umls install guide; see https://code.google.com/p/ytex/wiki/UMLS_SQL_SERVER_3_1 I would prefer to add the docs in the ctakes confluence, but as far as I can tell, I don't have write access there - can somebody give me write privileges on the ctakes confluence site? There was a bug in the umls install; copy https://svn.apache.org/repos/asf/ctakes/branches/ytex/ctakes- ytex/scripts/data/build.xmlover the corresponding file in your ctakes-3.1.2 install (CTAKES_HOME\bin\ctakes-ytex\scripts\data) and you should be set. The import is currently running on the UMLS 2013AA (I assume this will complete without issues as long as the umls schema hasn't changed from 2012). what trial and error did you have to go through to build the distro? -vj On Wed, Feb 5, 2014 at 5:33 PM, vijay garla vnga...@gmail.com wrote: Hi Vlad, sorry that the instructions aren't clear. 
re 1) What I am trying to say is install apache-ctakes-3.2.0-snapshot as usual (this is unchanged from 3.1.1). After that you still have to apply the lib and resources (these are things that cannot be distributed via apache). re 2) Yes, I need to update those docs. Hopefully will get to that at some point. However, I assume you already have a UMLS DB (also assume SQL Server). If you can't/don't want to use your existing umls DB, please tell me. The I'll priortize upgrading the doc on importing the umls tables (the scripts are there). best, VJ On Wed, Feb 5, 2014 at 4:44 PM, vlad.valtchi...@gmail.com wrote: Hi VJ- so, with trial and error were able to make the distribution and now
RE: Update: UMLS, cTAKES, and UIMA for applications in genomics
Hi Andy,

We have been using UIMA-AS here, but with no third-party wrappings. We have set it up to run in standalone and LSF cluster environments, but everything is out-of-box with a few custom bash scripts to set environment settings, etc.

Sean

-Original Message- From: andy mcmurry [mailto:mcmurry.a...@gmail.com] Sent: Monday, February 24, 2014 2:16 PM To: dev@ctakes.apache.org Subject: Update: UMLS, cTAKES, and UIMA for applications in genomics

Hi all:

I'm writing to update about my efforts to make a cTAKES out of the box VM with UMLS support. My specific use cases are for annotating DNA test results, so both publication text and patient notes are important towards this goal.

cTAKES VM
Wrote bash scripts to download and install cTAKES on Ubuntu. Will provide interfaces for REST endpoints for each service (Clojure).

UMLS Services
Wrote Clojure/REST services for invoking the MetamapAPI and looking up concepts/synonym entries in the UMLS. Will do the same for cTAKES in the upcoming months.

Semantic Representation
Is anyone else using UMLS SemRep (http://semrep.nlm.nih.gov/)? These annotations provide secondary evidence for the cTAKES medication and co-reference parsers, as well as additional annotations for other semantic types.

Genetic variant parser (HGVS)
Reece Hart released a standard HGVS parser (https://bitbucket.org/invitae/hgvs) which I intend to include in the VM distribution as an optional UIMA pipeline (callout REST service).

Scalability: UIMA Async Scaleout with Fit
I'm planning on using Clojuima (https://github.com/jimpil/clojuima) to scale at my company. Is everyone else using UIMA-AS as well, or planning to?
RE: How to add a new dictionary database to cTAKES
Hi Abhishek,

You have some interesting timing ... I can give you the xml specifications that you require if you send me the format of your dictionary. Since you are new to the current dictionary module setup, I might also have a simpler solution for you ...

A couple of days ago I checked a new module into Sandbox called ctakes-dictionary-lookup2 (how novel a name). It is a complete replacement of the current dictionary lookup module, but both can sit side-by-side in your local trunk sandbox or build. It has an example descriptor that tells it to read a bar-separated value file (BSV) as a dictionary, storing it (indexed) in memory for fast lookup. There is an example dictionary and xml descriptor for that dictionary.

It accepts 2 or 3 column files in the format CUI|Text or CUI|TUI|Text. It automatically detects the number of columns, but they must be in that order. It also does not need the text fields to be tokenized, allowing it to accept "Tumor, malignant" as well as "tumor , malignant", as it will perform the tokenization upon reading the file.

As the dictionary will be stored in-memory it should not be huge. If you do have a very large number of terms (50k) then I recommend an hsql db. The new module will take an hsql db with the fixed field names CUI, TUI, RINDEX, TCOUNT, TEXT, RWORD. I will explain what those mean in some documentation that I plan to check into sandbox later today, but I can help you build an hsql dictionary db ...

Yesterday I checked into sandbox a project named dictionarytool. It is source-only, but I can give you a jar if you want one. Out-of-the-box it will build various dictionaries from a UMLS download. It can build BSV, Hsql (new format) and Hsql (current format) to be used by the new or current dictionary lookup modules.

This devlist announcement is a little premature on my part. I will not get usage documentation into sandbox for a day or two, but I can send you copies as I go if you are in a hurry, or just give you xml snippets for the current module descriptors. If you send the format of your dictionary then that can be done quickly. I just wanted to let you know that there is another option wrt dictionary lookup.

Sean

-Original Message- From: Abhishek De [mailto:abhishek...@alumnux.com] Sent: Friday, February 28, 2014 6:58 AM To: dev@ctakes.apache.org Subject: How to add a new dictionary database to cTAKES

Hi,

How do I add a new database to the cTAKES pipeline to perform lookup from? How do I specify what columns to look up and how to annotate the text with the returned hits?

I have gone through the DictionaryLookupAnnotatorDB.xml and LookupDesc_Db.xml files. However, I could not understand the meanings of terms like lookupField, metaField, maxPermutationLevel and exclusionTags. If I add a new database, I need to configure this xml file properly. Please guide me regarding these problems.

Thanks and Regards, Abhishek De
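To make the bar-separated format described above concrete, here is a rough sketch of reading one CUI|Text or CUI|TUI|Text line, detecting the column count and tokenizing the text on read. Everything in it is an assumption for illustration: the class BsvDictionarySketch, the Term holder and the punctuation-splitting tokenizer are hypothetical and are not the code in ctakes-dictionary-lookup2, and the CUI and TUI values in the usage note are made-up placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical reader for one BSV dictionary line; not the ctakes-dictionary-lookup2 code. */
final class BsvDictionarySketch {

    /** One dictionary term; tui is null for two-column lines. */
    static final class Term {
        final String cui;
        final String tui;
        final List<String> tokens;

        Term(final String cui, final String tui, final List<String> tokens) {
            this.cui = cui;
            this.tui = tui;
            this.tokens = tokens;
        }
    }

    private BsvDictionarySketch() {
    }

    /** Parses CUI|Text or CUI|TUI|Text, detecting the number of columns. */
    static Term parseLine(final String line) {
        final String[] columns = line.split("\\|", -1);
        if (columns.length != 2 && columns.length != 3) {
            throw new IllegalArgumentException("Expected 2 or 3 bar-separated columns: " + line);
        }
        final String cui = columns[0].trim();
        final String tui = columns.length == 3 ? columns[1].trim() : null;
        return new Term(cui, tui, tokenize(columns[columns.length - 1]));
    }

    /** Assumed tokenizer: lower-case, then treat each punctuation mark as its own token. */
    static List<String> tokenize(final String text) {
        final String spaced = text.toLowerCase(Locale.ROOT)
                .replaceAll("(\\p{Punct})", " $1 ")
                .trim();
        if (spaced.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(spaced.split("\\s+"));
    }
}
```

With this sketch, parseLine("C0000001|Tumor, malignant") and parseLine("C0000001|T000|tumor , malignant") yield the same token list, which is the behavior the message describes for pre-tokenized versus untokenized text.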
RE: getSeverity etc. for relation extractor
Hi James, It is starting to resemble a row of falling dominoes ... I ran with an incubator version of the location of extractor and it did seem to find multiple locations for a single d/d. Functionality may have changed since then. Thanks for all of your attention to this topic. Sean -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Friday, March 21, 2014 4:34 PM To: 'dev@ctakes.apache.org' Subject: RE: getSeverity etc. for relation extractor Running from trunk, I don't get any relations for Rash on arm and leg :( If I change the text to pain in arm and leg I get one LocationOfTextRelation annotation with arg1=SignSymptomMention (pain) and arg2=AnatomicalSiteMention (arm) Does the relation extractor support creating a 2nd relation involving pain - the one between pain and leg (is this just an unfortunate choice of example) or does the relation extractor need enhancement before it would create mutiple location_of for a single SignSymptomMention or DiseaseDisorderMention BTW, I will have to debug the setting of bodyLocation in the code because even for pain in arm, when running from trunk, the LocationOfTextRelation annotation is being created, but the bodyLocation within the SignSymptomMention is not being set because the code in TemplateFillerAnnotator expects arg1 and arg2 to be swapped from what they currently are. I'll take a look at what it was in cTAKES 3.1 and find out if this is a bug in TemplateFillerAnnotator or something else. -- James -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Friday, March 21, 2014 12:30 PM To: dev@ctakes.apache.org Subject: RE: getSeverity etc. for relation extractor until we have a definite, well-defined need (from a user). Rash on arm and leg I don't follow what you mean by your item B) below [Rash].getLocationRelation() [Rash : Arm] [Rash].getLocation() [Arm] -Original Message- From: Masanz, James J. 
[mailto:masanz.ja...@mayo.edu] Sent: Friday, March 21, 2014 12:58 PM To: 'dev@ctakes.apache.org' Subject: RE: getSeverity etc. for relation extractor Yes, if there is more than one severity or location relation for a given identified annotation, currently the template filler does just take the last severity and or last location. I suggest not changing the type system to allow a list (FSArray), or at least holding off until we have a definite, well-defined need (from a user). I think instead, ideally, we would make the template filler smarter at picking which severity / which location when there is more than one for the given identified annotation. Therefore I'd rather not make it a list now, when in the long run I think it should be a single value. And in the meantime if someone has a need, they can look through the relations. Pei, I don't follow what you mean by your item B) below -- James -Original Message- From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] Sent: Thursday, March 20, 2014 2:03 PM To: dev@ctakes.apache.org Subject: RE: getSeverity etc. for relation extractor Awesome! Thanks James... On Sean's point about many-to-one relationships. I think the current type system only supports 1 degree_of and severity_of for each IdentifiedAnnotation? Does the TemplateFiller component currently just take the last one in the list currently? Should we modify the type system to support this in the future- something like the below? A) Support many-to-one B) Separate out getting the relations and getting the actual identified annotations. One suggestion would be: IdentifiedAnnotation.getBodyLocations(): FSArrayIdentifiedAnnotation IdentifiedAnnotation.getBodyLocationRelations(): FSArrayLocationOfTextRelation IdentifiedAnnotation.getSeverity(): FSArrayModifier IdentifiedAnnotation.getSeverityRelations(): FSArrayDegreeOfTextRelation What do others think? --Pei -Original Message- From: Masanz, James J. 
[mailto:masanz.ja...@mayo.edu] Sent: Thursday, March 20, 2014 2:50 PM To: 'dev@ctakes.apache.org' Subject: RE: getSeverity etc. for relation extractor

I saw the jira was assigned to me and had a few minutes so I implemented a fix and committed. It was more than just the one line. The name of the index that holds the binary text relations has changed (there are now separate indexes instead of one for all binary text relations), so I had to change which index was searched.

-Original Message- From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] Sent: Thursday, March 20, 2014 9:28 AM To: dev@ctakes.apache.org Subject: RE: getSeverity etc. for relation extractor

Thanks for confirming, James. It seems like a bug... Chase, if you can confirm that adding ddm.setSeverity(degreeOfTextRelation); works for you, I can commit the changes in trunk. Which also brings up some interesting points: 1) Should we populate IdentifiedAnnotation.severity() and bodylocationof() Directly
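The difference between the current "take the last one" behavior and the many-to-one proposal in this thread can be shown with plain collections. This is only a sketch: LocationRelationSketch and its LocationOf holder are hypothetical stand-ins that use strings, not the cTAKES type-system classes (IdentifiedAnnotation, LocationOfTextRelation) or the TemplateFillerAnnotator logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-ins for illustration; these are not the cTAKES type-system classes. */
final class LocationRelationSketch {

    /** A minimal location_of relation between a mention and an anatomical site. */
    static final class LocationOf {
        final String mention;
        final String site;

        LocationOf(final String mention, final String site) {
            this.mention = mention;
            this.site = site;
        }
    }

    private LocationRelationSketch() {
    }

    /** Single-valued filling: a later relation for the same mention overwrites an earlier one. */
    static Map<String, String> lastLocationOnly(final List<LocationOf> relations) {
        final Map<String, String> result = new LinkedHashMap<>();
        for (final LocationOf relation : relations) {
            result.put(relation.mention, relation.site);
        }
        return result;
    }

    /** Many-to-one filling: every site related to a mention is kept, in encounter order. */
    static Map<String, List<String>> allLocations(final List<LocationOf> relations) {
        final Map<String, List<String>> result = new LinkedHashMap<>();
        for (final LocationOf relation : relations) {
            result.computeIfAbsent(relation.mention, key -> new ArrayList<>()).add(relation.site);
        }
        return result;
    }
}
```

Given relations (pain, arm) and (pain, leg), the first method keeps only leg while the second keeps both. James's alternative, a smarter choice of a single value, would replace the overwrite in lastLocationOnly with a selection rule.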
RE: getSeverity etc. for relation extractor
Hi James, I don't have an exact phrase to use. We used the location_of with a brain aneurysm project, but the corpus is elsewhere now. However, it would tag things such as [aneurysm] : [middle cerebral artery] and [aneurysm] : [cerebral artery] - which is different from arm/leg, but an example of 2 locations for one entity. From: Masanz, James J. [masanz.ja...@mayo.edu] Sent: Monday, March 24, 2014 11:05 AM To: 'dev@ctakes.apache.org' Subject: RE: getSeverity etc. for relation extractor I ran 3.1 against pain in arm and leg and I get just one location_of relation. And again no location_of relations for rash on arm and leg Sean, what was the exact phrase you used with the incubator version? (or was that a while ago and lost) -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Friday, March 21, 2014 3:59 PM To: dev@ctakes.apache.org Subject: RE: getSeverity etc. for relation extractor Hi James, It is starting to resemble a row of falling dominoes ... I ran with an incubator version of the location of extractor and it did seem to find multiple locations for a single d/d. Functionality may have changed since then. Thanks for all of your attention to this topic. Sean -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Friday, March 21, 2014 4:34 PM To: 'dev@ctakes.apache.org' Subject: RE: getSeverity etc. 
for relation extractor Running from trunk, I don't get any relations for Rash on arm and leg :( If I change the text to pain in arm and leg I get one LocationOfTextRelation annotation with arg1=SignSymptomMention (pain) and arg2=AnatomicalSiteMention (arm) Does the relation extractor support creating a 2nd relation involving pain - the one between pain and leg (is this just an unfortunate choice of example) or does the relation extractor need enhancement before it would create mutiple location_of for a single SignSymptomMention or DiseaseDisorderMention BTW, I will have to debug the setting of bodyLocation in the code because even for pain in arm, when running from trunk, the LocationOfTextRelation annotation is being created, but the bodyLocation within the SignSymptomMention is not being set because the code in TemplateFillerAnnotator expects arg1 and arg2 to be swapped from what they currently are. I'll take a look at what it was in cTAKES 3.1 and find out if this is a bug in TemplateFillerAnnotator or something else. -- James -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Friday, March 21, 2014 12:30 PM To: dev@ctakes.apache.org Subject: RE: getSeverity etc. for relation extractor until we have a definite, well-defined need (from a user). Rash on arm and leg I don't follow what you mean by your item B) below [Rash].getLocationRelation() [Rash : Arm] [Rash].getLocation() [Arm] -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Friday, March 21, 2014 12:58 PM To: 'dev@ctakes.apache.org' Subject: RE: getSeverity etc. for relation extractor Yes, if there is more than one severity or location relation for a given identified annotation, currently the template filler does just take the last severity and or last location. I suggest not changing the type system to allow a list (FSArray), or at least holding off until we have a definite, well-defined need (from a user). 
I think instead, ideally, we would make the template filler smarter at picking which severity / which location when there is more than one for the given identified annotation. Therefore I'd rather not make it a list now, when in the long run I think it should be a single value. And in the meantime if someone has a need, they can look through the relations. Pei, I don't follow what you mean by your item B) below -- James -Original Message- From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] Sent: Thursday, March 20, 2014 2:03 PM To: dev@ctakes.apache.org Subject: RE: getSeverity etc. for relation extractor Awesome! Thanks James... On Sean's point about many-to-one relationships. I think the current type system only supports 1 degree_of and severity_of for each IdentifiedAnnotation? Does the TemplateFiller component currently just take the last one in the list currently? Should we modify the type system to support this in the future- something like the below? A) Support many-to-one B) Separate out getting the relations and getting the actual identified annotations. One suggestion would be: IdentifiedAnnotation.getBodyLocations(): FSArrayIdentifiedAnnotation IdentifiedAnnotation.getBodyLocationRelations(): FSArrayLocationOfTextRelation IdentifiedAnnotation.getSeverity(): FSArrayModifier IdentifiedAnnotation.getSeverityRelations(): FSArrayDegreeOfTextRelation What do others think? --Pei -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu
RE: Temporal Information Extraction package has compile time error
Hi Manu, Speaking for the developers of that module, we are excited that you and others in the community are starting to show so much interest in temporal information extraction - enough to attempt builds and trial runs. The Temporal module is still in an academic experimental phase and there are some necessary models and custom third-party library extensions that are necessary to build but have not or cannot be checked into the cTakes repository. We hope to have Temporal ready for full build and use in the upcoming cTakes release, but until that time it will remain relatively unusable by the wider cTakes community. I apologize if its placement in trunk caused confusion. All of that having been written, if you have particular ideas on implementation, usage or anything else, please let us know. Sean -Original Message- From: Manu Sikka [mailto:manusi...@hotmail.com] Sent: Wednesday, March 26, 2014 11:15 PM To: dev@ctakes.apache.org Subject: Temporal Information Extraction package has compile time error Temporal Information Extraction package has compile time error Please look into it
RE: errors when run BagOfCUIsGenerator.java
Try to open https://uts-ws.nlm.nih.gov

If that works then try https://uts-ws.nlm.nih.gov/restful/isValidctakes.umlsuser and see if you get a message like "This XML file does not appear to have any style information associated with it. The document tree is shown below."

If that works and you are comfortable with the code, try with

umlsaddr : https://uts-ws.nlm.nih.gov/restful/isValidctakes.umlsuser
vendor : NLM-6515182895

   import java.io.BufferedReader;
   import java.io.IOException;
   import java.io.InputStreamReader;
   import java.io.OutputStreamWriter;
   import java.io.UnsupportedEncodingException;
   import java.net.URL;
   import java.net.URLConnection;
   import java.net.URLEncoder;

   // LOGGER is a logger field of the enclosing class (not shown in this post).

   /**
    * @param umlsaddr - address of the UMLS validation service
    * @param vendor - vendor license code
    * @param username - UMLS user name
    * @param password - UMLS password
    * @return true if the server at umlsaddr approves of the vendor, user, password combination
    */
   public static boolean isValidUMLSUser( final String umlsaddr, final String vendor,
                                          final String username, final String password ) {
      // Build the url-encoded form body: licenseCode=...&user=...&password=...
      String data;
      try {
         data = URLEncoder.encode( "licenseCode", "UTF-8" ) + "=" + URLEncoder.encode( vendor, "UTF-8" );
         data += "&" + URLEncoder.encode( "user", "UTF-8" ) + "=" + URLEncoder.encode( username, "UTF-8" );
         data += "&" + URLEncoder.encode( "password", "UTF-8" ) + "=" + URLEncoder.encode( password, "UTF-8" );
      } catch ( UnsupportedEncodingException unseE ) {
         LOGGER.error( "Could not encode URL for " + username + " with vendor license " + vendor );
         return false;
      }
      try {
         // Post the form to the server and read the response line by line
         final URL url = new URL( umlsaddr );
         final URLConnection connection = url.openConnection();
         connection.setDoOutput( true );
         final OutputStreamWriter writer = new OutputStreamWriter( connection.getOutputStream() );
         writer.write( data );
         writer.flush();
         boolean result = false;
         final BufferedReader reader = new BufferedReader( new InputStreamReader( connection.getInputStream() ) );
         String line;
         while ( (line = reader.readLine()) != null ) {
            final String trimline = line.trim();
            if ( trimline.isEmpty() ) {
               break;
            }
            result = trimline.equalsIgnoreCase( "<Result>true</Result>" );
         }
         writer.close();
         reader.close();
         return result;
      } catch ( IOException ioE ) {
         LOGGER.error( ioE.getMessage() );
         return false;
      }
   }

-Original Message- From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] Sent: Wednesday, April 16, 2014 1:25 PM To: dev@ctakes.apache.org Subject: RE: errors when run BagOfCUIsGenerator.java

Ying, Are you behind a proxy or firewall? If you're trying to use the umls resources, it attempts to make a call to their umls service to validate your credentials. --Pei

-Original Message- From: Liu, Ying [mailto:l...@advisory.com] Sent: Wednesday, April 16, 2014 1:13 PM To: dev@ctakes.apache.org Subject: errors when run BagOfCUIsGenerator.java

It failed when I ran BagOfCUIsGenerator.java. The following is the error information. Thanks for your help. Ying

Exception in thread "main" org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator failed. (Descriptor: file:/C:/Users/Ying/workspacectakes/ctakes/ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml)
   at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:252)
   at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:156)
   at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
   at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
   at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
   at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:387)
   at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:254)
   at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:431)
   at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:375)
   at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:185)
   at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
   at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
   at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
   at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:354)
   at org.uimafit.factory.AnalysisEngineFactory.createAnalysisEngineFromPath(AnalysisEngineFactory.java:147)
RE: lvg entries
Those variants are not used by the dictionary lookup. I did look at them to see if it was worthwhile for the new dictionary, but they are all over the place so I passed.

From: Miller, Timothy [timothy.mil...@childrens.harvard.edu] Sent: Thursday, April 17, 2014 1:25 PM To: dev@ctakes.apache.org Subject: Re: lvg entries

Pei and I had a similar discussion in person -- mapping from lexical variants to a stem might be useful. Pei also mentioned that one intended use might have been searching the dictionary with lexical variants, but I don't think that is done. Looking at the precision of the variants, I think it's highly unlikely the speed tradeoff would be worth any improvements in recall. Finally, at least in eclipse, doing a search on references to the method to retrieve the lemma entries turns up nothing. Tim

On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:

I don’t know of any applications within cTAKES that make use of this… The reverse (mapping from these “variants” to the normal form) may be useful though. Dima

On Apr 17, 2014, at 11:50, Miller, Timothy timothy.mil...@childrens.harvard.edu wrote:

Sure, just as an example, I gave it a note with about 1000 words. It generates 11500 NonEmptyFSList elements (each is basically one lexical variant). For the word symptomatic, these are the first 10 of 20 lexical variants:

Symptomaticer/JJ
Symptomaticer/RB
Symptomaticed/VB
Symptomaticcing/VB
Symptomatics/VB
Symptomatics/NN
Symptomaticked/VB
Symptomatic/VB
Symptomatic/JJ
Symptomatic/RB

Tim

On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:

Tim, this is a very interesting observation. Could you please send a few examples of what LVG generates? Both sensical and non :) Dima

On Apr 17, 2014, at 11:28, Miller, Timothy timothy.mil...@childrens.harvard.edu wrote:

The LVG annotator creates an enormous number of lemmas for every WordToken in the CAS, and I'm wondering what the original purpose was? I think this is probably a minor bottleneck for speed but mostly a pretty big space hog (at least 50% of the space of xmi files in my tests). As of right now I'm not sure if any downstream components are using these lemmas, and on a manual inspection the precision seems to be pretty abysmal (meaning most of them are nonsensical as lexical variants), so as I said, just wondering if we can revisit why cTAKES generates so many and whether that component can be optimized. Thanks Tim
RE: lvg entries
+1 false

-Original Message- From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] Sent: Friday, April 18, 2014 2:54 PM To: dev@ctakes.apache.org Subject: Re: lvg entries

Thanks for tracking that down Andy. I am making a pass at UimaFit-izing the configuration parameters for all the annotators in the default pipeline, before I create the static factory methods like we recently discussed. Should I go ahead and change this to make default behavior be false? Tim

On 04/18/2014 12:47 AM, andy mcmurry wrote:

There is a lot of config handling, maybe PostLemmas is being set to true or configInit() is not setting up the NLM wrapper correctly.

ctakes-lvg *README*
Note: as distributed, PostLemmas is set to false. This is done to reduce the size of the CAS. Set PostLemmas to true to have org.apache.ctakes.typesystem.type.Lemma annotations added to the CAS.

*LvgAnnotator.xml*
PostLemmas = True

*LvgAnnotator.java*
if (postLemmas) { lvgResource.getLvgLex() }

On Thu, Apr 17, 2014 at 3:23 PM, Masanz, James J. masanz.ja...@mayo.edu wrote:

The normalizedForm field is filled in. It is used by dictionary lookup. So, for example, if the dictionary contains "lymph node" but not "lymph nodes", a document with the text "lymph nodes" would match the dictionary entry "lymph node" because "node", being the normalized form of "nodes", would be used when searching dictionary entries (in addition to searching dictionary entries for "nodes").

-Original Message- From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] Sent: Thursday, April 17, 2014 4:33 PM To: dev@ctakes.apache.org Subject: Re: lvg entries

Quick follow-up since I was interested. The current dependency parser does have the option to use ctakes lemmas or do its own lemmatizing, but that doesn't use the lemma field, it uses the normalizedForm field. I'm not sure if that field is actually ever filled in -- on my example data it is always null. Tim

On 04/17/2014 01:57 PM, Masanz, James J. wrote:

Offhand I recall at least one of the dependency parsers used the Lemma annotations at one point. Not sure if still does. There is an option for turning off the posting of the lemmas to the cas. Hope that helps

-Original Message- From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] Sent: Thursday, April 17, 2014 11:27 AM To: dev@ctakes.apache.org Subject: lvg entries

The LVG annotator creates an enormous number of lemmas for every WordToken in the CAS, and I'm wondering what the original purpose was? I think this is probably a minor bottleneck for speed but mostly a pretty big space hog (at least 50% of the space of xmi files in my tests). As of right now I'm not sure if any downstream components are using these lemmas, and on a manual inspection the precision seems to be pretty abysmal (meaning most of them are nonsensical as lexical variants), so as I said, just wondering if we can revisit why cTAKES generates so many and whether that component can be optimized. Thanks Tim
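To make the two points in this thread concrete (the normalized form is what the lookup uses, and the lemma variants are only posted when PostLemmas is true), here is a rough stand-alone sketch. None of it is cTAKES code: the Token class, the toy normalize() rule and the matches() helper are hypothetical stand-ins, and real LVG normalization is far richer than stripping a trailing "s".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustration only; hypothetical types, not the cTAKES type system or LVG. */
final class NormalizedLookupSketch {

    /** Stand-in for a word token carrying a normalized form and optional lemma variants. */
    static final class Token {
        final String text;
        final String normalizedForm;
        final List<String> lemmas = new ArrayList<>();

        Token(String text, String normalizedForm) {
            this.text = text;
            this.normalizedForm = normalizedForm;
        }
    }

    /** Toy normalizer (assumption): lower-case and strip one trailing "s". */
    static String normalize(String word) {
        String lower = word.toLowerCase();
        return lower.length() > 3 && lower.endsWith("s")
                ? lower.substring(0, lower.length() - 1)
                : lower;
    }

    /** The normalized form is always set; lemma variants are attached only when postLemmas is true. */
    static Token annotate(String word, boolean postLemmas) {
        Token token = new Token(word, normalize(word));
        if (postLemmas) {
            token.lemmas.add(token.normalizedForm); // placeholder for the many LVG variants
        }
        return token;
    }

    /** Looks the phrase up by its surface text and, in addition, by its normalized text. */
    static boolean matches(Set<String> dictionary, List<Token> phrase) {
        List<String> surface = new ArrayList<>();
        List<String> normalized = new ArrayList<>();
        for (Token token : phrase) {
            surface.add(token.text.toLowerCase());
            normalized.add(token.normalizedForm);
        }
        return dictionary.contains(String.join(" ", surface))
                || dictionary.contains(String.join(" ", normalized));
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("lymph node"));
        List<Token> phrase = Arrays.asList(annotate("lymph", false), annotate("nodes", false));
        System.out.println(matches(dictionary, phrase)); // prints "true"
    }
}
```

In this sketch the lookup still succeeds with postLemmas set to false, which mirrors the observation above that the dictionary lookup relies on normalizedForm rather than on the Lemma annotations.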
RE: ytex merged into trunk
Hi Vijay, I did a checkout this morning and I'm getting compile errors from Maven. If I just run mvn compile then I get an error while building ytex claiming that the package has not been created. Is there a reversed dependency? If I run mvn compile package then ytex seems to run through, but there is an error in the test of ytex-uima (see below). Any ideas? Thanks, Sean

Running org.apache.ctakes.ytex.uima.annotators.SparseDataExporterTest
...
2014-04-28 10:50:43,074 INFO org.hibernate.dialect.Dialect - HHH000400: Using dialect: org.hibernate.dialect.HSQLDialect
2014-04-28 10:50:43,112 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper - SQL Error: -22, SQLState: S0002
2014-04-28 10:50:43,112 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper - Table not found in statement [select uimatype0_.uima_type_id as uima_typ1_21_, uimatype0_.uima_type_name as uima_typ2_21_, uimatype0_.table_name as table_na3_21_ from PUBLIC.ref_uima_type uimatype0_]
...
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.277 sec FAILURE!

Results :

Tests in error:

test(org.apache.ctakes.ytex.uima.annotators.DBCollectionReaderTest): Unable to initialize group definition. Group resource name [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key [ytexApplicationContext]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ytexApplicationContext' defined in URL [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.springframework.context.support.ClassPathXmlApplicationContext]: Constructor threw exception; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'documentMapperService' defined in class path resource [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init method failed; nested exception is org.hibernate.exception.SQLGrammarException: could not prepare statement

org.apache.ctakes.ytex.uima.annotators.DBConsumerTest: Unable to initialize group definition. Group resource name [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key [ytexApplicationContext]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ytexApplicationContext' defined in URL [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.springframework.context.support.ClassPathXmlApplicationContext]: Constructor threw exception; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'documentMapperService' defined in class path resource [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init method failed; nested exception is org.hibernate.exception.SQLGrammarException: could not prepare statement

org.apache.ctakes.ytex.uima.annotators.DBConsumerTest

testDictionaryLookupIntegrated(org.apache.ctakes.ytex.uima.annotators.DictionaryLookupAnnotatorTest): Initialization of annotator class org.apache.ctakes.ytex.uima.annotators.SegmentRegexAnnotator failed. (Descriptor: file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-uima/desc/analysis_engine/SegmentRegexAnnotator.xml)

testDictionaryLookupSimple(org.apache.ctakes.ytex.uima.annotators.DictionaryLookupAnnotatorTest)

testDisambiguate(org.apache.ctakes.ytex.uima.annotators.SenseDisambiguatorAnnotatorTest): Unable to initialize group definition. Group resource name [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key [ytexApplicationContext]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ytexApplicationContext' defined in URL [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.springframework.context.support.ClassPathXmlApplicationContext]: Constructor threw exception; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'documentMapperService' defined in class path resource [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init method failed; nested exception is org.hibernate.exception.SQLGrammarException: could not prepare statement

org.apache.ctakes.ytex.uima.annotators.SparseDataExporterTest: Unable to initialize group definition. Group resource name
RE: ytex merged into trunk
Completely new error. I have taken this offline until we figure out what is going on.

-Original Message- From: vijay garla [mailto:vnga...@gmail.com] Sent: Monday, April 28, 2014 1:47 PM To: dev@ctakes.apache.org Subject: Re: ytex merged into trunk

Hello All, I can't reproduce this build error. It appears that maven does not want to run copy-dependencies in the compile phase. However, I have tried building this with maven 3.2.1 and maven 3.1.0 and it works fine for both. @Sean - can you send me the output of mvn -X clean install -pl ctakes-ytex (executed from ctakes root dir)

This is the plugin that maven is complaining about:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>copy-dependencies</id>
      <phase>compile</phase>
      <goals>
        <goal>copy-dependencies</goal>
      </goals>
      <configuration>
        <outputDirectory>${basedir}/target/lib</outputDirectory>
        <overWriteReleases>false</overWriteReleases>
        <overWriteSnapshots>false</overWriteSnapshots>
        <overWriteIfNewer>true</overWriteIfNewer>
      </configuration>
    </execution>
  </executions>
</plugin>

On Mon, Apr 28, 2014 at 1:26 PM, vijay garla vnga...@gmail.com wrote:

sorry about that. I will investigate. -vj

On Mon, Apr 28, 2014 at 11:00 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote:

Hi Vijay, I did a checkout this morning and I'm getting compile errors from Maven. If I just run mvn compile then I get an error while building ytex claiming that the package has not been created. Is there a reversed dependency? If I run mvn compile package then ytex seems to run through, but there is an error in the test of ytex-uima (see below). Any ideas? Thanks, Sean

Running org.apache.ctakes.ytex.uima.annotators.SparseDataExporterTest ...
RE: Preparing for an Apache cTAKES 3.2 Release?
"it would be incredibly helpful to have thorough documentation"

I agree. There is some documentation in the module's doc/ directory, but it is very brief. There are also some example descriptors in the example/ directory. The -resource also has some example xmls and dictionaries. It isn't much, but I have a small plate heaped with large portions of many courses and very little time to document. If there are questions please write me and I'll update the documentation as necessary. Anybody else that feels inclined can also add to the docs. Eventually the documentation should be moved to reside with the rest of the cTakes docs. Sean

-Original Message- From: vijay garla [mailto:vnga...@gmail.com] Sent: Wednesday, June 11, 2014 9:33 AM To: dev@ctakes.apache.org Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

regardless of the name, I think it would be incredibly helpful to have thorough documentation on the dictionary lookup, how to configure it, and how to create new dictionaries. I would venture to say that this is the most important component in cTAKES, and probably the one that has generated the most questions on the newsgroup.

On Wed, Jun 11, 2014 at 9:21 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote:

"The newer NER should have in its name the Behavior..."

I agree, but the *2 module is a complete replacement for the current lookup. It does not (really) have any different behavior, just a different implementation and performance. We plan to swap out the old with the new in the next release and get rid of the *2 suffix. So, any name provided now is just temporary - unless people don't like the name dictionary-lookup at all. In my original sandbox it was named RareWordLookup, a nod to its implementation. However, this doesn't help any users. Sean

-Original Message- From: andy mcmurry [mailto:mcmurry.a...@gmail.com] Sent: Wednesday, June 11, 2014 3:09 AM To: dev@ctakes.apache.org Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

2 doesn't mean much. The newer NER should have in its name the Behavior... Perhaps something like MetaMap Usage http://metamap.nlm.nih.gov/Docs/MM09_Usage.shtml --allow_overmatches or --allow_concept_gaps or .other? Since yTex already provides a pluggable *DictionaryLookup*, that seems like the best place to define the differing Behavior / Usage.
https://cwiki.apache.org/confluence/display/CTAKES/User's+Guide
https://code.google.com/p/ytex/wiki/DictionaryLookup_V05
AndyMC

On Tue, Jun 10, 2014 at 9:55 AM, britt fitch britt.fi...@gmail.com wrote:

I don’t have an issue with the *-2 name. I also don’t have any objections to renaming it. It might be nice to keep the old dictionary code around for a release-worth of time but after that I would vote for purging it. If someone needs it after that it’ll be accessible in the archived releases.

On Jun 10, 2014, at 12:48 PM, Chen, Pei pei.c...@childrens.harvard.edu wrote:

I think James has a fair point here. It may be worthwhile biting the bullet here and pushing forward. Since this essentially will be a full replacement of the ctakes-dictionary-lookup module, a good option may be to just replace the entire module now and rename the existing module to *_deprecated. How do folks feel about that? In a nutshell, ctakes-dictionary-lookup-2 is a faster algorithm with a simpler code base, and comparable results (Sean has a full comparison in the documentation for those who are curious). --Pei

-Original Message- From: britt fitch [mailto:britt.fi...@gmail.com] Sent: Monday, June 09, 2014 5:42 PM To: dev@ctakes.apache.org Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

There is some documentation in the dictionary2 module under /doc/DictionaryLookupHelp.{txt | docx} that gives some details of the different lookup implementation options within that module that I found helpful.

On Jun 9, 2014, at 5:17 PM, Masanz, James J. masanz.ja...@mayo.edu wrote:

Will ctakes-dictionary-lookup2 remain the name for the new dictionary lookup or will it have a name that reflects the algorithm? Is there a description of it that will help users to decide when to use one dictionary lookup component vs. the other? -- James

-Original Message- From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] Sent: Friday, June 06, 2014 12:34 PM To: dev@ctakes.apache.org Subject: Preparing for an Apache cTAKES 3.2 Release?

Hi, The 3.2 release was slated to be released at the end of this month (Jun 21). Since I volunteered to be the RM for this release, just like the past releases, I was planning to create a branch/tag next week from trunk and dev can continue. Feel free to take a look at any outstanding Jira issues [1] that you may want
RE: Preparing for an Apache cTAKES 3.2 Release?
I guess that I've got one question at this point: Is the name being given to the -new- dictionary lookup module temporary or permanent? I was under the assumption that it was temporary and that with the switch to it being default (and eventually only) the module would simply be named dictionary-lookup.

-Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Monday, June 16, 2014 11:24 AM To: 'dev@ctakes.apache.org' Subject: RE: Preparing for an Apache cTAKES 3.2 Release?

I'd rather something else than dictionary-lookup-fast. If we come up with something even faster than this one, having an older one called fast could be confusing.

-Original Message- From: Dligach, Dmitriy [mailto:dmitriy.dlig...@childrens.harvard.edu] Sent: Monday, June 16, 2014 9:55 AM To: cTAKES Developer list Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

+1 Dima

On Jun 16, 2014, at 9:42, Miller, Timothy timothy.mil...@childrens.harvard.edu wrote:

Sorry to weigh in so late on this -- just returned from vacation. If we want to have a one release delay before making dictionary2 default for testing/documentation/configuration purposes, and there isn't an obvious function-related name, and the main difference is speed, maybe we could call it dictionary-lookup-fast? Besides being accurate and more descriptive than 2, it might lure people into trying it and give us some feedback. Tim

On 06/16/2014 10:34 AM, Chen, Pei wrote:

I'm making some significant updates to trunk that may cause some instability for this release. It should be mostly transparent, but let me know if you encounter any issues with trunk. Also, regarding the dictionary-lookup2. If there are no strong objections, we can leave default to as-is (old behavior). Folks who wish to give the new one a try are welcome to do so and we can change the default behavior in a future release. [ducks for cover now] --Pei

-Original Message- From: ksa...@gmail.com [mailto:ksa...@gmail.com] On Behalf Of Karthik Sarma Sent: Wednesday, June 11, 2014 9:58 AM To: dev@ctakes.apache.org Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

Agreed

On Wednesday, June 11, 2014, vijay garla vnga...@gmail.com wrote:

regardless of the name, I think it would be incredibly helpful to have thorough documentation on the dictionary lookup, how to configure it, and how to create new dictionaries. I would venture to say that this is the most important component in cTAKES, and probably the one that has generated the most questions on the newsgroup.

On Wed, Jun 11, 2014 at 9:21 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote:

"The newer NER should have in its name the Behavior..."

I agree, but the *2 module is a complete replacement for the current lookup. It does not (really) have any different behavior, just a different implementation and performance. We plan to swap out the old with the new in the next release and get rid of the *2 suffix. So, any name provided now is just temporary - unless people don't like the name dictionary-lookup at all. In my original sandbox it was named RareWordLookup, a nod to its implementation. However, this doesn't help any users. Sean

-Original Message- From: andy mcmurry [mailto:mcmurry.a...@gmail.com] Sent: Wednesday, June 11, 2014 3:09 AM To: dev@ctakes.apache.org Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

2 doesn't mean much. The newer NER should have in its name the Behavior... Perhaps something like MetaMap Usage http://metamap.nlm.nih.gov/Docs/MM09_Usage.shtml --allow_overmatches or --allow_concept_gaps or .other? Since yTex already provides a pluggable *DictionaryLookup*, that seems like the best place to define the differing Behavior / Usage.
RE: DeepPheno: guidance on CTakes
Hi Pei, Nice examples. The pipeline builder could be simpler (divvied), but they shouldn't leave anybody confused. +1 for the uimafit annotations!

-Original Message- From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] Sent: Friday, June 27, 2014 11:11 AM To: Hochheiser, Harry Stewart; dev@ctakes.apache.org Subject: RE: DeepPheno: guidance on CTakes

+dev

Harry, I've just checked in two example Java classes [1] that should make life a lot easier for developers to create and add new cTAKES Annotators. It will shield users initially from all of the complexities of UIMA, XML Descriptors, cTAKES, etc. Just check out the latest:

svn co http://svn.apache.org/repos/asf/ctakes/trunk
mvn clean compile

--Pei

[1] http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-examples/src/main/java/org/apache/ctakes/examples/

-Original Message- From: Hochheiser, Harry Stewart [mailto:har...@pitt.edu] Sent: Thursday, June 26, 2014 5:31 PM To: Chen, Pei Subject: DeepPheno: guidance on CTakes

Pei: As I'm now digging into cTAKES as part of our DeepPheno project (and some other related efforts), I'm hoping you can help with a quick question. Is there any guide/documentation on the process for adding new annotators to cTAKES? I've dug into the apache site and mailing list archives, but haven't had much luck. Thanks! -harry

Harry Hochheiser University of Pittsburgh Department of Biomedical Informatics har...@pitt.edu 412 648 9300
RE: Bacterium Dictionary
Hi Nick, There are ~26,000 T007 Bacterium (falls under Living Being) entries in UMLS 2013aa. They aren't in the cTakes dictionary, but you can build a separate bacteria dictionary using the dictionary creator tool in cTakes sandbox. It can create dictionaries formatted for use with both available cTakes-dictionary-lookup modules. I have a full living beings dictionary, if you want to somehow confirm your umls license then I could pull out the bacteria for you. Sean

-Original Message- From: Pei Chen [mailto:chen...@apache.org] Sent: Monday, June 30, 2014 12:50 PM To: dev@ctakes.apache.org Subject: Re: Bacterium Dictionary

Nick, I am not sure how complete it is, but I believe the UMLS has the semantic type of Bacterium https://uts.nlm.nih.gov//semanticnetwork.html#Bacterium;0;0;2014AA [T007] It's most likely not included in the default cTAKES dictionaries though... Thanks, Pei

On Mon, Jun 30, 2014 at 10:31 AM, Nick Nikandish snika...@emerginghealthit.com wrote:

Hi there, I was wondering if Ctakes has any Bacterium Dictionary? I need to extract information for bacteria like “Enterococcus Faecium”, “Pseudomonas Aeruginosa”, etc and I was wondering if I can do it by using Ctakes annotators? Thanks,

*Nick Nikandish* *Product Development Software Engineer* Clinical Research Informatics *Emerging Health* *Montefiore Information Technology* 6 Executive Blvd. Suite 290, Yonkers, NY 10701 914-457-6792 Office snika...@montefiore.org www.emerginghealthit.com www.montefiore.org
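As a rough illustration of what "pulling out the bacteria" amounts to, the sketch below keeps only the concept ids whose semantic type is T007. It is not the sandbox dictionary creator tool: the class name is hypothetical, the rows are assumed to be pipe-delimited with the CUI first and the TUI second (the column order of the UMLS MRSTY.RRF file), and the sample CUIs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: filters pipe-delimited semantic-type rows by TUI. */
final class SemanticTypeFilter {

    /** Returns the first field (CUI) of every row whose second field equals the given TUI. */
    static List<String> cuisWithTui(List<String> rows, String tui) {
        List<String> cuis = new ArrayList<>();
        for (String row : rows) {
            // The -1 limit keeps trailing empty fields so short rows still split predictably.
            String[] fields = row.split("\\|", -1);
            if (fields.length >= 2 && fields[1].equals(tui)) {
                cuis.add(fields[0]);
            }
        }
        return cuis;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration; real rows would come from a licensed UMLS download.
        List<String> rows = Arrays.asList(
                "C9000001|T007|Bacterium",
                "C9000002|T047|Disease or Syndrome",
                "C9000003|T007|Bacterium");
        System.out.println(cuisWithTui(rows, "T007")); // prints "[C9000001, C9000003]"
    }
}
```

The resulting CUI list would still need the matching term strings before it could be turned into a lookup dictionary; that step is what the dictionary creator tool mentioned above is for.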
RE: [VOTE] Release Apache cTAKES 3.2.0
+1 Pulled fresh candidate, built, and ran Clinical using CPE without problem. Other than that, no testing. SVN gave me a problem initially (checked out as anonymous) asking for a password then flunking the checkout, but an update completed it. I blame the heat.

From: Masanz, James J. [masanz.ja...@mayo.edu] Sent: Monday, June 30, 2014 10:24 PM To: dev@ctakes.apache.org Subject: RE: [VOTE] Release Apache cTAKES 3.2.0

This is pretty obvious, but since this is a record of what was voted upon, note that some of the URLs contain an extra ctakes-3.2.0/

For example
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz
should be just
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz

-- James

From: Pei Chen [chen...@apache.org] Sent: Friday, June 27, 2014 5:15 PM To: dev@ctakes.apache.org Subject: [VOTE] Release Apache cTAKES 3.2.0

Hi all, This is a call for a vote on releasing the following candidate (rc1) as Apache cTAKES 3.2.0. The major changes include:

- New optional YTEX component(s) (Yale Extensions to cTAKES)
- New optional improved/faster dictionary lookup (dictionary-lookup-fast)
- New optional Temporal component (Time + Event extraction. Relations will be included in a future release.)
- Other bug fixes/enhancements from Jira

[TODO: Online documentation still needs to be updated on wiki for the above]

For more detailed information on the changes/release notes, please visit:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313621&version=12324066

The release was made using the cTAKES release process documented here:
http://ctakes.apache.org/ctakes-release-guide.html

The candidate is available at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz /.zip

The tag to be voted on:
http://svn.apache.org/repos/asf/ctakes/tags/ctakes-3.2.0-rc1/

The MD5 checksum of the tarball can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz.md5 /.zip.md5

The signature of the tarball can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz.asc /.zip.asc

Apache cTAKES' KEYS file, containing the PGP keys used to sign the release:
https://dist.apache.org/repos/dist/release/ctakes/KEYS

Please vote on releasing these packages as Apache cTAKES 3.2.0. The vote is open for at least the next 72 hours. Only votes from the cTAKES PMC are binding, but folks are welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache cTAKES 3.2.0
[ ] -1 Do not release the packages because...

Also, the convenience binary can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-bin.tar.gz /.zip

Note: It's temporarily on people.a.o because the artifacts were too large for https://dist.apache.org/repos/dist/dev/ctakes (Working with infra on increasing the limit). Thanks!
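For anyone checking the candidate, comparing the downloaded tarball against the published .md5 file can be done with the JDK alone. This is a minimal sketch, not part of the release process: the class name Md5Check is made up, and it assumes the .md5 file starts with the hex digest (some tools write other layouts, in which case the parsing line needs adjusting).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch only: computes an MD5 digest and compares it with the contents of a .md5 file. */
final class Md5Check {

    /** Returns the lower-case hex MD5 digest of the given bytes. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every conforming Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Assumes the first whitespace-separated token of the .md5 file is the digest. */
    static boolean matches(byte[] data, String md5FileContents) {
        String expected = md5FileContents.trim().split("\\s+")[0];
        return md5Hex(data).equalsIgnoreCase(expected);
    }

    public static void main(String[] args) {
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Hex(sample)); // prints "5d41402abc4b2a76b9719d911017c592"
    }
}
```

For a real check, the bytes would come from java.nio.file.Files.readAllBytes on the downloaded archive. MD5 only guards against a corrupted download; the .asc signature and the KEYS file are what establish who produced the artifact.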
RE: [VOTE] Release Apache cTAKES 3.2.0 (rc2)
+1 for the ytex method of handling a umls login before download of the umls resources. While this also doesn't truly prevent people from sharing files (data) without a umls account, it is a little bit of a nicer mechanism. Aside ... Does anybody out there have experience with izpack? (izpack.org) Creation of an InstallAnywhere style module is under consideration ... -Original Message- From: vijay garla [mailto:vnga...@gmail.com] Sent: Wednesday, July 09, 2014 10:30 AM To: dev@ctakes.apache.org Subject: Re: [VOTE] Release Apache cTAKES 3.2.0 (rc2) ctakes-ytex-lib-3.1.2-SNAPSHOT.zip https://ytex.googlecode.com/files/ctakes-ytex-lib-3.1.2-SNAPSHOT.zip - this contains non-asf compliant ytex libs. I would like to add it to the sourceforge site / or add it to the ctakes resources directly (that way users simply have to unzip a single zip file) ctakes-ytex-resources-3.1.2-SNAPSHOT.zip http://www.ytex-nlp.org/umls.download/secure/3.1/ctakes-ytex-resources- 3.1.2-SNAPSHOT.zip - this contains data derived from the UMLS - concept graphs and dictionary lookup tables. downloading this requires a UTS login. It is conceptually no different from the ctakes resources, so I believe it would be OK to add it to that zip file, but I'm not a lawyer. On another note: I think forcing users to specify the UTS username/password and contacting NIH every time you run cTAKES is problematic, and doesn't prevent users who don't have a valid UTS login from viewing the data contained in the lucene index dictionary. I personally believe requiring a UTS login to download would be the best way to make resources derived from the UMLS available to users (this is what I'm doing for ytex-resources). to summarize: for now, I would like to add the ytex libs to the ctakes resources zip. 
-vj

On Wed, Jul 9, 2014 at 4:04 PM, Chen, Pei <pei.c...@childrens.harvard.edu> wrote:

The maven artifacts are also available in the staging area:
https://repository.apache.org/content/repositories/orgapachectakes-1001

VJ: Just curious, how did you envision ytex users downloading the jars/war? From the distro bin.zip or from maven central?

--Pei

-----Original Message-----
From: Pei Chen [mailto:chen...@apache.org]
Sent: Tuesday, July 08, 2014 6:11 PM
To: dev@ctakes.apache.org
Subject: [VOTE] Release Apache cTAKES 3.2.0 (rc2)

Hi all,

The main difference between rc1 and rc2 is that we removed the lvg-res and assertion-res.jar from the distro. They still need to be unpacked.

This is a call for a vote on releasing the following candidate (rc2) as Apache cTAKES 3.2.0. The major changes include:
- New optional YTEX component(s) (Yale Extensions to cTAKES)
- New optional improved/faster dictionary lookup (dictionary-lookup-fast)
- New optional Temporal component (Time + Event extraction. Relations will be included in a future release.)
- Other bug fixes/enhancements from Jira
[TODO: Online documentation still needs to be updated on wiki]

For more detailed information on the changes/release notes, please visit:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313621&version=12324066

The release was made using the cTAKES release process documented here:
http://ctakes.apache.org/ctakes-release-guide.html

The candidate is available at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-src.tar.gz /.zip

The tag to be voted on:
http://svn.apache.org/repos/asf/ctakes/tags/ctakes-3.2.0-rc2

The MD5 checksum of the tarball can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-src.tar.gz.md5 /.zip.md5

The signature of the tarball can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-src.tar.gz.asc /.zip.asc

Apache cTAKES' KEYS file, containing the PGP keys used to sign the release:
https://dist.apache.org/repos/dist/release/ctakes/KEYS

Please vote on releasing these packages as Apache cTAKES 3.2.0. The vote is open for at least the next 72 hours. Only votes from the cTAKES PMC are binding, but folks are welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache cTAKES 3.2.0
[ ] -1 Do not release the packages because...

Also, the convenience binary can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-bin.tar.gz /.zip

Note: It's temporarily on people.a.o because the artifacts were too large for https://dist.apache.org/repos/dist/dev/ctakes (working with infra on increasing the limit).

Thanks!
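
For anyone checking the candidate before voting, the checksum comparison mentioned in the vote mail could look roughly like the sketch below. It assumes the tarball and its .md5 file have already been downloaded, and that the first whitespace-separated token in the .md5 file is the hex digest (the published file may use a different layout). The class name Md5Check and its methods are made up for illustration and are not part of the cTAKES release process; only the JDK's java.security.MessageDigest is used. Verifying the .asc signature against KEYS is a separate step.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of cTAKES: compares a file's MD5 with a published digest.
public class Md5Check {

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] digest) {
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Hex-encoded MD5 of an in-memory byte array. */
    public static String md5Hex(byte[] data) {
        return toHex(newMd5().digest(data));
    }

    /** Hex-encoded MD5 of a file, read in chunks so large tarballs are not loaded whole. */
    public static String md5Hex(Path file) {
        MessageDigest md = newMd5();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return toHex(md.digest());
    }

    /** Assumption: the digest is the first whitespace-separated token of the .md5 file. */
    public static boolean matches(String md5FileContent, String actualHex) {
        String expected = md5FileContent.trim().split("\\s+")[0];
        return expected.equalsIgnoreCase(actualHex);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: Md5Check <tarball> <md5-file>");
            return;
        }
        String published = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        System.out.println(matches(published, md5Hex(Paths.get(args[0]))) ? "MD5 OK" : "MD5 MISMATCH");
    }
}
```

Usage would be along the lines of: java Md5Check apache-ctakes-3.2.0-src.tar.gz apache-ctakes-3.2.0-src.tar.gz.md5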
RE: Lucene for UMLS2014
Hi Harpreet,

If you are willing to use cTakes 3.2, try the dictionary-lookup-fast module as a replacement for the default dictionary-lookup. That module has a new dictionary resource (hsql, not lucene) and slightly different methods for lookup and matching. In time trials it has been faster than the default module (hence the name). Accuracy depends upon the parameter settings, but in the tests performed so far the results are comparable or better.

The new dictionary is much leaner than the current default dictionary, small enough to port from the hsql cached version to a hsql in-memory version. Using the in-memory version makes dictionary lookup practically instantaneous (hundredths of a second).

Limited documentation is available in the module's doc/ directory. I will be on vacation for a week, but please don't hesitate to write if you have any questions.

Sean

From: Harpreet Khanduja [hsk5...@rit.edu]
Sent: Thursday, July 17, 2014 5:07 PM
To: dev@ctakes.apache.org
Subject: Lucene for UMLS2014

Hello,

I would be grateful if someone could help. I created a lucene index for umls2014, but only for the snomed vocabulary. I did this because I thought it would reduce the dictionary lookup time, but it is still almost the same. Is there any other way to improve the dictionary lookup time?

Thank you,
Harpreet
RE: question about sentence segmentation
Hi Tim,

"It would be preferable to me to put sentence breaks in between the sections, so the first two sentences would be: 1) PE: Lymphnodes... 2) Lungs: normal..."

The punctuation is (always) after the logical break, being Term: for a Term:Definition list. I think that the first three sentences should be

1) PE:
2) Lymphnodes: neck and ...
3) CV: regular and ...

where the first line is an overarching Term: sentence (tree root), because each Term:Definition line that follows is within the physical exam. Just an fyi.

Does that make sense? Haven't had my coffee ...

Sean

-----Original Message-----
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Saturday, August 02, 2014 7:44 AM
To: dev@ctakes.apache.org
Subject: RE: question about sentence segmentation

I'm annotating some oncology notes from SHARP right now, and they are basically a nightmare for our current sentence segmentation model, mainly because they eschew explicit markers between sentences. I thought I'd ping the list with some interesting examples just in case it stimulates ideas. But it seems to me that at some point we'll have to augment the opennlp module (preferable) or roll our own to handle cases like these.

In this example a bunch of background is on one line with no punctuation between logical breaks:

    PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to auscultation CV: regular rate and rhythm without murmur or gallop , S1, S2 normal, no murmur, click, rub or gal*, chest is clear without rales or wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative findings right/left breast with mild swelling, warmth, mild erythema, slightly tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.

It would be preferable to me to put sentence breaks in between the sections, so the first two sentences would be:

1) PE: Lymphnodes...
2) Lungs: normal...

but without any candidate characters to split the sentence I don't think it is possible.
Another example that breaks our model in a different way (truncated):

    1. Baseline labwork including tumor markers 2. Start DD AC on Friday 8/1 with RN chemo teach 3. S U parent study

Our model will break on the period after the number, so we'd probably get:

    1. Baseline labwork including tumor markers 2. Start DD 3. S U parent study

So the number is going in exactly the wrong place. Here it would be preferable to get:

    1. Baseline labwork... 2. Start DD... 3. S U parent study

Anyways, just something to think about! The problem is much more complex in clinical data than in edited text, but I'm sure we all knew that already :)

Tim

From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
Sent: Monday, July 28, 2014 2:38 PM
To: dev@ctakes.apache.org
Subject: Re: question about sentence segmentation

Yes, you're right about that, Britt. I've been doing some annotations side by side with a treebank viewer and think I have a pretty good handle on the actual rules. Basically, if a header or list identifier is followed by a period or a newline it is considered a sentence break, and otherwise it is part of the sentence. E.g.

    1. 20 mg flomax

is two sentences, while:

    1 - 20 mg flomax

is one sentence. For headings:

    Allergies: Pt is allergic to aspirin.

is one sentence, while:

    Allergies:
    Pt is allergic to aspirin.

is two sentences. I'm planning to follow these guidelines.

Tim

On 07/28/2014 01:53 PM, britt fitch wrote:

Thanks for the document, Tim. It seems to not be explicit about how to handle sentences occurring in lists. Are you still considering having the list number as outside of the sentence?
Thanks
Britt

On Jul 25, 2014, at 7:09 AM, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:

Checking with Guergana and other colleagues here, the advice is to have the sentence segmenter follow the treebank guidelines for sentence segmentation:
http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf

They are a bit light on detail, but fortunately we have some treebanked data, so I will use that for the training data and hopefully that will illuminate the tricky cases.

Tim

From: Masanz, James J. [masanz.ja...@mayo.edu]
Sent: Tuesday, July 15, 2014 4:39 PM
To: 'dev@ctakes.apache.org'
Subject: RE: question about sentence segmentation

Sorry, I don't know if there was a reason. If you haven't checked with Guergana, you might want to ask her if she had a reason or if it was just the way it had been since that corpus was created.

-----Original Message-----
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Tuesday, July 15, 2014 3:34 PM
To:
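
The header/list-identifier rule Tim describes in this thread (a leading list number or heading is its own sentence only when it is followed by a period or a newline) can be sketched in a few lines of plain Java. This is only an illustration of that one rule, not the cTAKES or OpenNLP sentence detector; the class name ListMarkerRule, the regular expression, and the decision to look only at the start of the text are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the rule from the thread; not cTAKES code.
public class ListMarkerRule {

    // Group 1: a leading list number ("1") or heading ("Allergies:").
    // Group 2: the character that makes it a sentence break: a period or a newline.
    private static final Pattern MARKER_THEN_BREAK =
            Pattern.compile("^(\\d+|[A-Za-z][A-Za-z ]*:)([.\\n])");

    /** Splits off a leading list identifier or heading only if a period or newline follows it. */
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = MARKER_THEN_BREAK.matcher(text);
        if (m.find()) {
            // Keep the period with the marker ("1."); a newline is simply dropped.
            String head = m.group(1) + (".".equals(m.group(2)) ? "." : "");
            String rest = text.substring(m.end()).trim();
            sentences.add(head);
            if (!rest.isEmpty()) {
                sentences.add(rest);
            }
        } else {
            // Marker (if any) is followed by something else, so it stays inside the sentence.
            sentences.add(text.trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split("1. 20 mg flomax"));
        System.out.println(split("1 - 20 mg flomax"));
        System.out.println(split("Allergies: Pt is allergic to aspirin."));
        System.out.println(split("Allergies:\nPt is allergic to aspirin."));
    }
}
```

A trained model would of course learn this from the treebanked data rather than from a hand-written pattern; the sketch is only meant to pin down the four examples given above.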
RE: code value for vocabulary in dic-lookup-fast
Hi Harpreet,

I don't know if this has yet been answered (I'm still finding vacation-time emails), but the Snomed-ct, Rx-norm, etc. codes were removed from the -fast dictionary for speed. Basically, any single UMLS Cui can have multiple different snomed-ct codes (for instance), and adding extra rows per code leads to a lot of waste. A post-Cui-assignment step could be performed to assign non-unique snomed-ct codes (for instance) to discovered unique Cuis.

I am actually (slowly) conceptualizing an annotator that does just that: mapping Cuis to other source codes. It would be an optional annotator, lean and fast. No promise on a date for startup code in sandbox.

Sean

-----Original Message-----
From: Harpreet Khanduja [mailto:hsk5...@rit.edu]
Sent: Friday, July 25, 2014 2:33 PM
To: dev@ctakes.apache.org
Subject: code value for vocabulary in dic-lookup-fast

Hello,

I am using ctakes-dictionary-lookup-fast for annotation purposes. But there is no value for the code attribute like there was when I used ctakes-dictionary-lookup. Is there any way I can find out the code attribute value using ctakes-dictionary-lookup-fast?

Thank you so much for the help,
Harpreet
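
The post-assignment step Sean describes (look up the unique Cuis first, then attach the possibly many source codes afterwards) amounts to a one-to-many map consulted after lookup. The sketch below is a plain-Java illustration under that reading; it is not the annotator mentioned in the thread, the class name CuiCodeMapper is invented, and the Cui and code strings used with it are placeholders rather than real UMLS or SNOMED CT values.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: one Cui maps to zero or more source-vocabulary codes.
public class CuiCodeMapper {

    private final Map<String, Set<String>> codesByCui = new HashMap<>();

    /** Registers one (Cui, code) pair; a Cui may be registered with many codes. */
    public void addMapping(String cui, String code) {
        codesByCui.computeIfAbsent(cui, k -> new TreeSet<>()).add(code);
    }

    /** All codes known for a Cui, or an empty set if none were registered. */
    public Set<String> codesFor(String cui) {
        return codesByCui.getOrDefault(cui, Collections.emptySet());
    }

    /** Post-lookup step: attach codes to the Cuis that the dictionary lookup discovered. */
    public Map<String, Set<String>> expand(Collection<String> discoveredCuis) {
        Map<String, Set<String>> expanded = new LinkedHashMap<>();
        for (String cui : discoveredCuis) {
            expanded.put(cui, codesFor(cui));
        }
        return expanded;
    }
}
```

Keeping the codes out of the lookup table and in a side map like this is what avoids the extra rows per code during matching; the cost is one extra pass over the discovered Cuis.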
RE: v_snomed_fword_lookup view
Hi Clayton,

I don't know how the ytex dictionary lookup works, so I'm afraid that I can't help you with an answer. Maybe Vijay is the best person to do this. If you aren't tied to ytex you could try the new cTakes dictionary-lookup-fast. I tested "Patient came in with a malar rash" and it found "malar" and "malar rash".

Vijay,

At some point the lookup-fast module will be the default for the cTakes clinical pipeline. In order to synchronize the ytex lookup with cTakes, would you like to eventually work together on reusing the same code for ytex? I have no idea what ytex does, but I know the ins and outs of the cdl-fast module.

Sean

-----Original Message-----
From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf Of Clayton Turner
Sent: Friday, August 08, 2014 2:08 PM
To: dev@ctakes.apache.org
Subject: v_snomed_fword_lookup view

Hi Everyone:

I have a question about how the v_snomed_fword_lookup view works when running the CPE. My understanding is that it is a view comprised of the ytex.umls_aui_fword table, the umls.mrconso table, and bits/pieces from other umls tables. I feel like this is not completely correct, or my idea of how the join that creates the view works is off.

For example, let's say I want the CPE to find malar (e.g. malar rash) as a concept in the annotations. It never happens after running my CPE descriptor, and I cannot find it in my v_snomed_fword_lookup view.

    select count(*) from umls_aui_fword where fword='malar';

yields 34 results.

    select count(*) from umls.mrconso where str='malar';

yields 3 results.

So clearly these two tables know what the cui and context(s) are for malar. Yet, whenever I run a gold standard set of notes through the CPE, malar is constantly flagged as just a word token and the concept is never grabbed. This is recurrent for lots of other concepts as well; I just wanted to use an example to illustrate my issue.
Some troubleshooting I already went through:

1) Reinstalled ytex and umls database objects
2) Reinstalled a second time after redownloading umls through metamorphosys, ensuring that snomed vocabularies were included (also checked file sizes and noticed a big difference, so I know those vocabularies ARE included)

Anyone got any ideas as to what the issue could be?

Thank you,
Clayton Turner
RE: v_snomed_fword_lookup view
Thanks Harpreet,

That is definitely necessary to build! Those lines should already be in the pom, but commented out. I think that some version/branching issues may have arisen at some point wrt this module ... If somebody beats me to it then cheers, otherwise I will try to check out tonight and get all the bits in place.

Sean

-----Original Message-----
From: Harpreet Khanduja [mailto:hsk5...@rit.edu]
Sent: Monday, August 11, 2014 1:12 PM
To: dev@ctakes.apache.org
Subject: Re: v_snomed_fword_lookup view

Hello Clayton,

I do not know about ytex, but I did switch from dictionary-lookup to dictionary-lookup-fast.

I updated my ctakes-dictionary-lookup-fast project using maven. I think I used Team-Update and switched to the latest revision available, and then I downloaded new 3.2 resources from the for umls. Then I added these resources to my ctakes-dictionary-lookup-fast resources folder and also to the classpath in ctakes-clinical-pipeline.

Then I changed the pom.xml file which belongs to the whole ctakes project and added these two dependencies to the file:

    <dependency>
      <groupId>org.apache.ctakes</groupId>
      <artifactId>ctakes-dictionary-lookup-res</artifactId>
      <version>${ctakes.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.ctakes</groupId>
      <artifactId>ctakes-dictionary-lookup-fast</artifactId>
      <version>${ctakes.version}</version>
    </dependency>

After this, I also added the dependency

    <dependency>
      <groupId>org.apache.ctakes</groupId>
      <artifactId>ctakes-dictionary-lookup-fast</artifactId>
    </dependency>

to the pom.xml of ctakes-clinical-pipeline. And then add the resources folder in ctakes-clinical-pipeline using build path configuration under add class option. After this it should work.

Regards,
Harpreet

On Mon, Aug 11, 2014 at 12:44 PM, Clayton Turner <caturn...@g.cofc.edu> wrote:

I still get the same error with the ctakes3.2 branch. Any suggestions?
On Mon, Aug 11, 2014 at 12:06 PM, Clayton Turner <caturn...@g.cofc.edu> wrote:

I'm going to do a clean install through the repo rather than the binaries and see if that fixes my issue, because I think I just read a past post saying the lookup2 folders exist there.

On Mon, Aug 11, 2014 at 11:52 AM, Clayton Turner <caturn...@g.cofc.edu> wrote:

When navigating to ctakes-dictionary-lookup-fast\desc\analysis_engine there are 2 files, presumably analysis engines: SnomedLookupAnnotator.xml and SnomedOvLookupAnnotator.xml.

If I pick either, I put in my UMLS information but receive an error when trying to run the CPE:

    Initialization of CAS Processor with name SnomedOvLookupAnnotator failed.
    CausedBy: org.apache.uima.resource.ResourceConfigurationException: Initialization of CAS processor with name SnomedOvLookupAnnotator failed.
    CausedBy: org.apache.uima.resource.ResourceInitializationException: Error initializing org.apache.uima.resource.impl.DataResource_impl from descriptor file:..SnomedLookupAnnotator.xml
    CausedBy: org.apache.uima.resource.ResourceInitializationException: Could not access the resource data at file:org\apache\ctakes\dictionary\lookup2\Snomed2011ab_ctakesTui\cTakesSnomed.xml

Now, I don't even have a lookup2 folder and, consequently, the Tui folder and cTakesSnomed.xml file. This seems to be the problem, but I'm not sure where these files are supposed to be grabbed from.

On Mon, Aug 11, 2014 at 11:47 AM, Clayton Turner <caturn...@g.cofc.edu> wrote:

Hi again:

How exactly do you switch to using the cTakes dictionary-lookup-fast? Do I need to go in and alter xml files, or is it as simple as adding a certain item to the list of analysis engines?

On Fri, Aug 8, 2014 at 3:48 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:

Hi Clayton,

I don't know how the ytex dictionary lookup works, so I'm afraid that I can't help you with an answer. Maybe Vijay is the best person to do this. If you aren't tied to ytex you could try the new cTakes dictionary-lookup-fast.
Youtube Channel Apache cTakes
cTakes now has a youtube channel named Apache cTakes. It is empty, but if you have ever made a training video, presentation on a component (descriptors, type system, etc.), or demo of integration with another system (UimaFit, Uima-AS, etc.) then please feel free to post on that channel. When there is content the Apache pages can have a link to the channel. Sean
RE: v_snomed_fword_lookup view
"is the purpose of a CasConsumer to essentially save your data"

Correct, though it is a generic (and archaic) term indicating any end-user of the cas.

-----Original Message-----
From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf Of Clayton Turner
Sent: Wednesday, August 13, 2014 2:10 PM
To: dev@ctakes.apache.org
Subject: Re: v_snomed_fword_lookup view

Oh okay, so is the purpose of a CasConsumer to essentially save your data in a representation that you can do some kind of data mining or classification on? If so, then I think I need to look into making/using one of those.

On Wed, Aug 13, 2014 at 1:41 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:

Hi Clayton,

I'm glad that you got it working. Though I stated that I would, I haven't yet checked the fidelity of trunk. Urgent data request one day, must have writing the next ... and I still live with the delusion that I left academia to have free time ...

I have never used ytex or weka, so I'm unfamiliar with all things .arff. Could it be that the ytex .arff exporter needs to change consumed cTakes annotation classes (3.1)?

I have a custom CasConsumer that saves text spans and Cuis to file in a simple list, and that is what I used for the performance analysis of the lookup module. For our other projects here in Beantown we have other various outputs that fit the job at hand: text flat files, xml files, sql database tables, knot-encoded lace doilies, etc.

I'm sure that none of the above helps you, but I felt obliged to provide some kind of answer to your question.

Sean

-----Original Message-----
From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf Of Clayton Turner
Sent: Wednesday, August 13, 2014 12:25 PM
To: dev@ctakes.apache.org
Subject: Re: v_snomed_fword_lookup view

Okay, I believe I have ctakes dictionary fast working now. Something I'm curious about, though, is how you extract the data in order to conduct analysis.
In the past I've been using the SparseDataExporterImpl from ytex in order to create a .arff file for use in weka, but the ctakes pipeline I'm using doesn't seem to be compatible with this ytex exporting, as I'm not getting any cuis in my arff file. I'm using the aggregate plain text umls processor analysis engine from ctakes and then using the dbconsumer analysis engine from ytex (for storing into the database with regard to analysis batch).

Any tips for exporting, or some simple issue I'm missing?

Thanks,
Clayton

On Mon, Aug 11, 2014 at 2:09 PM, Harpreet Khanduja <hsk5...@rit.edu> wrote:

Yes, absolutely and no problem at all.

Regards,
Harpreet

On Mon, Aug 11, 2014 at 1:16 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:

Thanks Harpreet,

That is definitely necessary to build! Those lines should already be in the pom, but commented out. I think that some version/branching issues may have arisen at some point wrt this module ... If somebody beats me to it then cheers, otherwise I will try to check out tonight and get all the bits in place.

Sean

-----Original Message-----
From: Harpreet Khanduja [mailto:hsk5...@rit.edu]
Sent: Monday, August 11, 2014 1:12 PM
To: dev@ctakes.apache.org
Subject: Re: v_snomed_fword_lookup view

Hello Clayton,

I do not know about ytex, but I did switch from dictionary-lookup to dictionary-lookup-fast.

I updated my ctakes-dictionary-lookup-fast project using maven. I think I used Team-Update and switched to the latest revision available, and then I downloaded new 3.2 resources from the for umls. Then I added these resources to my ctakes-dictionary-lookup-fast resources folder and also to the classpath in ctakes-clinical-pipeline.
RE: v_snomed_fword_lookup view
You can find example Cas Consumers in cTakes-core ..[dirPath]../cc/

-----Original Message-----
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Wednesday, August 13, 2014 2:20 PM
To: dev@ctakes.apache.org
Subject: Re: v_snomed_fword_lookup view

There's nothing conceptually special about the consumer model vs. regular annotators (Analysis Engines). You can write an output format from any analysis engine as long as it is after the annotations you need in the pipeline.

If you have global constraints (like in an ARFF file, where I think you need to know all the CUIs in your corpus to write the attribute list?), then it is important to use the process() method [called once per document] to store CUIs in a non-UIMA class variable (for example, a map from file id to a list/set/multiset of CUIs), and then use the collectionProcessComplete() method [called once after all documents have been processed] to do the actual writing of the file.

Hope that is useful. Sorry I couldn't tie it in to your previous YTEX exporter, but I'm not familiar with that process.

Tim

On 08/13/2014 02:11 PM, Clayton Turner wrote:

Oh okay, so is the purpose of a CasConsumer to essentially save your data in a representation that you can do some kind of data mining or classification on? If so, then I think I need to look into making/using one of those.

On Wed, Aug 13, 2014 at 1:41 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:

Hi Clayton,

I'm glad that you got it working. Though I stated that I would, I haven't yet checked the fidelity of trunk. Urgent data request one day, must have writing the next ... and I still live with the delusion that I left academia to have free time ...

I have never used ytex or weka, so I'm unfamiliar with all things .arff. Could it be that the ytex .arff exporter needs to change consumed cTakes annotation classes (3.1)?
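
The two-phase pattern Tim describes (collect CUIs per document in process(), write the file once in collectionProcessComplete()) can be sketched without any UIMA classes. The code below is a plain-Java stand-in: the method names mirror the UIMA lifecycle methods named in the thread, but the signatures are simplified assumptions (a real consumer would receive a CAS/JCas and extend a UIMA base class), the class name CuiArffCollector is invented, and the output is a dense 0/1 ARFF table rather than the sparse format the ytex exporter produces.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.StringJoiner;
import java.util.TreeSet;

// Hypothetical stand-in for a UIMA consumer; no UIMA types are used here.
public class CuiArffCollector {

    // Non-UIMA state that survives across documents: document id -> CUIs found in it.
    private final Map<String, Set<String>> cuisByDoc = new LinkedHashMap<>();

    /** Stand-in for process(): called once per document, only stores what was found. */
    public void process(String docId, Collection<String> cuis) {
        cuisByDoc.computeIfAbsent(docId, k -> new TreeSet<>()).addAll(cuis);
    }

    /** Stand-in for collectionProcessComplete(): called once, builds the whole ARFF text. */
    public String collectionProcessComplete() {
        // The attribute list needs every CUI in the corpus, which is only known now.
        SortedSet<String> allCuis = new TreeSet<>();
        for (Set<String> docCuis : cuisByDoc.values()) {
            allCuis.addAll(docCuis);
        }
        StringBuilder arff = new StringBuilder("@relation cuis\n");
        for (String cui : allCuis) {
            arff.append("@attribute ").append(cui).append(" {0,1}\n");
        }
        arff.append("@data\n");
        // One row per document, one 0/1 column per CUI, in attribute order.
        for (Set<String> docCuis : cuisByDoc.values()) {
            StringJoiner row = new StringJoiner(",");
            for (String cui : allCuis) {
                row.add(docCuis.contains(cui) ? "1" : "0");
            }
            arff.append(row).append('\n');
        }
        return arff.toString();
    }
}
```

The point of the split is the one Tim makes: the header cannot be written until the last document has been seen, so nothing is written during the per-document calls.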
RE: Web server
Hi John,

Have you (or another) thought about modifying the Uima Simple Server to run a cTakes pipeline?
http://uima.apache.org/sandbox.html#simple-server

-----Original Message-----
From: John Green [mailto:john.travis.gr...@gmail.com]
Sent: Thursday, August 21, 2014 3:06 PM
To: dev@ctakes.apache.org
Subject: Web server

I'm trying to deploy the cTakes web-server code someone already wrote (who wrote it, btw?). I'm running into deployment issues in eclipse with tomcat 7 on mac... I can get into details, but for now: is it in a working state? I'm learning as I go, and it looks in order and the code is solid...

Also, Pei: did they check in an LVG version that is thread safe now? I'm really set on getting cTakes into a fluid RESTful interface.

JG
RE: Ctakes to process 5000K recoreds
Hi Nick,

I think that the bottleneck is probably the lookup module itself. So, I just sent you a secure email/ftp link. It contains a build of the new dictionary-lookup-fast module. Should you choose to try it, let me know how things turn out.

Sean

From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:10 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Thanks, let me try it.

Nick

-----Original Message-----
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 4:08 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

If you just need the medication names, you can remove these:

    <node>ContextDependentTokenizerAnnotator</node>
    <node>DependencyParser</node>
    <node>AssertionAnnotator</node>

You might be able to get rid of the LvgAnnotator and still get decent results, since variations of word form should not affect medication names. I would try with it and without it on a smaller set of files and see if you see a difference.

I believe the others are needed by the default configs for medication lookup. For example, POS is used to get phrase type. Phrases are used to remove verb phrases from the lookup, and therefore also to keep the lookup windows from getting too big.

I'm more familiar with the other types of named entities (diseases, symptoms, etc.) than with medications.

-----Original Message-----
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 3:01 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

James,

Do you have any suggestion about running cTakes with the minimum annotators that can return Medications in DictionaryLookupAnnotator?

Thanks,
Nick

-----Original Message-----
From: Masanz, James J.
[mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 3:05 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

I suspect that when you take out the simple segment annotator, nothing is getting processed, and that is why it appears so fast. At least some of the annotators loop through the list of sections/segments, which is why there is a simple segment annotator: so that there is at least one section/segment identified.

Are you getting any annotations at all?

-----Original Message-----
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 2:02 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Pei,

I need the names of the medications for the application that I wrote, which uses ctakes. So I cache the medications in DictionaryLookupAnnotator (in performLookup()) and use them in my program, but when I have SimpleSegmentAnnotator it just takes forever. After taking SimpleSegmentAnnotator out, no medication name is returned in DictionaryLookupAnnotator in the code. So I was wondering if there is a way that I could eliminate SimpleSegmentAnnotator but still be able to get the medication names in that class?

Nick

-----Original Message-----
From: Pei Chen [mailto:chen...@apache.org]
Sent: Tuesday, September 09, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: Ctakes to process 5000K recoreds

Nick,

When you say no medication is being annotated, I presume you mean the medication attributes (i.e. dosage, frequency, etc.) are not being annotated? I think the DrugNER needs a list of section names in the config; I think it includes SIMPLE_SEGMENT.

I am very surprised that SimpleSegmentAnnotator is the bottleneck though; all it does is assume the entire document is a single section called SIMPLE_SEGMENT. Have you tried commenting out the DependencyParser if you're not using those features?
--Pei

On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish <snika...@emerginghealthit.com> wrote:

Hi there,

I am using Ctakes to process 5000K free text records, where each record has several medications. This is the fixed flow that it goes through:

    <node>SimpleSegmentAnnotator</node>
    <node>SentenceDetectorAnnotator</node>
    <node>TokenizerAnnotator</node>
    <node>LvgAnnotator</node>
    <node>ContextDependentTokenizerAnnotator</node>
    <node>POSTagger</node>
    <node>Chunker</node>
    <node>LookupWindowAnnotator</node>
    <node>DictionaryLookupAnnotatorDB</node>
RE: Ctakes to process 5000K recoreds
Just use it with cTakes. Instead of removing other modules from the pipeline, replace the dictionary-lookup with dictionary-lookup-fast. For the desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml, you would modify:

    <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
      <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
    </delegateAnalysisEngine>

To be:

    <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
      <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
    </delegateAnalysisEngine>

That should be it. You can then leave the rest of the module specifications alone.

Sean

From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:32 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Sean,

Many thanks, I will try it tomorrow. Do you have any special instructions to run that script, or do I have to use it with cTakes?

Thanks,
Nick

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 4:24 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Nick,

I think that the bottleneck is probably the lookup module itself. So, I just sent you a secure email/ftp link. It contains a build of the new dictionary-lookup-fast module. Should you choose to try it, let me know how things turn out.

Sean

From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:10 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Thanks, let me try it.

Nick

-----Original Message-----
From: Masanz, James J.
[mailto:masanz.ja...@mayo.edu] Sent: Tuesday, September 09, 2014 4:08 PM To: 'dev@ctakes.apache.org' Subject: RE: Ctakes to process 5000K recoreds If you just need the medication names, you can remove these: nodeContextDependentTokenizerAnnotator/node nodeDependencyParser/node nodeAssertionAnnotator/node You might be able to get rid of the LvgAnnotator and still get decent results since variations of word form should not affect medication names. I would try with it and without it on a smaller set of files and see if you see a difference. I believe the others are needed by the default configs for medication lookup. For example, POS is used to get phrase type. Phrases are used to remove verb phrases from the lookup and also therefore to keep the lookup windows from getting too big. I'm more familiar with the other types of named entities (diseases, symptoms, etc) than with medications. -Original Message- From: Nick Nikandish [mailto:snika...@emerginghealthit.com] Sent: Tuesday, September 09, 2014 3:01 PM To: dev@ctakes.apache.org Subject: RE: Ctakes to process 5000K recoreds James, Do you have any suggestion about running cTakes with minimum annotators that can return Medications in DictionaryLookupAnnotator? Thanks, Nick -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Tuesday, September 09, 2014 3:05 PM To: 'dev@ctakes.apache.org' Subject: RE: Ctakes to process 5000K recoreds I suspect that when you take out simple segment annotated, nothing is getting processed, and that is why it appears so fast. At least some of the annotators loop through the list of sections/segments, which is why there is a simple segment annotator - so that there is at least one section/segment identified. Are you getting any annotations at all? 
-Original Message- From: Nick Nikandish [mailto:snika...@emerginghealthit.com] Sent: Tuesday, September 09, 2014 2:02 PM To: dev@ctakes.apache.org Subject: RE: Ctakes to process 5000K recoreds Pei, I need the names of the medications for an application that I wrote, which uses cTakes. So I cache the medications in DictionaryLookupAnnotator (in performLookup()) and use them in my program, but when I have SimpleSegmentAnnotator in the flow it just takes forever. After taking SimpleSegmentAnnotator out, no medication name is returned in DictionaryLookupAnnotator in the code. So I was wondering if there is a way that I could eliminate SimpleSegmentAnnotator but still be able to get the medication names in that class? Nick -Original Message- From: Pei Chen [mailto:chen...@apache.org] Sent: Tuesday, September 09, 2014 2:54 PM To: dev@ctakes.apache.org Subject: Re: Ctakes to process 5000K recoreds Nick, When you say no medication is being annotated, I presume you mean the medication attributes (i.e. dosage, frequency, etc.) are not being annotated? I think the DrugNER needs a list of section names in the config; I think it includes SIMPLE_SEGMENT. I am very surprised that SimpleSegmentAnnotator is the bottleneck though; all it does is assume the entire document is a single
RE: cTakes output predictability
Steve Bethard wrote: I spent some time writing a script for diff-ing CASes I urge anyone interested in comparing cTakes CASes / output to use this type of approach. Comparison of program output is a post-process task, and unless absolutely necessary code to juggle data and metadata belongs there. Attempts to force every module past, present and Future to abide by fixed orderings, enumerations etc. is not as simple a task as one might initially think - especially if third-party libraries are involved. I won't get into problems associated with why one is comparing output (swapped module?) and IDs, orders etc. being different because of a possibly intentional difference. In addition to or instead of creating a post-processing script, one could write a new cas-consumer that writes output in a desired format - but this should not require changes to engines. If it ain't broke, don't fix it Sean -Original Message- From: Steven Bethard [mailto:steven.beth...@gmail.com] Sent: Monday, October 06, 2014 11:23 PM To: dev@ctakes.apache.org Subject: Re: cTakes output predictability On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen bruce.tiet...@perfectsearchcorp.com wrote: Since I started working with cTakes some time ago, I have found it difficult to compare the output between subsequent runs on the same files because annotations are often assigned different IDs, are listed in different order, etc. At one point, I spent some time writing a script for diff-ing CASes that intended to address some of these kinds of issues. It's still here in cTAKES: ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java You might see if you could use or adapt that to your needs. Steve
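A minimal sketch of the post-processing idea discussed above, for anyone who wants a quick text diff before reaching for CompareFeatureStructures. It assumes the two runs have been written out one annotation element per line; the class name CasLineNormalizer and the shortened sample lines are hypothetical and not part of cTAKES. It masks the per-run _id and _ref_* values and sorts the lines, so only differences in annotation content remain. Masking the references also hides a link that points at the wrong target, so this is a first-pass check rather than a full comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of cTAKES: prepares serialized annotation
// lines for a plain text diff between two runs on the same document.
class CasLineNormalizer {

    // Replace the values of _id and _ref_* attributes, which are assigned
    // per run, with a fixed placeholder. All other attributes are untouched.
    static String maskRunSpecificIds(String line) {
        return line.replaceAll("\\b(_id|_ref_\\w+)=\"?\\d+\"?", "$1=#");
    }

    // Mask every line, then sort so that listing order no longer matters.
    static List<String> normalize(List<String> lines) {
        List<String> normalized = new ArrayList<>();
        for (String line : lines) {
            normalized.add(maskRunSpecificIds(line.trim()));
        }
        Collections.sort(normalized);
        return normalized;
    }

    public static void main(String[] args) {
        // Shortened, made-up element names; real output uses full type names.
        List<String> firstRun = Arrays.asList(
            "<MedicationMention _id=\"9895\" begin=\"2075\" end=\"2081\" id=\"95\"/>",
            "<LookupWindowAnnotation _id=\"9869\" begin=\"3039\" end=\"3047\"/>");
        List<String> secondRun = Arrays.asList(
            "<LookupWindowAnnotation _id=\"15700\" begin=\"3039\" end=\"3047\"/>",
            "<MedicationMention _id=\"15928\" begin=\"2075\" end=\"2081\" id=\"95\"/>");
        // Same content, different ids and order: the normalized lists match.
        System.out.println(normalize(firstRun).equals(normalize(secondRun)));
    }
}
```

Because the normalization lives outside the pipeline, no annotator or engine has to change to support it.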
RE: cTakes output predictability
Hi Kim, One might want to compare the Sentence detector that uses end of line characters as sentence splitters with one that does not. Such a change in sentence splitting would not only affect the sentence type discoveries but also practically every type that follows. Another might want to compare a note with skin cancer vs. one in which you replace skin cancer with melanoma just to see what the CUI differences might be. There are changes in two words vs. one, 11 characters vs. 8, a removed adjective(?), and of course changes in CUIs. Of course, if you are just running notes on a new moon and then again on a full moon ... Sean -Original Message- From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] Sent: Tuesday, October 07, 2014 10:41 AM To: dev@ctakes.apache.org Subject: Re: cTakes output predictability Sean, ...being different because of a possibly intentional difference. I would like you to elaborate a bit on what would be intentionally different between the processing of the same document multiple times. It would help my understanding of cTakes. Thanks, Kim Ebert 1.801.669.7342 Perfect Search Corp http://www.perfectsearchcorp.com/ On 10/07/2014 07:30 AM, Finan, Sean wrote: Steve Bethard wrote: I spent some time writing a script for diff-ing CASes I urge anyone interested in comparing cTakes CASes / output to use this type of approach. Comparison of program output is a post-process task, and unless absolutely necessary code to juggle data and metadata belongs there. Attempts to force every module past, present and Future to abide by fixed orderings, enumerations etc. is not as simple a task as one might initially think - especially if third-party libraries are involved. I won't get into problems associated with why one is comparing output (swapped module?) and IDs, orders etc. being different because of a possibly intentional difference.
RE: cTakes output predictability
Hi Kim,
> It concerns me a bit by making the code return consistent results would be so concerning.
Could you please clarify what you mean by consistent results? Do you mean ordering and IDs or are you talking about actual type values not matching?
> This should be the default mode of operation.
Depending upon what you meant above, I may agree or disagree.
> Since it doesn't appear that there are any consequences with moving forward with changing the code
Why do you say this? I think that there may be more required changes than you realize. Every insertion into the CAS must be of ordered data. This means that, for instance, named entities discovered by dictionary will need to be inserted in some predictable order, such as by alphabetized cui per every alphabetized tui (and other code) per ordered text span. You will need to check and recheck every point at which the CAS is modified by every module. Right now there are at least three or four places in two cTakes dictionary modules where a change would be required - and that doesn't include YTEX lookup.
If you really feel strongly about this and are going to change cTakes code, then I suggest (at the risk of sounding like a complete jerk) that you also consider the following:
1. Don't check anything into trunk until all is well with your changes and tests
   Just in case you abandon the effort
2. Write unit tests for every change
   True, Map to LinkedMap shouldn't break anything, but they are good to have, and may prevent others in the future from switching back to a non-linked map or any unordered collection (set not list, etc.). It also makes a better place for explanation in Javadoc than inlines above the code.
3. Run memory requirement tests before all of your changes and then again after your changes
   I'm actually curious about how much memory might be eaten with linkages everywhere
4. Run performance (speed) tests before and after
   On a large corpus to ensure that garbage collection is involved
5. Do the above with every combination possible in current workflows: every combination of available sentence detector, pos tagger, smoking status detector, dictionary lookup, cas consumer, etc.
   As soon as somebody says all output is consistently ordered between runs it had better be so for every possible workflow
6. Write system tests to ensure ordered/predicted outputs with each combination
   Otherwise somebody may break it
7. Document the what, how, and why for future development
   Otherwise somebody won't know to stick to the new rules
8. Assist anybody as needed that in the future breaks one of these unit or system tests with a fix or new feature
   By mandating such a rule you are assuming responsibility for it
9. Assist anybody as needed that in the future adds a new module or workflow to cTakes to abide by the ordering requirement
   By mandating such a rule you are assuming responsibility for it
10. Assist anybody as needed that in the future adds a new module or workflow to add system tests to ensure maintenance of the ordering requirement
   By mandating such a rule you are assuming responsibility for it
-Original Message- From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] Sent: Tuesday, October 07, 2014 11:57 AM To: dev@ctakes.apache.org Subject: Re: cTakes output predictability I think we may really prefer the first method. Since it doesn't appear that there are any consequences with moving forward with changing the code, we would really like to move forward with this approach. Kim Ebert 1.801.669.7342 Perfect Search Corp http://www.perfectsearchcorp.com/ On 10/07/2014 09:35 AM, britt fitch wrote: The option Sean mentioned of writing your own custom consumer (without the UIMA id that is causing your issues) should meet these needs I believe.
Britt Fitch Wired Informatics 265 Franklin St Ste 1702 Boston, MA 02110 http://wiredinformatics.com britt.fi...@wiredinformatics.com On Oct 7, 2014, at 11:29 AM, Kim Ebert kim.eb...@perfectsearchcorp.com mailto:kim.eb...@perfectsearchcorp.com wrote: Hi Sean, Well of course that makes plenty of sense. Testing different cTakes configurations you would expect different output. In our testing we've found several cases where running with the same configuration outputs different data under different moons. Having consistent results helps us know if we've made improvements to our quality or not. Having output that is in a predictable order makes checking to see if there are differences much cheaper when you are dealing with larger data sets. Kim Ebert 1.801.669.7342 Perfect Search Corp http://www.perfectsearchcorp.com/ On 10/07/2014 08:50 AM, Finan, Sean wrote: Hi Kim, One might want compare the Sentence detector that uses end of line characters as sentence splitters with one that does not. Such a change in sentence splitting would not only effect the sentence type discoveries but also
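The predictable insertion order Sean describes in this thread (ordered text span, then alphabetized TUI, then alphabetized CUI) could be sketched roughly as below. This is an illustration only: the Hit class and its fields are hypothetical stand-ins for whatever a lookup module holds before it writes to the CAS, and the code does not touch any UIMA or cTAKES API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: sort dictionary hits before they are added to the CAS.
class OrderedInsertionSketch {

    // Stand-in for a discovered named entity; not a cTAKES type.
    static final class Hit {
        final int begin;
        final int end;
        final String tui;
        final String cui;

        Hit(int begin, int end, String tui, String cui) {
            this.begin = begin;
            this.end = end;
            this.tui = tui;
            this.cui = cui;
        }

        @Override
        public String toString() {
            return begin + "-" + end + " " + tui + " " + cui;
        }
    }

    // Text span first, then TUI, then CUI, each in ascending order.
    static final Comparator<Hit> INSERTION_ORDER =
        Comparator.comparingInt((Hit h) -> h.begin)
            .thenComparingInt(h -> h.end)
            .thenComparing(h -> h.tui)
            .thenComparing(h -> h.cui);

    // Returns a sorted copy; the caller would insert hits in this order.
    static List<Hit> inInsertionOrder(Collection<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(INSERTION_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up spans and codes, for illustration only.
        List<Hit> hits = Arrays.asList(
            new Hit(10, 20, "T121", "C0000002"),
            new Hit(10, 20, "T047", "C0000003"),
            new Hit(0, 4, "T047", "C0000001"));
        System.out.println(inInsertionOrder(hits));
    }
}
```

Every module that adds to the CAS would need an equivalent step, which is the maintenance cost the numbered list in Sean's message is pointing at.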
RE: cTakes output predictability
, Kim Ebert kim.eb...@perfectsearchcorp.com wrote: Hi Sean, No, you're not a jerk. These are things worth considering, and I understand your concerns with touching various points of the codebase. I'll talk with our group over here and see where we want to go. We are really interested in cTakes behaving well, so we are usually pretty careful in testing our changes before committing anything. Thanks, Kim Ebert 1.801.669.7342 Perfect Search Corp http://www.perfectsearchcorp.com/
RE: cTakes output predictability
); } } This will at most return one item from the Set. Since the set is an unordered hash, this will result in one of three options being returned. Is this a bug or a design decision? Which one is right? Which one is wrong? It may be that this is a design decision, but it would be nice if we are consistently right, or consistently wrong. Many other instances of this result in similar issues. Kim Ebert 1.801.669.7342 Perfect Search Corp http://www.perfectsearchcorp.com/ On 10/07/2014 12:43 PM, Finan, Sean wrote: I'm just about sapped on this topic. What comes below is my final writing. Kim wrote: Yes, I mean actual type values not matching. Ok, this is a very serious problem and should have nothing to do with ordering and/or IDs. I repeat: this should have nothing to do with ordering or ids. Reordering or changing ID assignment, while possibly producing repeatable output, will not necessarily fix the actual bug. Please write a Jira for each item, and (imo) we should think about withholding any non-bug-fix release until they have been dealt with. Bruce wrote: I did not intend to step on anyone's toes. No worries - I don't think that any toes have been stepped upon. It is good that questions and concerns are shared with the group. Note that in the first instance, there were two MedicationMentions, but in the second, there is only one. Assuming that the second drug mention doesn't appear elsewhere in output2 then this needs to be addressed. Please log a tar. Relating this to the order/id issue, which number of mentions is correct (2)? If you reorder will that consistently output two medications instead of one or one medication instead of two? This is most likely a bug in the identification and/or storage and/or retrieval code and needs to be fixed there. Yes, everyone could write their own custom compare code, but wouldn't it be more valuable to the community to make that task easier?
I would hope that a reusable Cas-Consumer that sorts and re-IDs annotations could be started and people could add to it as needed. I would also hope that a reusable post-process comparison utility could be started and improved/maintained. Sean -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Tuesday, October 07, 2014 1:21 PM To: dev@ctakes.apache.org Subject: Re: cTakes output predictability I did not intend to step on anyone's toes. One of the reasons I proposed the changes was to try to make it extremely obvious when there are significant differences in output from the cTakes pipeline when running the same document again, and once identified, make it easier to identify the source of the difference. Because of the huge number of differences between the outputs when using the FileWriterCasConsumer.xml, first detecting that there are significant differences and then identifying them for a large set of documents is a daunting task. The following is an example of some significant differences that I have detected between two subsequent runs on the same document using the current release of cTakes. (There are actually quite a few documents that exhibit this kind of behavior. This is only one example.)
Snippet from first run:
<org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation _indexed="1" _id="9869" _ref_sofa="3" begin="3039" end="3047"/>
<org.apache.ctakes.typesystem.type.textsem.MedicationMention _indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95" _ref_ontologyConceptArr="9891" typeID="1" segmentID="SIMPLE_SEGMENT" discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1" conditional="false" generic="true" subject="patient" historyOf="0"/>
<org.apache.ctakes.typesystem.type.textsem.MedicationMention _indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110" _ref_ontologyConceptArr="9934" typeID="1" segmentID="SIMPLE_SEGMENT" discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1" conditional="false" generic="false" subject="patient" historyOf="0"/>
<org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention _indexed="1" _id="9979" _ref_sofa="3" begin="0" end="4" id="0" _ref_ontologyConceptArr="9976" typeID="2" segmentID="SIMPLE_SEGMENT" discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0" conditional="false" generic="false" subject="patient" historyOf="0"/>
Snippet from subsequent run:
<org.apache.ctakes.typesystem.type.textsem.ProcedureMention _indexed="1" _id="15773" _ref_sofa="3" begin="2929" end="2933" id="125" _ref_ontologyConceptArr="15770" typeID="5" segmentID="SIMPLE_SEGMENT" discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0" conditional="false" generic="false" subject="patient" historyOf="0"/>
<org.apache.ctakes.typesystem.type.textsem.MedicationMention _indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95" _ref_ontologyConceptArr="15924" typeID="1" segmentID="SIMPLE_SEGMENT" discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1" conditional="false" generic="true" subject="patient"
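Kim's point earlier in this thread, that code returning a single item from an unordered Set can return a different item on each run, can be illustrated with a small sketch. The Concept class here is hypothetical and deliberately does not override hashCode(), so its position in a HashSet depends on the identity hash, which is not guaranteed to be the same from one JVM run to the next. Choosing by an explicit ordering removes that dependence; whether the lowest code is the right item to keep is a separate design question that the sketch does not answer.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: choosing one element from an unordered set.
class DeterministicPick {

    // Stand-in for a concept object; no equals()/hashCode() override.
    static final class Concept {
        final String code;

        Concept(String code) {
            this.code = code;
        }
    }

    // Order dependent: returns whichever element the set iterates first.
    static Optional<Concept> pickFirst(Set<Concept> concepts) {
        return concepts.stream().findFirst();
    }

    // Deterministic: always returns the element with the lowest code.
    static Optional<Concept> pickLowestCode(Set<Concept> concepts) {
        return concepts.stream().min(Comparator.comparing((Concept c) -> c.code));
    }

    public static void main(String[] args) {
        // Made-up codes, for illustration only.
        Set<Concept> concepts = new HashSet<>(Arrays.asList(
            new Concept("C0000003"), new Concept("C0000001"), new Concept("C0000002")));
        System.out.println("first in iteration order: " + pickFirst(concepts).get().code);
        System.out.println("lowest code: " + pickLowestCode(concepts).get().code);
    }
}
```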
RE: Differences in MedicationMention annotations on subsequent processing runs
Hi Bruce, I would venture to say that this is neither expected nor desired. Before you fix it (or in addition to a fix), try to run with the new dictionary lookup. It will have a different behavior, and it will be the default dictionary lookup in future releases of cTakes, making fixes to the current module slightly less urgent. Sean From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Wednesday, October 08, 2014 11:38 AM To: dev@ctakes.apache.org Subject: Differences in MedicationMention annotations on subsequent processing runs I have encountered a situation in which the cTakes clinical pipeline output differs between multiple runs on the same text with the same configuration. The following snippets from a single document are sufficient to demonstrate the issue: a gentle curve going into. irrigated with Bacitracin. The source of the difference is that the DictionaryLookupAnnotator uses a map to filter out duplicate annotations for a single document location:
// used to prevent duplicate hits
// key = hit begin,end key (java.lang.String)
// val = Set of MetaDataHit objects
private Map<String,Set<MetaDataHit>> iv_dupMap = new HashMap();
This map is shared between both the umls_ms_2011ab lookup and the umls_ms_2011an_rxnorm lookup. If both dictionaries contain the same term, the order of dictionary lookup execution determines the output. If the rxnorm lookup runs first, then a MedicationMention annotation for Bacitracin appears in the final output. If the standard umls lookup runs first, then there is no MedicationMention annotation for Bacitracin. I will attach the output from the subsequent runs. (Hopefully the attachment will make it through the system) Is this expected behavior? If not, what would be the expected behavior? IMAT Solutions http://imatsolutions.com Bruce Tietjen Senior Software Engineer Mobile: 801.634.1547 bruce.tiet...@imatsolutions.com
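The order dependence Bruce describes can be reproduced with a simplified model. This is a sketch under stated assumptions, not the DictionaryLookupAnnotator code: it assumes a hit is dropped as a duplicate whenever its span key has already been seen, and the Hit class, the span key "17,27" and the type name OtherMention are hypothetical. With one filter shared by both dictionaries, whichever lookup runs first wins; with the filter keyed per dictionary, both hits survive regardless of execution order, at the cost of possible duplicate annotations.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of a duplicate-hit filter.
class DupFilterSketch {

    // Stand-in for a dictionary hit; not a cTAKES type.
    static final class Hit {
        final String dictionary;
        final String spanKey;
        final String type;

        Hit(String dictionary, String spanKey, String type) {
            this.dictionary = dictionary;
            this.spanKey = spanKey;
            this.type = type;
        }
    }

    // One filter for all dictionaries: the first hit on a span wins.
    static Set<String> sharedFilter(List<Hit> hitsInExecutionOrder) {
        Set<String> seenSpans = new HashSet<>();
        Set<String> kept = new TreeSet<>();
        for (Hit hit : hitsInExecutionOrder) {
            if (seenSpans.add(hit.spanKey)) {
                kept.add(hit.type + "@" + hit.spanKey);
            }
        }
        return kept;
    }

    // Filter keyed by dictionary and span: execution order no longer matters.
    static Set<String> perDictionaryFilter(List<Hit> hitsInExecutionOrder) {
        Set<String> seen = new HashSet<>();
        Set<String> kept = new TreeSet<>();
        for (Hit hit : hitsInExecutionOrder) {
            if (seen.add(hit.dictionary + "|" + hit.spanKey)) {
                kept.add(hit.type + "@" + hit.spanKey);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Hit rxnorm = new Hit("rxnorm", "17,27", "MedicationMention");
        Hit umls = new Hit("umls", "17,27", "OtherMention");
        System.out.println("shared, rxnorm first: " + sharedFilter(Arrays.asList(rxnorm, umls)));
        System.out.println("shared, umls first:   " + sharedFilter(Arrays.asList(umls, rxnorm)));
        System.out.println("per dictionary:       " + perDictionaryFilter(Arrays.asList(umls, rxnorm)));
    }
}
```

Keeping a shared filter while fixing the order in which the dictionaries run would be another way to get repeatable output, though it would still drop one of the two annotations.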
RE: Differences in MedicationMention annotations on subsequent processing runs
Good point ... I tried to check in to sourceforge but had problems. I will try again right now ... Building a custom dictionary is possible with the DictionaryTool in cTakes sandbox, but that is a different rabbit hole. -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Wednesday, October 08, 2014 11:52 AM To: dev@ctakes.apache.org Subject: Re: Differences in MedicationMention annotations on subsequent processing runs If I understand correctly, I would need new dictionary resources to run the rare word lookup method. Where can I find the necessary dictionary(ies) or how do I build them? [image: IMAT Solutions] http://imatsolutions.com Bruce Tietjen Senior Software Engineer [image: Mobile:] 801.634.1547 bruce.tiet...@imatsolutions.com On Wed, Oct 8, 2014 at 9:46 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Bruce, I would venture to say that this is neither expected nor desired. Before you fix it (or in addition to a fix), try to run with the new dictionary lookup. It will have a different behavior, and it will be the default dictionary lookup in future releases of cTakes – making fixes to the current module slightly less urgent. Sean *From:* Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] *Sent:* Wednesday, October 08, 2014 11:38 AM *To:* dev@ctakes.apache.org *Subject:* Differences in MedicationMention annotations on subsequent processing runs I have encountered a situation in which the cTakes clinical pipeline output differs between multiple runs on the same text with the same configuration. The following snippets from a single document are sufficient to demonstrate the issue: a gentle curve going into. irrigated with Bacitracin. 
RE: Differences in MedicationMention annotations on subsequent processing runs
Hi Bruce, With Pei's help I just updated the sourceforge repo with the cTakes dictionaries. Checkout artifact ctakes-resources-snomed-rword-hsqldb-2011ab Sean -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Wednesday, October 08, 2014 11:52 AM To: dev@ctakes.apache.org Subject: Re: Differences in MedicationMention annotations on subsequent processing runs If I understand correctly, I would need new dictionary resources to run the rare word lookup method. Where can I find the necessary dictionary(ies) or how do I build them? [image: IMAT Solutions] http://imatsolutions.com Bruce Tietjen Senior Software Engineer [image: Mobile:] 801.634.1547 bruce.tiet...@imatsolutions.com On Wed, Oct 8, 2014 at 9:46 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Bruce, I would venture to say that this is neither expected nor desired. Before you fix it (or in addition to a fix), try to run with the new dictionary lookup. It will have a different behavior, and it will be the default dictionary lookup in future releases of cTakes – making fixes to the current module slightly less urgent. Sean *From:* Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] *Sent:* Wednesday, October 08, 2014 11:38 AM *To:* dev@ctakes.apache.org *Subject:* Differences in MedicationMention annotations on subsequent processing runs I have encountered a situation in which the cTakes clinical pipeline output differs between multiple runs on the same text with the same configuration. The following snippets from a single document are sufficient to demonstrate the issue: a gentle curve going into. irrigated with Bacitracin. 
RE: Differences in MedicationMention annotations on subsequent processing runs
> DictionaryLookupAnnotator which is a container for the dictionaries and it iterates through the list of lookup dictionaries
I am confused. The new dictionary-lookup-fast has neither this class nor multiple dictionaries. The umls and rxnorm are in the same database table and lookup is performed in one swoop. Could you please send a copy of your pipeline xmls to me directly (instead of bombing the group) with something other than an .xml extension (they get blocked)? From: Bruce Tietjen [bruce.tiet...@perfectsearchcorp.com] Sent: Thursday, October 09, 2014 11:41 AM To: dev@ctakes.apache.org Subject: Re: Differences in MedicationMention annotations on subsequent processing runs I tried the dictionary-lookup-fast module and the behavior is the same. I did have to run it a number of times before the timing was right to reproduce the issue. With the older lookup, chances were about 50/50 between which dictionary ran first. Using dictionary-lookup-fast, it seems more like 70/30, with the standard umls lookup being more likely to run first. This means that most of the time, there is no MedicationMention annotation for Bacitracin. (See Attached) The code with the issue is the DictionaryLookupAnnotator, which is a container for the dictionaries and iterates through the list of lookup dictionaries, so that part of the code path does not seem to have changed. In the past, the rxNorm dictionary was a Lucene search, so I'm guessing it behaved a little differently than it does now with both being JDBC. The fact that the filter is at this location seems to indicate that it may have been intended to apply across all dictionaries. On the other hand, it appears to mask out the lookups for the different dictionaries, resulting in some annotations not being made. So, the real question is how should the filter work -- should the annotation filtering be per lookup dictionary, or be across all dictionaries?
Or is there something wrong elsewhere that causes this? I lean towards having the filter function per dictionary. This may risk having duplicate annotations, but that would probably be better than missing the annotation altogether. IMAT Solutions http://imatsolutions.com Bruce Tietjen Senior Software Engineer Mobile: 801.634.1547 bruce.tiet...@imatsolutions.com On Wed, Oct 8, 2014 at 10:02 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Bruce, With Pei's help I just updated the sourceforge repo with the cTakes dictionaries. Check out artifact ctakes-resources-snomed-rword-hsqldb-2011ab Sean -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Wednesday, October 08, 2014 11:52 AM To: dev@ctakes.apache.org Subject: Re: Differences in MedicationMention annotations on subsequent processing runs If I understand correctly, I would need new dictionary resources to run the rare word lookup method. Where can I find the necessary dictionary(ies) or how do I build them? IMAT Solutions http://imatsolutions.com Bruce Tietjen Senior Software Engineer Mobile: 801.634.1547 bruce.tiet...@imatsolutions.com On Wed, Oct 8, 2014 at 9:46 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Bruce, I would venture to say that this is neither expected nor desired. Before you fix it (or in addition to a fix), try to run with the new dictionary lookup. It will have a different behavior, and it will be the default dictionary lookup in future releases of cTakes – making fixes to the current module slightly less urgent.
Sean *From:* Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] *Sent:* Wednesday, October 08, 2014 11:38 AM *To:* dev@ctakes.apache.org *Subject:* Differences in MedicationMention annotations on subsequent processing runs I have encountered a situation in which the cTakes clinical pipeline output differs between multiple runs on the same text with the same configuration. The following snippets from a single document are sufficient to demonstrate the issue: a gentle curve going into. irrigated with Bacitracin. The source of the difference is that the DictionaryLookupAnnotator uses a map to filter out duplicate annotations for a single document location:
    // used to prevent duplicate hits
    // key = hit begin,end key (java.lang.String)
    // val = Set of MetaDataHit objects
    private Map<String, Set<MetaDataHit>> iv_dupMap = new HashMap();
This map is shared between both the umls_ms_2011ab lookup and the umls_ms_2011an_rxnorm lookup. If both dictionaries contain the same term
RE: Differences in MedicationMention annotations on subsequent processing runs
I just ran the -fast lookup with an example containing bacitracin in four sentences, once being the first word and once being the last. In ten of ten runs all four bacitracin mentions were discovered. You completely replaced the dictionary lookup with the following?
    <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
        <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
    </delegateAnalysisEngine>
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Thursday, October 09, 2014 11:42 AM To: dev@ctakes.apache.org Subject: Re: Differences in MedicationMention annotations on subsequent processing runs I tried the Dictionary-lookup-fast module and the behavior is the same. I did have to run it a number of times before the timing was right to reproduce the issue. With the older lookup, chances were about 50/50 between which dictionary ran first. Using the dictionary-fast, it seems more like 70/30 with the standard umls lookup being more likely to run first than not. Which means that most of the time, there is no MedicationMention annotation for Bacitracin. (See Attached) The code with the issue is the DictionaryLookupAnnotator which is a container for the dictionaries and it iterates through the list of lookup dictionaries so that part of the code path does not seem to have changed. In the past, the rxNorm dictionary was a Lucene search and so I'm guessing it behaved a little differently than it does now with both being JDBC. The fact that the filter is at this location seems to indicate that it may have been intended to apply across all dictionaries. On the other hand, it appears to mask out the lookups for the different dictionaries, resulting in some annotations not being made. So, the real question is how should the filter work -- should the annotation filtering be per lookup dictionary, or be across all dictionaries?
Or is there something wrong elsewhere that causes this? I lean towards having the filter function per dictionary. This may risk having duplicate annotations, but that would probably be better than missing the annotation altogether. IMAT Solutions http://imatsolutions.com Bruce Tietjen Senior Software Engineer Mobile: 801.634.1547 bruce.tiet...@imatsolutions.com On Wed, Oct 8, 2014 at 10:02 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Bruce, With Pei's help I just updated the sourceforge repo with the cTakes dictionaries. Check out artifact ctakes-resources-snomed-rword-hsqldb-2011ab Sean -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Wednesday, October 08, 2014 11:52 AM To: dev@ctakes.apache.org Subject: Re: Differences in MedicationMention annotations on subsequent processing runs If I understand correctly, I would need new dictionary resources to run the rare word lookup method. Where can I find the necessary dictionary(ies) or how do I build them? IMAT Solutions http://imatsolutions.com Bruce Tietjen Senior Software Engineer Mobile: 801.634.1547 bruce.tiet...@imatsolutions.com On Wed, Oct 8, 2014 at 9:46 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Bruce, I would venture to say that this is neither expected nor desired. Before you fix it (or in addition to a fix), try to run with the new dictionary lookup. It will have a different behavior, and it will be the default dictionary lookup in future releases of cTakes – making fixes to the current module slightly less urgent.
Sean *From:* Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] *Sent:* Wednesday, October 08, 2014 11:38 AM *To:* dev@ctakes.apache.org *Subject:* Differences in MedicationMention annotations on subsequent processing runs I have encountered a situation in which the cTakes clinical pipeline output differs between multiple runs on the same text with the same configuration. The following snippets from a single document are sufficient to demonstrate the issue: a gentle curve going into. irrigated with Bacitracin. The source of the difference is that the DictionaryLookupAnnotator uses a map to filter out duplicate annotations for a single document location:
    // used to prevent duplicate hits
    // key = hit begin,end key (java.lang.String)
    // val = Set of MetaDataHit objects
    private Map<String, Set<MetaDataHit>> iv_dupMap = new HashMap();
This map is shared between both the umls_ms_2011ab lookup and the umls_ms_2011an_rxnorm lookup. If both dictionaries contain the same term, the order of dictionary lookup execution determines the output. If
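For readers following the thread, here is a minimal sketch of the behavior Bruce describes: a duplicate filter keyed on "begin,end" that is either shared across dictionaries or reset for each one. Nothing in it is cTakes code. The class name, the `annotate` and `order` helpers, the dictionary labels, and the span "15,25" are all made up for illustration, and lookup order is passed in explicitly rather than depending on timing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from cTakes.
class DupFilterSketch {

    // Visits the dictionaries in map order. When sharedFilter is true, one set
    // of "begin,end" span keys is shared by all dictionaries, so the first
    // dictionary to hit a span suppresses later hits on that same span.
    static List<String> annotate(Map<String, List<String>> hitsByDictionary,
                                 boolean sharedFilter) {
        List<String> annotations = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (Map.Entry<String, List<String>> dict : hitsByDictionary.entrySet()) {
            if (!sharedFilter) {
                seen = new HashSet<>(); // fresh filter for each dictionary
            }
            for (String spanKey : dict.getValue()) {
                if (seen.add(spanKey)) { // add() returns false for a duplicate key
                    annotations.add(dict.getKey() + "@" + spanKey);
                }
            }
        }
        return annotations;
    }

    // Two dictionaries that both contain a term at the same (made-up) span.
    static Map<String, List<String>> order(String first, String second) {
        Map<String, List<String>> hits = new LinkedHashMap<>();
        hits.put(first, Arrays.asList("15,25"));
        hits.put(second, Arrays.asList("15,25"));
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(annotate(order("umls", "rxnorm"), true));  // [umls@15,25]
        System.out.println(annotate(order("rxnorm", "umls"), true));  // [rxnorm@15,25]
        System.out.println(annotate(order("umls", "rxnorm"), false)); // [umls@15,25, rxnorm@15,25]
    }
}
```

With the shared filter the surviving annotation depends only on which dictionary ran first. With a per-dictionary filter both hits survive regardless of order, at the cost of a possible duplicate, which is the trade-off Bruce mentions.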
RE: Need information regarding cTakes changes
Hi Chandu, For your note #2: 2)Any new features that can be added to current version of cTakes project to make it more useful. You can always check (or add to) the Jira future enhancement page at: https://issues.apache.org/jira/browse/CTAKES/fixforversion/12323040/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel Sean -Original Message- From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] Sent: Monday, October 20, 2014 2:40 PM To: dev@ctakes.apache.org Subject: Re: Need information regarding cTakes changes On 10/17/2014 05:23 AM, sarath chandra Reddy wrote: Hi, I am not proposing any changes, as I did not have much knowledge about the cTakes project code. I am requesting the persons who are currently working on the development of cTakes next version.I need their help in answering the questions mentioned in previous mail. 1) Any possible improvements that can be made to current cTakes version to improve its efficiency ?. Like code-level and design level changes. Well, the new fast dictionary module should solve one of the biggest issues, the bottleneck of the dictionary lookup. Beyond that, it would be nice to decrease the memory footprint of the dependency parser. 2)Any new features that can be added to current version of cTakes project to make it more useful. Using UIMA-AS allows for scaleout, in combination with the fast dictionary can allow very fast processing. Maybe it's not a feature per se, and maybe it will come from an outside project, but I think infrastructure that makes it easy to get a highly parallel and very fast version of ctakes up and running would be a nice addition. (By the way, that's just one interesting example that came to mind, not necessarily the most important or highest priority!) Tim I humbly request the developers to provide me information regarding these. 
Regards, Chandu On Thu, Oct 16, 2014 at 8:31 PM, Chen, Pei pei.c...@childrens.harvard.edu wrote: Chandu, Could you describe what types of changes you are proposing? We'll welcome any contributions. Sent from my iPhone On Oct 16, 2014, at 5:21 PM, sarath chandra Reddy jscredd...@gmail.com wrote: Hi, I am doing research work on cTakes. I request the developers working on the development of cTakes project to answer the following questions. Connect me with the right persons. I need three major possible improvements to the current cTakes design. Also three new features that can be added to the current cTakes project. I am waiting for your responses. Thank you in advance. Regards, Chandu
RE: Announcement: UMLS MedGen-MySQL dataset now available as open access download
Hi Andy, Great stuff! I think that I understand the method, but I have a question about the statement: the content is publicly available per the NCBI policy and license for MedGen sources Does this mean that I, Joe Anybody, could download the content, place some of the content in a database structured in my own fashion, package the -new- database, and include it in a cTakes distribution? Or, does it mean that content downloaded by script is usable as-is and only as-is? The whole if I'd known your were going to do that I wouldn't have given it to you ... Thanks, Sean From: andy mcmurry [mcmurry.a...@gmail.com] Sent: Thursday, November 13, 2014 6:59 PM To: dev@ctakes.apache.org Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download Pei: Yes, specifically: The source code was released by Invitae under Apache ASL 2.0 per my request and with full blessing from our legal counsel and software team. I also reviewed in principle the idea with John Wilbanks of Sage Bionetworks (and formerly creative commons). This is legit, or I wouldn't have spent tons of hours doing it. The raw content is a set of scripts which wget a list of URLS from the NCBI public FTP repositories. This code DOES NOT redistribute any content whatsoever, just a list of URLs to download, unzip, and insert into a local mysql database. To repeat: I am NOT circulating any content, just URL links -- you must download the content yourself. And that is the beauty -- all content is downloaded BY THE USER and the content is publicly available per the NCBI policy and license for MedGen sources. On Thu, Nov 13, 2014 at 11:18 AM, Chen, Pei pei.c...@childrens.harvard.edu wrote: John- I believe that was the thinking. Andy- Just to confirm- Is the raw content of this dataset released under ASL2.0? i.e. 
can you contribute it as a CSV or similar so that cTAKES may re-tokenize it using the same PTB rules, format it for cTAKES' dictionary lookup, etc., and then redistribute it under the same License. -Original Message- From: John Green [mailto:john.travis.gr...@gmail.com] Sent: Thursday, November 13, 2014 1:55 PM To: dev@ctakes.apache.org Cc: dev@ctakes.apache.org Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download The old licensed setup would be kept as a packaged option? Much as it is now With the unlicensed going out in place of the current free dictionary? Am I understanding that right? JG — Sent from Mailbox On Thu, Nov 13, 2014 at 1:40 PM, andy mcmurry mcmurry.a...@gmail.com wrote: I'll crunch the numbers -- in the meantime I can tell you that phenotypes vary by semantic type. clinical attributes from SNOMED are abundant, many concepts in mesh that are mapped to diseases. Tons of pharmacological substances On Nov 12, 2014 6:19 AM, Dligach, Dmitriy dmitriy.dlig...@childrens.harvard.edu wrote: Andy, thank you for this resource! Do you have an estimate of what percentage of UMLS concepts were left out? Dima On Nov 11, 2014, at 16:02, andy mcmurry mcmurry.a...@gmail.com wrote: Hello! https://bitbucket.org/invitae/medgen-mysql (Apache Licensed ASL2) We just released a new library containing a huge chunk of UMLS concepts which are available without registering accounts/username/passwords. LEGALLY. Yes, really! The subset is from NCBI and it contains *thousands of concepts from SNOMED and other vocabularies*. The code is essentially 1. a list of WGET targets to various NCBI FTP site mirrors 2. Makefile for building the databases of interest Our legal team has approved distribution for Open Access work, ASL2 LICENSE. 
I recommend we use this opportunity to make this the default distribution for CTAKES UMLS connections, because it obviates the need for so much painful credentialing and back and forth agreements with the US National Library of Medicine. Cheers! --Andy On Wed, Sep 10, 2014 at 12:13 PM, Masanz, James J. masanz.ja...@mayo.edu wrote: I would love to see the install be as simple as apt-get install to end up with some working dictionary that have more than a handful of entries to get them started. Regards, James Masanz -Original Message- From: andy mcmurry [mailto:mcmurry.a...@gmail.com] Sent: Tuesday, September 09, 2014 4:32 PM To: ctakes-...@incubator.apache.org Subject: Recommendation for ctakes default (UMLS) dictionaries Greetings ctakes-dev: *UMLS license restrictions have been getting more lax over the years -- *much of the UMLS can be downloaded directly from the NCBI official FTP site. In fact, the NIH (and implicitly the NLM) *have already made the standard terms
RE: Asking help for always unsuccessful AE load
Hi Jun, Do AE pipelines that do not use the Smoking Status module work? I think that Smoking Status configuration (via binary install) might be broken in the last several versions. I thought that I had submitted a Jira long, long ago, but right now I can't find it so maybe my memory is playing games. I have gotten the module to work, but it took hours to find and fix the problems. If you can get other AEs to run then let me know and I'll try to find my working setup and diff it with the cTakes install tomorrow. If I remember correctly I had to move (unpack) some things from lib/ jars to resources/ and change a path or two in the desc/ xmls. Sean From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] Sent: Wednesday, December 03, 2014 10:52 AM To: dev@ctakes.apache.org Subject: Re: Asking help for always unsuccessful AE load Hi Jun, I know this has been a problem in some versions. What version are you using? Could you try this out on the latest release candidate to see if it is still an issue? Thanks, IMAT Solutions http://imatsolutions.com Kim Ebert Software Engineer Office: 801.669.7342 kim.eb...@imatsolutions.com On 12/02/2014 08:28 PM, Ying, Jun wrote: Dear Sir/Madam, When I load some AE in cTakes like SimulatedProdSmokingTAE.xml, it always throws the exception java.lang.IllegalArgumentException: URI is not hierarchical. Why does it happen? How can I fix it? Thanks. The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail.
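As background on the exception text itself, the following sketch shows one common way plain Java produces "URI is not hierarchical": passing a `jar:` URI (a resource still packed inside a jar) to `new File(URI)`. That would fit Sean's recollection of unpacking things from lib/ jars into resources/, but the thread does not confirm that this is what the Smoking Status module does. The class name, helper, and both paths are invented for illustration.

```java
import java.io.File;
import java.net.URI;

// Hypothetical sketch; the paths below are placeholders, not cTakes locations.
class JarUriSketch {

    // Returns the file path, or the exception text if the URI cannot be a File.
    static String tryOpenAsFile(URI uri) {
        try {
            return new File(uri).getPath();
        } catch (IllegalArgumentException e) {
            return "IllegalArgumentException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A resource unpacked onto disk: a hierarchical file: URI, accepted by File.
        URI onDisk = URI.create("file:/tmp/resources/model.bin");
        // The same resource left inside a jar: an opaque jar: URI, rejected by File.
        URI inJar = URI.create("jar:file:/tmp/lib/example.jar!/model.bin");

        System.out.println(tryOpenAsFile(onDisk));
        System.out.println(tryOpenAsFile(inJar)); // IllegalArgumentException: URI is not hierarchical
    }
}
```

If this is the cause, the usual remedies are either to unpack the resource so it has a `file:` URI, or to read it as a stream instead of a `File`.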
RE: Scaling cTakes
Hi Brandon, It sounds like you've got a decent pipeline set up. To increase the speed you could try swapping out use of ctakes-dictionary-lookup with ctakes-dictionary-lookup-fast in the AE. Check ctakes-clinical-pipeline/desc/[ae]/AggregatePlaintextFastUMLSProcessor.xml for an example. As for the CASPool, I don't think that it will make any difference for cTakes. Sean From: Geise, Brandon D. [bdge...@geisinger.edu] Sent: Friday, December 05, 2014 12:40 PM To: dev@ctakes.apache.org Subject: Scaling cTakes Hi, I'm new to cTakes and the UIMA framework. I've read most of the UIMA documentation and was able to take the BagofCUIGenerator example and modify to read notes from a DB, process using the UMLS AE in the clinical-pipeline using a local DB version of UMLS, and output the CUIs to a DB. However, the problem I'm having is it's extremely slow; ~3.5-4 notes a minute. I was hoping I could get some hints or advice on speeding the process up. I read there's a patch for LVG, but wasn't quite sure how to implement. Also from testing using the CPE GUI, I don't notice any different in processing time by adjusting the CASPool setting. Some advice on the CASPool would be appreciated also. Thanks, Brandon IMPORTANT WARNING: The information in this message (and the documents attached to it, if any) is confidential and may be legally privileged. It is intended solely for the addressee. Access to this message by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken, or omitted to be taken, in reliance on it is prohibited and may be unlawful. If you have received this message in error, please delete all electronic copies of this message (and the documents attached to it, if any), destroy any hard copies you may have created and notify me immediately by replying to this email. Thank you. 
Geisinger Health System utilizes an encryption process to safeguard Protected Health Information and other confidential data contained in external e-mail messages. If email is encrypted, the recipient will receive an e-mail instructing them to sign on to the Geisinger Health System Secure E-mail Message Center to retrieve the encrypted e-mail.
RE: Scaling cTakes
Hi Brandon, You are welcome. I was hoping that you'd get the note processing time down to under a second with the different lookup, but I guess not. I think that any optimization from here really depends upon what information you want to extract from the notes. Sean From: Geise, Brandon D. [bdge...@geisinger.edu] Sent: Tuesday, December 09, 2014 9:13 AM To: dev@ctakes.apache.org Subject: RE: Scaling cTakes Thanks again Sean for the advice. Just by changing the pipeline to use the fast dictionary led to quadrupling the processing speed. Any other suggestions on performance tuning would be great! Thanks, Brandon -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Friday, December 05, 2014 1:14 PM To: dev@ctakes.apache.org Subject: RE: Scaling cTakes Hi Brandon, It sounds like you've got a decent pipeline set up. To increase the speed you could try swapping out use of ctakes-dictionary-lookup with ctakes-dictionary-lookup-fast in the AE. Check ctakes-clinical-pipeline/desc/[ae]/AggregatePlaintextFastUMLSProcessor.xml for an example. As for the CASPool, I don't think that it will make any difference for cTakes. Sean From: Geise, Brandon D. [bdge...@geisinger.edu] Sent: Friday, December 05, 2014 12:40 PM To: dev@ctakes.apache.org Subject: Scaling cTakes Hi, I'm new to cTakes and the UIMA framework. I've read most of the UIMA documentation and was able to take the BagofCUIGenerator example and modify to read notes from a DB, process using the UMLS AE in the clinical-pipeline using a local DB version of UMLS, and output the CUIs to a DB. However, the problem I'm having is it's extremely slow; ~3.5-4 notes a minute. I was hoping I could get some hints or advice on speeding the process up. I read there's a patch for LVG, but wasn't quite sure how to implement. Also from testing using the CPE GUI, I don't notice any different in processing time by adjusting the CASPool setting. Some advice on the CASPool would be appreciated also. 
Thanks, Brandon
RE: revamping the Apache cTAKES website
Anyway, a pretty amazing fresh start, thanks Pei -Original Message- From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] Sent: Monday, December 15, 2014 4:33 PM To: dev@ctakes.apache.org Subject: RE: revamping the Apache cTAKES website Check out a mockup of a new website proposal: http://svn.apache.org/repos/asf/ctakes/site/new/index.html Based off bootstrap (Idea borrowed from the Spark folks..). Couple of key pieces of info: - 10% of visitors are on mobile/tablets - The most currently visited pages are: downloads.cgi, gettingstarted.html. I suggest we focus our attention on those 2 items. (Putting a Downloads link right on the front page, etc.) svn co http://svn.apache.org/repos/asf/ctakes/site/new if you want to checkout the code of the site. --Pei -Original Message- From: John Green [mailto:john.travis.gr...@gmail.com] Sent: Friday, December 05, 2014 6:34 PM To: dev@ctakes.apache.org Cc: dev@ctakes.apache.org Subject: RE: revamping the Apache cTAKES website I would like to second the bootstrap recommendation, with the additional recommendation of django for the backend. It is an amazing platform for rapid development and easy updating. JG — Sent from Mailbox On Fri, Dec 5, 2014 at 12:15 PM, Savova, Guergana guergana.sav...@childrens.harvard.edu wrote: There are now 4 volunteers: Michelle Chen Pei Chen Sean Finan Guergana Savova --Guergana -Original Message- From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu] Sent: Friday, December 05, 2014 11:56 AM To: dev@ctakes.apache.org Subject: RE: revamping the Apache cTAKES website Wonderful, thank you, Michelle! There will be a flurry of emails the week of Dec 15 followed by actual work, so book your calendar if possible... 
--Guergana -Original Message- From: Michelle Chen [mailto:michelle1919c...@gmail.com] Sent: Friday, December 05, 2014 11:48 AM To: dev@ctakes.apache.org Subject: Re: revamping the Apache cTAKES website Hello Guergana, I don't know that much about cTakes, but would be interested in contributing to the effort. I'm not sure if there is an interest in matching the website design of other Apache projects, but it seems that the two main designs that are being used from my arbitrary search on http://projects.apache.org/indexes/alpha.html is 1. the current design that cTakes is using and 2. a Bootstrap approach. I've done a little bit of work on Bootstrap and would be interested in helping with that. Let me know how I can be helpful. Sincerely, Michelle Chen :) Be strong and of good courage; do not be afraid, nor be dismayed, for the Lord your God is with you wherever you go. ~Joshua 1:9 On Fri, Dec 5, 2014 at 11:21 AM, Savova, Guergana guergana.sav...@childrens.harvard.edu wrote: cTAKES-ers, we would like to start working on updating the Apache cTAKES website - some of the information there is already stale and needs refreshing. Do you have ideas on website design, content, etc.? Would you like to contribute to the effort? We are planning to start working on the website the week of Dec 15. Cheers, --Guergana
RE: Problem running cTakes-clinical pipeline -- AggregatePlaintextFastUMLSProcessor.xml
Hi Yu, "Also do you know is there any command line I can run to annotate like a thousand files automatically rather than copy and paste." You could try the CPE gui: bin/runctakesCPE.sh Sean From: Liang, Yu [mailto:yu.li...@nyumc.org] Sent: Monday, December 15, 2014 4:51 PM To: dev@ctakes.apache.org Subject: Problem running cTakes-clinical pipeline -- AggregatePlaintextFastUMLSProcessor.xml Hi Yu, I think this is a current limitation in cTAKES. I think it has to do with negation not detecting if the line breaks are separating the sentences. Would you mind forwarding the example to dev@ctakes.apache.org? I think Tim and others may be working on this issue. --Pei On Mon, Dec 15, 2014 at 3:54 PM, Liang, Yu yu.li...@nyumc.org wrote: On Dec 15, 2014, at 2:58 PM, Liang, Yu yu.li...@nyumc.org wrote: Hi Pei Chen, Could you please look at the following example I ran? I think the result is not accurate. The polarity of illness is -1, but the polarities of fever, vomiting, diarrhea, and pain are all +1. Also do you know is there any command line I can run to annotate like a thousand files automatically rather than copy and paste. Yu Liang Yu Liang CHIBI
RE: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link
Hi John, Look for an Upload button in the upper-left corner next to a blue Sign in button. Sean -Original Message- From: John Green [mailto:john.travis.gr...@gmail.com] Sent: Tuesday, December 16, 2014 11:12 AM To: dev@ctakes.apache.org Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link That is, how do we upload videos *to the channel. * On Tue, Dec 16, 2014 at 11:09 AM, John Green john.travis.gr...@gmail.com wrote: How do we upload videos we wish to contribute? I dont have any experience with youtube other than as a watcher. JG On Mon, Dec 15, 2014 at 11:43 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hmmm, I can't find it in a search. However, here is a direct link: https://www.youtube.com/channel/UC8hQoOKz3v4PNEf6cqSkjbQ Maybe it needs a few videos to register in the search engine ? Sean -Original Message- From: Pei Chen [mailto:chen...@apache.org] Sent: Monday, December 15, 2014 11:32 AM To: dev@ctakes.apache.org Subject: Re: intro video and ctakes youtube John, I presume you this thread: http://mail-archives.apache.org/mod_mbox/ctakes-dev/201408.mbox/%3C39 3252f14c42f946952f1ed75d316cad39158...@chexmbx4a.chboston.org%3E Strange, I couldn't find it anymore either... The place holder could have been auto deleted because it was empty? I think it's worth it if you're willing to create and add to it again... ---Pei On Fri, Dec 12, 2014 at 11:46 PM, John Green john.travis.gr...@gmail.com wrote: I was going to post some basic how to videos that help with the learning curve I've walked over the last year and a half. I went looking for ctakes youtube channel mentioned awhile back and I did not find it... Anyone know where it went? Best, JG
RE: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link
Hmmm, well this is a ticker: http://www.ampercent.com/upload-videos-youtube-channel-without-knowing-username-password/9374/ -Original Message- From: John Green [mailto:john.travis.gr...@gmail.com] Sent: Wednesday, December 17, 2014 2:08 PM To: dev@ctakes.apache.org Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link Isnt this to upload for my account? What about to the channel? On Tue, Dec 16, 2014 at 12:16 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi John, Look for an Upload button in the upper-left corner next to a blue Sign in button. Sean -Original Message- From: John Green [mailto:john.travis.gr...@gmail.com] Sent: Tuesday, December 16, 2014 11:12 AM To: dev@ctakes.apache.org Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link That is, how do we upload videos *to the channel. * On Tue, Dec 16, 2014 at 11:09 AM, John Green john.travis.gr...@gmail.com wrote: How do we upload videos we wish to contribute? I dont have any experience with youtube other than as a watcher. JG On Mon, Dec 15, 2014 at 11:43 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hmmm, I can't find it in a search. However, here is a direct link: https://www.youtube.com/channel/UC8hQoOKz3v4PNEf6cqSkjbQ Maybe it needs a few videos to register in the search engine ? Sean -Original Message- From: Pei Chen [mailto:chen...@apache.org] Sent: Monday, December 15, 2014 11:32 AM To: dev@ctakes.apache.org Subject: Re: intro video and ctakes youtube John, I presume you this thread: http://mail-archives.apache.org/mod_mbox/ctakes-dev/201408.mbox/%3C 39 3252f14c42f946952f1ed75d316cad39158...@chexmbx4a.chboston.org%3E Strange, I couldn't find it anymore either... The place holder could have been auto deleted because it was empty? I think it's worth it if you're willing to create and add to it again... 
---Pei On Fri, Dec 12, 2014 at 11:46 PM, John Green john.travis.gr...@gmail.com wrote: I was going to post some basic how to videos that help with the learning curve I've walked over the last year and a half. I went looking for ctakes youtube channel mentioned awhile back and I did not find it... Anyone know where it went? Best, JG
RE: cTakes Annotation Comparison
One quick mention: The cTakes dictionaries are built with UMLS 2011AB. If the Human annotations were not done using the same UMLS version then there WILL be differences in CUI and Semantic group. I don't have time to go into it with details, examples, etc. just be aware that every 6 months cuis are added, removed, deprecated, and moved from one TUI to another. Sean -Original Message- From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu] Sent: Friday, December 19, 2014 1:28 PM To: dev@ctakes.apache.org Subject: RE: cTakes Annotation Comparison Several thoughts: 1. The ShARE corpus annotates only mentions of type Diseases/Disorders and only Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures, Medications and Anatomical Sites. Therefore you will get MANY more annotations with cTAKES. Eventually the ShARe corpus will be expanded to the other types. 2. Keeping (1) in mind, you can approximately estimate the precision/recall/f1 of cTAKES on the ShARe corpus if you output only mentions of type Disease/Disorder. 3. Could you send us the list of files you use from ShARe to test? We have the corpus and would like to run against as well. Hope this makes sense... --Guergana -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Friday, December 19, 2014 1:16 PM To: dev@ctakes.apache.org Subject: Re: cTakes Annotation Comparison Our analysis against the human adjudicated gold standard from this SHARE corpus is using a simple check to see if the cTakes output included the annotation specified by the gold standard. The initial results I reported were for exact matches of CUI and text span. Only exact matches were counted. 
It looks like if we also count as matches cTakes annotations with a matching CUI and a text span that overlaps the gold standard text span then the matches increase to 224 matching annotations for the FastUMLS pipeline and 2319 for the old pipeline. The question was also asked about annotations in the cTakes output that were not in the human adjudicated gold standard. The answer is yes, there were a lot of additional annotations made by cTakes that don't appear to be in the gold standard. We haven't analyzed that yet, but it looks like the gold standard we are using may only have Disease_Disorder annotations. [image: IMAT Solutions] http://imatsolutions.com Bruce Tietjen Senior Software Engineer [image: Mobile:] 801.634.1547 bruce.tiet...@imatsolutions.com On Fri, Dec 19, 2014 at 9:54 AM, Miller, Timothy timothy.mil...@childrens.harvard.edu wrote: Thanks Kim, This sounds interesting though I don't totally understand it. Are you saying that extraction performance for a given note depends on which order the note was in the processing queue? If so that's pretty bad! If you (or anyone else who understands this issue) has a concrete example I think that might help me understand what the problem is/was. Even though, as Pei mentioned, we are going to try moving the community to the faster dictionary, I would like to understand better just to help myself avoid issues of this type going forward (and verify the new dictionary doesn't use similar logic). Also, when we finish annotating the sample notes, might we use that as a point of comparison for the two dictionaries? That would get around the issue that not everyone has access to the datasets we used for validation and others are likely not able to share theirs either. And maybe we can replicate the notes if we want to simulate the scenario Kim is talking about with thousands or more notes. 
Tim On 12/19/2014 10:24 AM, Kim Ebert wrote: Guergana, I'm curious to the number of records that are in your gold standard sets, or if your gold standard set was run through a long running cTAKES process. I know at some point we fixed a bug in the old dictionary lookup that caused the permutations to become corrupted over time. Typically this isn't seen in the first few records, but over time as patterns are used the permutations would become corrupted. This caused documents that were fed through cTAKES more than once to have less codes returned than the first time. For example, if a permutation of 4,2,3,1 was found, the permutation would be corrupted to be 1,2,3,4. It would no longer be possible to detect permutations of 4,2,3,1 until cTAKES was restarted. We got the fix in after the cTAKES 3.2.0 release. https://issues.apache.org/jira/browse/CTAKES-310 Depending upon the corpus size, I could see the permutation engine eventually only have a single permutation of 1,2,3,4. Typically though, this isn't very easily detected in the first 100 or so documents. We discovered this issue when we made cTAKES have consistent output of codes in our system. [IMAT
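The two counting rules Bruce describes earlier in this message chain (an exact match on CUI and text span, versus a matching CUI with an overlapping text span) can be sketched roughly as below. This is only an illustration under stated assumptions: the `SpanMatching` and `Mention` classes, their fields, and the sample CUIs are made up here and are not taken from cTAKES or from the IMAT analysis tool. Offsets are assumed to be begin-inclusive and end-exclusive.

```java
import java.util.Arrays;
import java.util.List;

final class SpanMatching {

    /** Hypothetical annotation: a CUI plus begin (inclusive) and end (exclusive) offsets. */
    static final class Mention {
        final String cui;
        final int begin;
        final int end;

        Mention(String cui, int begin, int end) {
            this.cui = cui;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Same CUI and identical span. */
    static boolean exactMatch(Mention gold, Mention system) {
        return gold.cui.equals(system.cui)
                && gold.begin == system.begin
                && gold.end == system.end;
    }

    /** Same CUI and spans that share at least one character. */
    static boolean overlapMatch(Mention gold, Mention system) {
        return gold.cui.equals(system.cui)
                && gold.begin < system.end
                && system.begin < gold.end;
    }

    /** Counts gold annotations that have at least one matching system annotation. */
    static int countMatched(List<Mention> gold, List<Mention> system, boolean allowOverlap) {
        int matched = 0;
        for (Mention g : gold) {
            for (Mention s : system) {
                if (allowOverlap ? overlapMatch(g, s) : exactMatch(g, s)) {
                    matched++;
                    break; // count each gold annotation at most once
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Made-up CUIs and offsets, for illustration only.
        List<Mention> gold = Arrays.asList(
                new Mention("C0000001", 10, 25),
                new Mention("C0000002", 40, 50));
        List<Mention> system = Arrays.asList(
                new Mention("C0000001", 10, 25),
                new Mention("C0000002", 44, 50),
                new Mention("C0000003", 60, 70));
        System.out.println("exact:   " + countMatched(gold, system, false));
        System.out.println("overlap: " + countMatched(gold, system, true));
    }
}
```

With the sample data the exact rule matches one gold annotation and the overlap rule matches two. For non-empty spans every exact match also satisfies the overlap test, so the overlap count can never be lower than the exact count on the same output.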
RE: cTakes Annotation Comparison
I’m bringing it up in case the Human Annotations were done using a different version. From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] Sent: Friday, December 19, 2014 1:40 PM To: dev@ctakes.apache.org Subject: Re: cTakes Annotation Comparison Sean, I don't think that would be an issue since both the rare word lookup and the first word lookup are using UMLS 2011AB. Or is the rare word lookup using a different dictionary? I would expect roughly similar results between the two when it comes to differences between UMLS versions. [IMAT Solutions] http://imatsolutions.com Kim Ebert Software Engineer [Office:] 801.669.7342 kim.eb...@imatsolutions.com On 12/19/2014 11:31 AM, Finan, Sean wrote: One quick mention: The cTakes dictionaries are built with UMLS 2011AB. If the Human annotations were not done using the same UMLS version then there WILL be differences in CUI and Semantic group. I don't have time to go into it with details, examples, etc. just be aware that every 6 months cuis are added, removed, deprecated, and moved from one TUI to another. Sean -Original Message- From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu] Sent: Friday, December 19, 2014 1:28 PM To: dev@ctakes.apache.org Subject: RE: cTakes Annotation Comparison Several thoughts: 1. The ShARE corpus annotates only mentions of type Diseases/Disorders and only Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures, Medications and Anatomical Sites. Therefore you will get MANY more annotations with cTAKES. Eventually the ShARe corpus will be expanded to the other types. 2. Keeping (1) in mind, you can approximately estimate the precision/recall/f1 of cTAKES on the ShARe corpus if you output only mentions of type Disease/Disorder. 3. Could you send us the list of files you use from ShARe to test? 
We have the corpus and would like to run against as well. Hope this makes sense... --Guergana -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Friday, December 19, 2014 1:16 PM To: dev@ctakes.apache.org Subject: Re: cTakes Annotation Comparison Our analysis against the human adjudicated gold standard from this SHARE corpus is using a simple check to see if the cTakes output included the annotation specified by the gold standard. The initial results I reported were for exact matches of CUI and text span. Only exact matches were counted. It looks like if we also count as matches cTakes annotations with a matching CUI and a text span that overlaps the gold standard text span then the matches increase to 224 matching annotations for the FastUMLS pipeline and 2319 for the old pipeline. The question was also asked about annotations in the cTakes output that were not in the human adjudicated gold standard. The answer is yes, there were a lot of additional annotations made by cTakes that don't appear to be in the gold standard. We haven't analyzed that yet, but it looks like the gold standard we are using may only have Disease_Disorder annotations. [image: IMAT Solutions] http://imatsolutions.com Bruce Tietjen Senior Software Engineer [image: Mobile:] 801.634.1547 bruce.tiet...@imatsolutions.com On Fri, Dec 19, 2014 at 9:54 AM, Miller, Timothy timothy.mil...@childrens.harvard.edu wrote: Thanks Kim, This sounds interesting though I don't totally understand it. Are you saying that extraction performance for a given note depends on which order the note was in the processing queue? If so that's pretty bad! If you (or anyone else who understands this issue) has a concrete example I think that might help me understand what the problem is/was. 
Even though, as Pei mentioned, we are going to try moving the community to the faster dictionary, I would like to understand better just to help myself avoid issues of this type going forward (and verify the new dictionary doesn't use similar logic). Also, when we finish annotating the sample notes, might we use that as a point of comparison for the two dictionaries? That would get around the issue that not everyone has access to the datasets we used for validation and others are likely not able to share theirs either. And maybe we can replicate the notes if we want to simulate the scenario Kim is talking about with thousands or more notes. Tim On 12/19/2014 10:24 AM, Kim Ebert wrote: Guergana, I'm curious to the number of records that are in your gold standard sets, or if your gold standard set was run through a long running cTAKES process. I know at some point we fixed a bug in the old dictionary lookup that caused
RE: cTakes Annotation Comparison
Hi Bruce, I'm not sure how there would be fewer matches with the overlap processor. There should be all of the matches from the non-overlap processor plus those from the overlap. Decreasing from 215 to 211 is strange. Have you done any manual spot checks on this? It is really bizarre that you'd only have two matches per document (100 docs?). Thanks, Sean -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Friday, December 19, 2014 3:23 PM To: dev@ctakes.apache.org Subject: Re: cTakes Annotation Comparison Sean, I tried the configuration changes you mentioned in your earlier email. The results are as follows: Total Annotations found: 12,161 (default configuration found 8,284) If counting exact span matches, this run only matched 211 (default configuration matched 215). If counting overlapping spans, this run only matched 220 (default configuration matched 224) Bruce [image: IMAT Solutions] http://imatsolutions.com Bruce Tietjen Senior Software Engineer [image: Mobile:] 801.634.1547 bruce.tiet...@imatsolutions.com On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei pei.c...@childrens.harvard.edu wrote: Kim, Maintenance is the factor not bugs/issue to forge ahead. They are 2 components that do the same thing with the same goal (As Sean mentioned, one should be able configure the new code base to replicate the old algorithm if required- it’s just a simpler and cleaner code base. If this is not the case or if there are issues, we should fix it and move forward.). We can keep the old component around for as long as needed, but it’s likely going to have limited support… --Pei *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com] *Sent:* Friday, December 19, 2014 1:47 PM *To:* Chen, Pei; dev@ctakes.apache.org *Subject:* Re: cTakes Annotation Comparison Pei, I don't think bugs/issues should be part of determining if one algorithm vs the other is superior. 
Obviously, it is worth mentioning the bugs, but if the fast lookup method has worse precision and recall but better performance, vs the slower but more accurate first word lookup algorithm, then time should be invested in fixing those bugs and resolving those weird issues. Now I'm not saying which one is superior in this case, as the data will end up speaking for itself one way or the other; but as of right now, I'm not convinced that the old dictionary lookup is obsolete yet, and I'm not sure the community is convinced yet either. [image: IMAT Solutions] http://imatsolutions.com *Kim Ebert* Software Engineer [image: Office:]801.669.7342 kim.eb...@imatsolutions.com greg.hub...@imatsolutions.com On 12/19/2014 08:39 AM, Chen, Pei wrote: Also check out stats that Sean ran before releasing the new component on: http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx From the evaluation and experience, the new lookup algorithm should be a huge improvement in terms of both speed and accuracy. This is very different than what Bruce mentioned… I’m sure Sean will chime here. (The old dictionary lookup is essentially obsolete now- plagued with bugs/issues as you mentioned.) --Pei *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] *Sent:* Friday, December 19, 2014 10:25 AM *To:* dev@ctakes.apache.org *Subject:* Re: cTakes Annotation Comparison Guergana, I'm curious to the number of records that are in your gold standard sets, or if your gold standard set was run through a long running cTAKES process. I know at some point we fixed a bug in the old dictionary lookup that caused the permutations to become corrupted over time. Typically this isn't seen in the first few records, but over time as patterns are used the permutations would become corrupted. This caused documents that were fed through cTAKES more than once to have less codes returned than the first time. 
For example, if a permutation of 4,2,3,1 was found, the permutation would be corrupted to be 1,2,3,4. It would no longer be possible to detect permutations of 4,2,3,1 until cTAKES was restarted. We got the fix in after the cTAKES 3.2.0 release. https://issues.apache.org/jira/browse/CTAKES-310 Depending upon the corpus size, I could see the permutation engine eventually only have a single permutation of 1,2,3,4. Typically though, this isn't very easily detected in the first 100 or so documents. We discovered this issue when we made cTAKES have consistent output of codes in our system. [image: IMAT Solutions] http://imatsolutions.com *Kim Ebert* Software Engineer [image: Office:]801.669.7342 kim.eb...@imatsolutions.com greg.hub...@imatsolutions.com On 12/19/2014 07:05 AM, Savova, Guergana wrote: We are doing a similar kind of evaluation and will report the results. Before we released the Fast lookup, we did
RE: cTakes Annotation Comparison
Hi Bruce, Correction -- So far, I did steps 1 and 2 of Sean's email. No problem. Aside from recreating the database, those two steps have the greatest impact. But before you change anything else, please do some manual spot checks. I have never seen a case where the lookup would be so horribly inaccurate. Thanks -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Friday, December 19, 2014 3:29 PM To: dev@ctakes.apache.org Subject: Re: cTakes Annotation Comparison Correction -- So far, I did steps 1 and 2 of Sean's email. [image: IMAT Solutions] http://imatsolutions.com Bruce Tietjen Senior Software Engineer [image: Mobile:] 801.634.1547 bruce.tiet...@imatsolutions.com On Fri, Dec 19, 2014 at 1:22 PM, Bruce Tietjen bruce.tiet...@perfectsearchcorp.com wrote: Sean, I tried the configuration changes you mentioned in your earlier email. The results are as follows: Total Annotations found: 12,161 (default configuration found 8,284) If counting exact span matches, this run only matched 211 (default configuration matched 215). If counting overlapping spans, this run only matched 220 (default configuration matched 224) Bruce [image: IMAT Solutions] http://imatsolutions.com Bruce Tietjen Senior Software Engineer [image: Mobile:] 801.634.1547 bruce.tiet...@imatsolutions.com On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei pei.c...@childrens.harvard.edu wrote: Kim, Maintenance is the factor not bugs/issue to forge ahead. They are 2 components that do the same thing with the same goal (As Sean mentioned, one should be able configure the new code base to replicate the old algorithm if required- it’s just a simpler and cleaner code base. If this is not the case or if there are issues, we should fix it and move forward.). 
We can keep the old component around for as long as needed, but it’s likely going to have limited support… --Pei *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com] *Sent:* Friday, December 19, 2014 1:47 PM *To:* Chen, Pei; dev@ctakes.apache.org *Subject:* Re: cTakes Annotation Comparison Pei, I don't think bugs/issues should be part of determining if one algorithm vs the other is superior. Obviously, it is worth mentioning the bugs, but if the fast lookup method has worse precision and recall but better performance, vs the slower but more accurate first word lookup algorithm, then time should be invested in fixing those bugs and resolving those weird issues. Now I'm not saying which one is superior in this case, as the data will end up speaking for itself one way or the other; but as of right now, I'm not convinced that the old dictionary lookup is obsolete yet, and I'm not sure the community is convinced yet either. [image: IMAT Solutions] http://imatsolutions.com *Kim Ebert* Software Engineer [image: Office:]801.669.7342 kim.eb...@imatsolutions.com greg.hub...@imatsolutions.com On 12/19/2014 08:39 AM, Chen, Pei wrote: Also check out stats that Sean ran before releasing the new component on: http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx From the evaluation and experience, the new lookup algorithm should be a huge improvement in terms of both speed and accuracy. This is very different than what Bruce mentioned… I’m sure Sean will chime here. (The old dictionary lookup is essentially obsolete now- plagued with bugs/issues as you mentioned.) 
--Pei *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com kim.eb...@perfectsearchcorp.com] *Sent:* Friday, December 19, 2014 10:25 AM *To:* dev@ctakes.apache.org *Subject:* Re: cTakes Annotation Comparison Guergana, I'm curious to the number of records that are in your gold standard sets, or if your gold standard set was run through a long running cTAKES process. I know at some point we fixed a bug in the old dictionary lookup that caused the permutations to become corrupted over time. Typically this isn't seen in the first few records, but over time as patterns are used the permutations would become corrupted. This caused documents that were fed through cTAKES more than once to have less codes returned than the first time. For example, if a permutation of 4,2,3,1 was found, the permutation would be corrupted to be 1,2,3,4. It would no longer be possible to detect permutations of 4,2,3,1 until cTAKES was restarted. We got the fix in after the cTAKES 3.2.0 release. https://issues.apache.org/jira/browse/CTAKES-310 Depending upon the corpus size, I could see the permutation engine eventually only have a single permutation of 1,2,3,4. Typically though, this isn't very easily detected in the first 100 or so documents. We discovered this issue when we made cTAKES have consistent output of codes in our system. [image: IMAT Solutions] http://imatsolutions.com *Kim Ebert*
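For readers who, like Tim, wanted a concrete picture of the symptom Kim describes: one generic way a permutation such as 4,2,3,1 can silently turn into 1,2,3,4 for the rest of a run is when code sorts a cached list in place instead of sorting a copy. The sketch below is NOT the cTAKES code and is not claimed to be the actual cause fixed in CTAKES-310; the class and method names are hypothetical and only demonstrate the general failure mode of shared mutable state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SharedPermutationDemo {

    /** Builds a cache of permutations that would be reused for every document. */
    static List<List<Integer>> newCache() {
        List<List<Integer>> cache = new ArrayList<>();
        cache.add(new ArrayList<>(Arrays.asList(4, 2, 3, 1)));
        cache.add(new ArrayList<>(Arrays.asList(1, 2, 3, 4)));
        return cache;
    }

    /** Buggy pattern: sorts the cached list itself, so the cache entry changes. */
    static List<Integer> sortInPlace(List<Integer> cached) {
        Collections.sort(cached);
        return cached;
    }

    /** Safe pattern: sorts a defensive copy and leaves the cache entry untouched. */
    static List<Integer> sortCopy(List<Integer> cached) {
        List<Integer> copy = new ArrayList<>(cached);
        Collections.sort(copy);
        return copy;
    }

    /** True if the cache still holds the given ordering. */
    static boolean canDetect(List<List<Integer>> cache, Integer... order) {
        return cache.contains(Arrays.asList(order));
    }

    public static void main(String[] args) {
        List<List<Integer>> cache = newCache();
        System.out.println("before: " + canDetect(cache, 4, 2, 3, 1));
        sortInPlace(cache.get(0)); // first "document" that uses this permutation
        System.out.println("after:  " + canDetect(cache, 4, 2, 3, 1));
    }
}
```

In this toy version the ordering 4,2,3,1 is present before the first use and gone afterwards, and only rebuilding the cache (the analogue of restarting the process) brings it back. That would also be consistent with the effect being invisible on the first few documents and growing over a long run.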
RE: cTakes Annotation Comparison --- (^:
Apologies accepted. I'm really glad that you found the problem. So what you are saying is (just to be very very clear to everybody reading this thread): FastUMLSProcessor found 2795 matches (2,842 including overlaps) While UMLSProcessor found 2632 matches (2,735 including overlaps) --- So recall is BETTER in the fast lookup And... FastUMLSProcessor found 30,716 annotations While UMLSProcessor found 31,598 annotations --- So precision is also looking BETTER in the fast lookup Now maybe there will be a little more buy-in for the fast lookup. Cheers, Sean -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Friday, December 19, 2014 5:05 PM To: dev@ctakes.apache.org Subject: Re: cTakes Annotation Comparison My apologies to Sean and everyone, I am happy to report that I found a bug in our analysis tools that was missing the last FSArray entry for any FSArray list. With the bug fixed, the results look MUCH better. UMLSProcessor found 31,598 annotations FastUMLSProcessor found 30,716 annotations There were 23,522 annotations that were exact matches between the two. When comparing with the gold standard annotations (4591 annotations): UMLSProcessor found 2632 matches (2,735 including overlaps) FastUMLSProcessor found 2795 matches (2,842 including overlaps) [image: IMAT Solutions] http://imatsolutions.com Bruce Tietjen Senior Software Engineer [image: Mobile:] 801.634.1547 bruce.tiet...@imatsolutions.com On Fri, Dec 19, 2014 at 1:49 PM, Bruce Tietjen bruce.tiet...@perfectsearchcorp.com wrote: I'll do that -- there is always a possibility of bugs in the analysis tool. [image: IMAT Solutions] http://imatsolutions.com Bruce Tietjen Senior Software Engineer [image: Mobile:] 801.634.1547 bruce.tiet...@imatsolutions.com On Fri, Dec 19, 2014 at 1:39 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Sorry, I meant “Do some spot checks on the validity”. 
In other words, when your script reports that a cui and/or span is missing, manually look at the data and see if it really is. Just open up one .xmi in the CVD and see what it looks like. Thanks, Sean *From:* Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] *Sent:* Friday, December 19, 2014 3:37 PM *To:* dev@ctakes.apache.org *Subject:* Re: cTakes Annotation Comparison My original results were using a newly downloaded cTakes 3.2.1 with the separately downloaded resources copied in. There were no changes to any of the configuration files. As far as this last run, I modified the UMLSLookupAnnotator.xml and AggregatePlaintextFastUMLSProcessor.xml. I've attached the modified ones I used (but they may not get through the mailing list). [image: Image removed by sender. IMAT Solutions] http://imatsolutions.com *Bruce Tietjen* Senior Software Engineer [image: Image removed by sender. Mobile:]801.634.1547 bruce.tiet...@imatsolutions.com On Fri, Dec 19, 2014 at 1:27 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Bruce, I'm not sure how there would be fewer matches with the overlap processor. There should be all of the matches from the non-overlap processor plus those from the overlap. Decreasing from 215 to 211 is strange. Have you done any manual spot checks on this? It is really bizarre that you'd only have two matches per document (100 docs?). Thanks, Sean -Original Message- From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] Sent: Friday, December 19, 2014 3:23 PM To: dev@ctakes.apache.org Subject: Re: cTakes Annotation Comparison Sean, I tried the configuration changes you mentioned in your earlier email. The results are as follows: Total Annotations found: 12,161 (default configuration found 8,284) If counting exact span matches, this run only matched 211 (default configuration matched 215). 
If counting overlapping spans, this run only matched 220 (default configuration matched 224) Bruce [image: IMAT Solutions] http://imatsolutions.com Bruce Tietjen Senior Software Engineer [image: Mobile:] 801.634.1547 bruce.tiet...@imatsolutions.com On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei pei.c...@childrens.harvard.edu wrote: Kim, Maintenance is the factor not bugs/issue to forge ahead. They are 2 components that do the same thing with the same goal (As Sean mentioned, one should be able configure the new code base to replicate the old algorithm if required- it’s just a simpler and cleaner code base. If this is not the case or if there are issues, we should fix it and move forward.). We can keep the old component around for as long as needed, but it’s likely going to have limited support… --Pei *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com] *Sent:* Friday, December 19, 2014 1:47 PM *To:* Chen, Pei; dev@ctakes.apache.org *Subject:* Re: cTakes Annotation Comparison
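Sean's reading of Bruce's corrected counts can be made concrete with the usual definitions: recall = matches / gold annotations, and precision = matches / system annotations. The sketch below plugs in the exact-match figures reported in this thread (4591 gold annotations; 2632 of 31,598 for UMLSProcessor; 2795 of 30,716 for FastUMLSProcessor). The class and method names are made up for illustration. Keep in mind Guergana's earlier point that the gold standard covers only Disease/Disorder mentions while cTAKES also outputs other types, so a precision computed over all system annotations is probably an underestimate; the comparison between the two pipelines is the informative part.

```java
final class LookupScores {

    /** Fraction of gold annotations that the system found. */
    static double recall(int matches, int goldTotal) {
        return (double) matches / goldTotal;
    }

    /** Fraction of system annotations that appear in the gold standard. */
    static double precision(int matches, int systemTotal) {
        return (double) matches / systemTotal;
    }

    /** Harmonic mean of precision and recall; 0 when both are 0. */
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    static void report(String name, int matches, int systemTotal, int goldTotal) {
        double p = precision(matches, systemTotal);
        double r = recall(matches, goldTotal);
        System.out.printf("%-18s precision=%.4f recall=%.4f f1=%.4f%n", name, p, r, f1(p, r));
    }

    public static void main(String[] args) {
        int gold = 4591; // gold standard annotations reported in the thread
        report("UMLSProcessor", 2632, 31598, gold);
        report("FastUMLSProcessor", 2795, 30716, gold);
    }
}
```

On these figures the fast lookup has more matches from fewer total annotations, so both ratios come out higher for it than for the old lookup, which is the conclusion Sean draws.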
RE: Using cTakes programmatically
Hi Maite Meseure, Check the cTakes User guide on UMLS setup: https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide#cTAKES3.2UserInstallGuide-(Recommended)AddUMLSaccessrights which (in part) points you towards obtaining a license to use the NIH UMLS dictionary: https://uts.nlm.nih.gov/license.html Sean From: Maite Meseure Hugues [meseure.ma...@gmail.com] Sent: Monday, December 29, 2014 4:17 PM To: dev@ctakes.apache.org Subject: Using cTakes programmatically Dear all, I allow myself to contact you in order to ask you how I can simply add cTAKES packages in my java code to get the same output than the XML output from the CPE (using clinical-pipeline/ test_plaintext.xml as descriptor). I've explored and tested the cTakes example ( using ClinicalPipelineFactory.getDefaultPipeline() ) but I've got this error message: [...] https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser: maitemeseure Exception in thread main org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator failed. (Descriptor: unknown) Thanks a lot for your time. Best regards -- -- Maïté Meseure Hugues
RE: Question about the pipeline
Hi Tol (and Maite), I'm not entirely certain that I understand the question, but here is an attempt to help. If I'm oversimplifying then I apologize. I think that ExampleAggregatePipeline is intended to represent a very simple single-note pipeline and that custom code could be produced by using it as an example. If you want to process texts in a directory, you can find with a web search plenty of ways to list files in a directory and read text from files. org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader might be what you used in the CPE, and you can certainly peruse the code and take what you need. Or, if you decide to write a simple diy, here is one possibility:

   import java.io.File;
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.util.ArrayList;
   import java.util.Collection;

   // Collects the readable entries of a directory.
   static public Collection<File> getFilesInDir( final File directory ) {
      final Collection<File> fileList = new ArrayList<>();
      final File[] files = directory.listFiles();
      if ( files == null ) {
         System.err.println( "please check the directory " + directory.getAbsolutePath() );
         System.exit( 1 );
      }
      for ( final File file : files ) {
         if ( file.canRead() ) {
            fileList.add( file );
         }
      }
      return fileList;
   }

   // Reads the whole file as one String -- or handle the IOException herein
   static public String getTextInFile( final File file ) throws IOException {
      final Path nioPath = file.toPath();
      return new String( Files.readAllBytes( nioPath ) );
   }

   static public void main( String ... args ) throws IOException {
      if ( args.length == 0 || args[0].isEmpty() ) {
         System.out.println( "Enter a directory path" );
         System.exit( 0 );
      }
      final Collection<File> files = getFilesInDir( new File( args[0] ) );
      for ( File file : files ) {
         final String note = getTextInFile( file );
         // --- Insert here code a' la ExampleAggregatePipeline ---
         // --- swap out the writer in ExampleAggregatePipeline with CasIOUtil method (below) ---
      }
   }

I must admit that I have never directly used it, but there is an xmi file writing method in org.apache.uima.fit.util.CasIOUtil named writeXmi( JCas jCas, File file ). You could give this a try and see if it produces the type of output that you want. The same utility class has a writeXCas(..) method. 
If the above has absolutely nothing to do with your needs then please send me a bulleted list of items, example workflow, etc. and I'll see if I can be of service. Oh, and I wrote the above code freehand, so MS Outlook is adding capital letters, etc. If you cut and paste you'll need to change that - plus I haven't run/compiled, so there might be a typo or missed exception or something. Or it may not work (in which case I'll throw in a little more effort). Sean -Original Message- From: Tol O. [mailto:tol...@gmail.com] Sent: Monday, February 02, 2015 6:56 PM To: dev@ctakes.apache.org Subject: Re: Question about the pipeline Maite Meseure Hugues meseure.maite@... writes: Hello all, Thank you for your preceding answers. I have a few questions regarding the pipeline example to run cTakes programmatically. I am running ExampleAggregatePipeline.java with ExampleHelloWorldAnnotator but I would like to know how I can change it to run my data, as the CPE where we can choose the directory of our data. My second question is about the xml output generated with the CPE, can I get the same xml output in using the example pipeline? and How? Thanks for your time. I would like to ask the same question. After successfully setting up CTAKES following the Developers Guide I would also like to use a modified ExampleAggregatePipeline to output a CAS file identical to the output obtained by the CPE or the CVD when following the Users Guide. This would be a great help for developers as a starting class to be able to programmatically obtain an annotated file based on a plaintext or XML input, same as through the two GUIs. Right now I am reading through the Component Use Guide to replicate the CPE or the CVD tutorial with the test input, but it is a bit overwhelming. Any pointers or suggestions would be really appreciated. Tol O.
RE: Question about the pipeline
Hi Maite, RunCPE is a good find, and if it fits your bill then you should use it. But it (if you mean the yTex class) doesn't take input and output directories from the command line. It does take the path to a CPE.xml file. There is a cTakes (non-yTex) equivalent named CmdLineCpeRunner. Either one of them should print a usage if you run it without arguments. As the CmdLineCpeRunner indicates, you can create a cpe .xml file with the cpe gui. Basically, start the cpe gui, select your input (reader), output (writer) and pipeline (ae) in the gui and then save the cpe descriptor (via the menubar). You can exit the gui and run either one of the cmd line utilities with the path to that cpe .xml descriptor as the argument. Please note: sometimes you have to explicitly type .xml in the filename when saving with the cpe gui. If you run with the cpe gui and then exit it should automatically ask you if you want to save the cpe .xml descriptor. Anyway, once you have the .xml file you can always edit the input and output paths in that file to change your run parameters. Sean -Original Message- From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] Sent: Tuesday, February 03, 2015 9:01 AM To: dev@ctakes.apache.org Subject: Re: Question about the pipeline Thanks a lot Sean for your detailed reply. I've also found RunCPE.java that allows to put the input and output directories in arguments in the environment and do the same job than the CPE-GUI -at least in Eclipse, I haven't managed to run it via the command line yet. On Mon, Feb 2, 2015 at 7:12 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Tol (and Maite), I'm not entirely certain that I understand the question, but here is an attempt to help. If I'm oversimplifying then I apologize. I think that ExampleAggregatePipeline is intended to represent a very simple single-note pipeline and that custom code could be produced by using it as an example. 
If you want to process texts in a directory, you can find with a web search plenty of ways to list files in a directory and read text from files. org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader might be what you used in the CPE, and you can certainly peruse the code and take what you need. Or, if you decide to write a simple diy, here is one possibility:

   import java.io.File;
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.util.ArrayList;
   import java.util.Collection;

   // Collects the readable entries of a directory.
   static public Collection<File> getFilesInDir( final File directory ) {
      final Collection<File> fileList = new ArrayList<>();
      final File[] files = directory.listFiles();
      if ( files == null ) {
         System.err.println( "please check the directory " + directory.getAbsolutePath() );
         System.exit( 1 );
      }
      for ( final File file : files ) {
         if ( file.canRead() ) {
            fileList.add( file );
         }
      }
      return fileList;
   }

   // Reads the whole file as one String -- or handle the IOException herein
   static public String getTextInFile( final File file ) throws IOException {
      final Path nioPath = file.toPath();
      return new String( Files.readAllBytes( nioPath ) );
   }

   static public void main( String ... args ) throws IOException {
      if ( args.length == 0 || args[0].isEmpty() ) {
         System.out.println( "Enter a directory path" );
         System.exit( 0 );
      }
      final Collection<File> files = getFilesInDir( new File( args[0] ) );
      for ( File file : files ) {
         final String note = getTextInFile( file );
         // --- Insert here code a' la ExampleAggregatePipeline ---
         // --- swap out the writer in ExampleAggregatePipeline with CasIOUtil method (below) ---
      }
   }

I must admit that I have never directly used it, but there is an xmi file writing method in org.apache.uima.fit.util.CasIOUtil named writeXmi( JCas jCas, File file ). You could give this a try and see if it produces the type of output that you want. The same utility class has a writeXCas(..) method. If the above has absolutely nothing to do with your needs then please send me a bulleted list of items, example workflow, etc. and I'll see if I can be of service. Oh, and I wrote the above code freehand, so MS Outlook is adding capital letters, etc. 
If you cut and paste you'll need to change that - plus I haven't run/compiled, so there might be a typo or missed exception or something. Or it may not work (in which case I'll throw in a little more effort). Sean -Original Message- From: Tol O. [mailto:tol...@gmail.com] Sent: Monday, February 02, 2015 6:56 PM To: dev@ctakes.apache.org Subject: Re: Question about the pipeline Maite Meseure Hugues meseure.maite@... writes: Hello all, Thank you for your preceding answers. I have a few questions regarding the pipeline example to run cTakes programmatically. I am running ExampleAggregatePipeline.java with ExampleHelloWorldAnnotator but I would like to know how I can change it to run my data, as the CPE where we can choose the directory of our data. My second question is about the xml output generated with the CPE, can I get
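Sean's remark that you can edit the input and output paths in a saved CPE .xml descriptor suggests keeping one descriptor as a template and filling in the paths per run. A minimal sketch of that idea follows. Everything specific in it is an assumption: the `@INPUT_DIR@` and `@OUTPUT_DIR@` placeholder tokens are hypothetical markers you would put into your own copy of the descriptor by hand, and the one-line XML string is a stand-in, not the real CPE descriptor schema.

```java
final class DescriptorTemplate {

    /** Escapes the characters that would break XML element content. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Replaces the hypothetical placeholder tokens with the given directories. */
    static String fill(String template, String inputDir, String outputDir) {
        return template
                .replace("@INPUT_DIR@", escapeXml(inputDir))
                .replace("@OUTPUT_DIR@", escapeXml(outputDir));
    }

    public static void main(String[] args) {
        // Stand-in fragment only; a real descriptor saved from the CPE GUI is much longer.
        String template =
                "<settings><input>@INPUT_DIR@</input><output>@OUTPUT_DIR@</output></settings>";
        System.out.println(fill(template, "/data/notes/batch1", "/data/xmi/batch1"));
    }
}
```

In practice the template text would be read from the saved descriptor file and the filled-in result written to a new file, whose path is then passed to the command-line runner.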
RE: Question about the pipeline
Hi Maite, Without more information I can't venture a guess as to a cause of the error. If RunCPE works then why not use that? They are practically identical. Sean From: Maite Meseure Hugues [meseure.ma...@gmail.com] Sent: Thursday, February 05, 2015 8:51 AM To: dev@ctakes.apache.org Subject: Re: Question about the pipeline I see. In my case, I am using the CPE descriptor saved from the GUI for CmdLineCpeRunner as said Sean. I've selected AggregatePlaintextProcessor.xml as AE but I have this error: Couldn't initialize processing engine. Initialization of CAS Processor with name AggregatePlaintextProcessor failed. Meanwhile, RunCPE.java works properly with the same descriptor in Eclipse. Does anyone have an idea? On Wed, Feb 4, 2015 at 12:56 PM, Lingren, Todd todd.ling...@cchmc.org wrote: Hi Maite, For each patient in my list, I create a new FilesToFiles CPE xml using some sed commands on the template original. Specifically, here's the command line argument (I'm on linux). CTAKES_HOME=... java -cp $CTAKES_HOME/lib/*:$CTAKES_HOME/desc/:$CTAKES_HOME/resources/ -Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms512M -Xmx2048M CmdLineCpeRunner FilesToFiles_patient_cui.xml outputfile.txt I don't think it matters, but I'm using the cTAKES 3.1.0 version. Todd Lingren Biomedical Informatics Cincinnati Children’s Hospital todd.ling...@cchmc.org 513-803-9032 -Original Message- From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] Sent: Wednesday, February 04, 2015 12:59 PM To: dev@ctakes.apache.org Subject: Re: Question about the pipeline Interesting, Todd thank you and how do you use CMdLineCpeRunner basically? 
Because I tested in cmd line with:

    java org.apache.ctakes.core.cpe.CmdLineCpeRunner [path-to-my-cpe.xml]

but here is what I've got:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/uima/util/InvalidXMLException
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
        at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
        at java.lang.Class.getMethod0(Class.java:3010)
        at java.lang.Class.getMethod(Class.java:1776)
        at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
        ...

On Wed, Feb 4, 2015 at 8:32 AM, Lingren, Todd todd.ling...@cchmc.org wrote:

Sean and Maite, FWIW, I use CmdLineCpeRunner frequently. I employ it with a bash script to automatically create a new xml file based on the subfolder names contained in the target directory. So in our HPC, it spawns a new job for each subfolder (which may have between 5 and 2500 notes).

Todd Lingren
Biomedical Informatics
Cincinnati Children’s Hospital
todd.ling...@cchmc.org
513-803-9032

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Tuesday, February 03, 2015 2:47 PM
To: dev@ctakes.apache.org
Subject: RE: Question about the pipeline

Hi Maite, RunCPE is a good find, and if it fits your bill then you should use it. But it (if you mean the yTex class) doesn't take input and output directories from the command line. It does take the path to a CPE.xml file. There is a cTakes (non-yTex) equivalent named CmdLineCpeRunner. Either one of them should print a usage if you run it without arguments.

As the CmdLineCpeRunner indicates, you can create a cpe .xml file with the cpe gui. Basically, start the cpe gui, select your input (reader), output (writer) and pipeline (ae) in the gui and then save the cpe descriptor (via the menubar). You can exit the gui and run either one of the cmd line utilities with the path to that cpe .xml descriptor as the argument.

Please note: sometimes you have to explicitly type .xml in the filename when saving with the cpe gui. If you run with the cpe gui and then exit it should automatically ask you if you want to save the cpe .xml descriptor. Anyway, once you have the .xml file you can always edit the input and output paths in that file to change your run parameters.

Sean

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com]
Sent: Tuesday, February 03, 2015 9:01 AM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

Thanks a lot Sean for your detailed reply. I've also found RunCPE.java, which allows the input and output directories to be passed as arguments in the environment and does the same job as the CPE GUI, at least in Eclipse; I haven't managed to run it via the command line yet.

On Mon, Feb 2, 2015 at 7:12 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote:

Hi Tol (and Maite), I'm not entirely certain that I understand the question, but here is an attempt to help. If I'm oversimplifying then I apologize. I think
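A note on the NoClassDefFoundError quoted above: it names a UIMA class, which suggests that the UIMA jars were not on the classpath when java was started without -cp (Todd's command passes them explicitly). One cTAKES-independent way to check is to try loading the class named in the error. This is only a minimal sketch, and the class name ClasspathCheck is made up for the example.

```java
// Hypothetical helper; not part of cTAKES or UIMA.
final class ClasspathCheck {

    // True if the named class can be located on the current classpath.
    static boolean isOnClasspath(final String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(final String... args) {
        // Default class name taken from the stack trace in this thread.
        final String name = args.length > 0
                ? args[0]
                : "org.apache.uima.util.InvalidXMLException";
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found")
                + " with java.class.path=" + System.getProperty("java.class.path"));
    }
}
```

Running it with the same -cp value as the failing command shows whether that classpath actually reaches the UIMA jars.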
RE: git mirrors out of sync?
Hi Steve, You are right (confirming your finding) - it looks like the first is a no-show and the second is somebody's personal upload to github (not git.apache.org) from 3 years ago. The jira claims that the item was closed (fixed), but if you go to http://git.apache.org/ cTakes is not listed. Was it there previous to 6 days ago but removed? If nobody responds with a "here's yer problem" by end of week then I (or you, if you like) will ping infra. I know that at least one contributor (not me) prefers to use git.

Sean

-Original Message-
From: Steven Bethard [mailto:steven.beth...@gmail.com]
Sent: Tuesday, February 03, 2015 3:38 PM
To: dev@ctakes.apache.org
Subject: git mirrors out of sync?

The git mirrors for cTAKES seem to be either broken (http://git.apache.org/ctakes.git) or embarrassingly out of sync (https://github.com/apache/ctakes). Is this a known issue? I looked at the INFRA ticket [1], but didn't see anything that suggested that there should be a problem.

Steve

[1] https://issues.apache.org/jira/browse/INFRA-8553
RE: Question about the pipeline
Hi Tol,

> Essentially, I want to know how to set up the cTAKES objects correctly into a pipeline in a Java program, so that medical texts are annotated, like the GUI is doing. I would really appreciate any hints on how to accomplish this.

Looking at your embedded code I think that you've got the general idea of how to do everything. Perhaps you are wondering how to create custom pipelines by programmatically adding chosen processors? Tim Miller made a great addition (imo) to the cTakes code with the org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory class. Perhaps you can take a look at that and see if it helps?

Sean

-Original Message-
From: Tol O. [mailto:tol...@gmail.com]
Sent: Tuesday, February 03, 2015 7:35 PM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

Sean, Thank you for the detailed reply. As you mentioned, I had to revert the capital letters from your Outlook. Also, if somebody else wants to use the code and cannot get it to run: the getFilesInDir method needs to return the populated Collection<File> fileList, the variable final File[] fileList and its usage should be renamed to something else (as the variable name already exists), and the main method needs to throw an IOException. I think these were all the changes I made so that the txt files from a folder are added to the collection, many thanks again.

What I am looking to do is also what the description in ExampleAggregatePipeline says: running a pipeline programmatically without uima xml descriptor files. As I understand it, this is accomplished by the uimaFIT classes, so that AEs can be defined in Java, added to a pipeline and directly run. The uimaFIT page gives a nice Java snippet that uses uimaFIT in a similar way to the cTAKES example; I pasted the few Java lines below at [1].

http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#ugr.tools.uimafit.introduction

I would like to use cTAKES in my own Java programs such that, just like in ExampleAggregatePipeline, uimaFIT can be used to create and run a cTAKES pipeline to annotate medical texts. Then, I could also output the result in CAS files, just like the CVD GUI is doing. This would allow me to add or modify my own AnalysisEngines directly.

Essentially, I want to know how to set up the cTAKES objects correctly into a pipeline in a Java program, so that medical texts are annotated, like the GUI is doing. I would really appreciate any hints on how to accomplish this.

Following your code example to read the files, the outlined idea is:

    for ( File file : files ) {
        Final String note = getTextInFile( file );
        JCas jCas = JCasFactory.createJCas();
        jCas.setDocumentText(note);
        // 1. create the AnalysisEngines for tokenizer, tagger and other cTAKES components etc. to annotate medical texts
        // 2. runPipeline(jCas, ...);
    }

[1] The code snippet from uimaFIT:

    JCas jCas = JCasFactory.createJCas();
    jCas.setDocumentText("some text");
    AnalysisEngine tokenizer = createEngine(MyTokenizer.class);
    AnalysisEngine tagger = createEngine(MyTagger.class);
    runPipeline(jCas, tokenizer, tagger);
    for(Token token : iterate(jCas, Token.class)){
        System.out.println(token.getTag());
    }

Tol O.

Finan, Sean Sean.Finan@... writes:

Hi Tol (and Maite), I'm not entirely certain that I understand the question, but here is an attempt to help. If I'm oversimplifying then I apologize. I think that ExampleAggregatePipeline is intended to represent a very simple single-note pipeline and that custom code could be produced by using it as an example.
If you want to process texts in a directory, you can find with a web search plenty of ways to list files in a directory and read text from files. org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader might be what you used in the CPE, and you can certainly peruse the code and take what you need. Or, if you decide to write a simple diy, here is one possibility: Static public CollectionFile getFilesInDir( final File directory ) { final CollectionFile fileList = new ArrayList(); final File[] fileList = directory.listFiles(); if ( fileList == null ) { System.err.println( please check the directory + directory.getAbsolutePath() ); System.exit( 1 ); } for ( final File file : directory.listFiles() ) { if ( file.canRead() ) { fileList.add( file ); } } } Static public String getTextInFile( final File file ) throws IOException { -- or handle ioE herein final Path nioPath = file.toPath(); return new String( Files.readAllBytes( nioPath ) ); } Static public void main( String ... args ) { If ( args[0
RE: BagOfCuisGenerator.java, same idea for getConceptText()
Try something like the following for output:

    private int extractFeatures( final IdentifiedAnnotation annotation ) {
        // Extract the IdentifiedAnnotation itself
        final Collection<String> umlsInfos = getUmlsInfos( annotation );
        if ( umlsInfos == null ) {
            return 0;
        }
        final int begin = annotation.getBegin();
        final int end = annotation.getEnd();
        final String annotationText = annotation.getCoveredText();
        final int polarity = annotation.getPolarity();
        int count = 0;
        for ( String umlsInfo : umlsInfos ) {
            saveAnnotation( annotationText, umlsInfo, polarity, begin, end );
            count++;
        }
        return count;
    }

    static private Collection<String> getUmlsInfos( final IdentifiedAnnotation identifiedAnnotation ) {
        final FSArray fsArray = identifiedAnnotation.getOntologyConceptArr();
        if ( fsArray == null ) {
            return Collections.emptySet();
        }
        final FeatureStructure[] featureStructures = fsArray.toArray();
        final Set<String> umlsInfos = new HashSet<String>( featureStructures.length );
        for ( FeatureStructure featureStructure : featureStructures ) {
            final OntologyConcept ontologyConcept = (OntologyConcept) featureStructure;
            String info = null;
            if ( ontologyConcept instanceof UmlsConcept ) {
                final UmlsConcept umlsConcept = (UmlsConcept) ontologyConcept;
                // cui, then "_tui" and the quoted preferred text when they are available
                info = umlsConcept.getCui();
                final String tui = umlsConcept.getTui();
                if ( tui != null && !tui.isEmpty() ) {
                    info += "_" + tui;
                }
                final String preferredText = umlsConcept.getPreferredText();
                if ( preferredText != null && !preferredText.isEmpty() ) {
                    info += " = \"" + preferredText + "\"";
                }
                umlsInfos.add( info );
            }
        }
        return umlsInfos;
    }

    public void saveAnnotation( final String spannedText, final String umlsInfo, final int polarity, final int begin, final int end ) {
        // one output line per concept: span offsets, "-" when negated, umls info, covered text
        final String text = begin + "," + end + " " + (polarity < 0 ? "-" : "") + umlsInfo + " " + spannedText;
        if ( _writer == null ) {
            System.out.println( text );
            return;
        }
        try {
            _writer.write( text );
            _writer.newLine();
        } catch ( IOException ioE ) {
            logger.error( ioE.getMessage() );
        }
    }

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com]
Sent: Thursday, February 12, 2015 2:46 PM
To: dev@ctakes.apache.org
Subject: BagOfCuisGenerator.java, same idea for getConceptText()

Hi everyone, I am currently working on BagOfCuisGenerator, and I would like to add the concept text to the output. I've seen some discussions about getting the original text and UMLS preferred text in addition to the cui. Can someone give me pointers to do that? Thanks in advance for your time.

Maite
--
Maïté Meseure Hugues
RE: BagOfCuisGenerator.java, same idea for getConceptText()
Oh yeah - use the -fast dictionary to get preferred text. The fastest way to get cuis only is with CuisOnlyPlaintextUMLSProcessor. If you want polarity make sure you uncomment the section with PolarityCleartkAnalysisEngine. Sean -Original Message- From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] Sent: Thursday, February 12, 2015 2:46 PM To: dev@ctakes.apache.org Subject: BagOfCuisGenerator.java, same idea for getConceptText() Hi everyone, I am currently working on BagOfCuisGenerator, and I would like to add the concept text to the output. I 've seen some discussions about getting the original text and UMLS preferred text in addition to the cui. Can someone give me pointers to do that? Thanks in advance for your time. Maite -- -- Maïté Meseure Hugues
RE: BagOfCuisGenerator.java, same idea for getConceptText()
Hi Maite, I just checked the log and it looks like you'll need to use a copy of cTakes built after 12/08/2014 to get Snomed codes. Sean -Original Message- From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] Sent: Monday, February 16, 2015 12:19 PM To: dev@ctakes.apache.org Subject: Re: BagOfCuisGenerator.java, same idea for getConceptText() Sean, I have a question, is it because I am using fast dictionary I don't get snomed-oid or snomed-code? Instead, it's snomed_oid: null#CTAKES. Thank you. Maite On Fri, Feb 13, 2015 at 1:32 PM, Maite Meseure Hugues meseure.ma...@gmail.com wrote: Thank you for your replies, It's helpful. I was working on 3.2.0 version, so it looks like 3.2.1 allows to get the UMLS preferred text. Maite On Thu, Feb 12, 2015 at 2:25 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Oh yeah - use the -fast dictionary to get preferred text. The fastest way to get cuis only is with CuisOnlyPlaintextUMLSProcessor. If you want polarity make sure you uncomment the section with PolarityCleartkAnalysisEngine. Sean -Original Message- From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] Sent: Thursday, February 12, 2015 2:46 PM To: dev@ctakes.apache.org Subject: BagOfCuisGenerator.java, same idea for getConceptText() Hi everyone, I am currently working on BagOfCuisGenerator, and I would like to add the concept text to the output. I 've seen some discussions about getting the original text and UMLS preferred text in addition to the cui. Can someone give me pointers to do that? Thanks in advance for your time. Maite -- -- Maïté Meseure Hugues -- -- Maïté Meseure Hugues -- -- Maïté Meseure Hugues
RE: CTAKES mirroring on github.
Our request is for a read-only mirror. However, if it ever becomes i/o, I don't know if this will have what you want, but
http://git.apache.org/
Links to documentation (mostly server setup)
http://www.apache.org/dev/git.html
and a wiki (check toward middle and bottom for committer info)
https://wiki.apache.org/general/GitAtApache

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Tuesday, February 17, 2015 12:31 PM
To: dev@ctakes.apache.org
Subject: Re: CTAKES mirroring on github.

Is there any existing resource to help people who want to use git understand the right workflow to contribute to ctakes? (i.e. how this interacts with svn repos).

Tim

On 02/17/2015 12:23 PM, jay vyas wrote:

Hi CTakes. Looks like infra finally got onto the JIRA i made for this a while back. They are currently working on fixing a couple of minor glitches w/ the mirroring (not showing all commits)... but there now is a mirror for CTakes on github.

https://github.com/apache/ctakes
RE: Question about fast pipeline
Hi Michelle, Did your error have only "Could not find . as absolute" or did it also have "or in ... or in ..."? If you see "... or in ..." then this is a new issue. If you don't, then you should update your source. If you need to run the release binary then let me know and I can work out sending you a patch.

Sean

-Original Message-
From: michelle1919c...@gmail.com [mailto:michelle1919c...@gmail.com] On Behalf Of Michelle Chen
Sent: Monday, January 12, 2015 4:30 PM
To: dev@ctakes.apache.org
Subject: Question about fast pipeline

I'm fairly new to using cTAKES and was trying to figure out how to use the fast pipeline in my Java code. I was able to run the code in Clinical Pipeline Factory with both the default Pipeline and the fast Pipeline. However, when I tried incorporating getDefaultPipeline, I get these errors:

    ERROR JdbcConnectionFactory - Could not find resources/org/apache/ctakes/dictionary/lookup/fast/ctakessnorx/ctakessnorx.script as absolute.
    ERROR JdbcRareWordDictionary - Could not Connect to Dictionary UmlsHsqlRareWord

Has anyone else encountered this before? Is there something that I should be linking that I forgot to reference? Or do I just need to update the resources folder again? Thank you.

---
Michelle Chen
RE: dictionary lookup config for best F1 measure [was RE: cTakes Annotation Comparison
Hi James, Great question. In truth, you may need to run a few times to find out. Doing that with a full pipeline would be tedious, but there is a descriptor in clinical-pipeline named CuisOnlyPlaintextUMLSProcessor.xml that will only obtain Umls cuis. It runs ~50,000 notes per hour on my laptop as-is, so I suggest that you test with that ae. It has lvg commented out by default (for speed). Adding lvg will increase the runtime, but it also will (as you know) find a few additional terms. You can try a few configurations without it and then the best option with it. If you want to test the default dictionary lookup then you can certainly swap the referenced lookup xmls.

Changes to the fast dictionary configuration are made in two places:

1. The main descriptor ...-fast/desc/analysis_engine/UmlsLookupAnnotator.xml
2. The resource (dictionary) configuration file resources/.../fast/cTakesHsql.xml

A few suggestions, in order of impact:

1. I am guessing that the annotations in clef are human annotated with longest-length spans only. In other words, "colon cancer" instead of "colon cancer" and "cancer". To best approximate this style of annotation, edit the cTakesHsql.xml in the section rareWordConsumer and change the selected implementation. By default it is DefaultTermConsumer (go figure), but you will want to use the commented-out PrecisionTermConsumer. As the comment in cTakesHsql.xml indicates, DefaultTermConsumer will persist all spans. PrecisionTermConsumer will persist only the longest overlapping span of any semantic group. Doing this should increase precision, and depending upon how good the annotations are it should not greatly change recall.

2. Just for kicks, try using SemanticCleanupTermConsumer. It may slightly increase precision, but it also may decrease recall. Hopefully it doesn't do much at all (PrecisionTermConsumer and proper semantic typing in the dictionary should suffice without this term consumer).

3. Especially for task 2 (acronyms & abbreviations), you should try a run with <name>minimumSpan</name> in UmlsLookupAnnotator.xml set to 2. This changes the minimum allowable span of a term. The default is 3 to increase precision on acronyms & abbreviations, but decreasing to 2 may improve recall on the same. The dictionary is not built with anything below 2 characters.

4. On that note (character length), if task 1 does not include acronyms & abbreviations, then you can try increasing the minimum span length above 3 and see if there is a good increase in precision without a significant decrease in recall.

5. Try a few runs with overlapping spans in addition to exact matches. To do this use the OverlapJCasTermAnnotator instead of the DefaultJCasTermAnnotator annotator implementation. DefaultJCasTermAnnotator is specified in UmlsLookupAnnotator.xml but I will check in a descriptor for overlap matching. There are additional parameters for that option, but I'll email them after I check in.

6. By default the new lookup uses Sentence as the lookup window. I did this for three reasons: (a) not all terms are within Noun Phrases, (b) some Noun Phrases overlapped, causing repeated lookups (in my 3.0 candidate trials), and (c) not all cTakes Noun Phrases are accurate. Because the lookup is fast, using a full Sentence for lookup doesn't seem to hurt much. However, you can always switch it back to see if precision is increased enough to warrant the decrease in recall. This is changed in UmlsLookupAnnotator.xml

I have run my own tests with the various setups, but I don't want to adversely influence what you run just in case the trends with the share/clef annotations differ.

Sean

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Friday, January 09, 2015 3:57 PM
To: 'dev@ctakes.apache.org'
Subject: dictionary lookup config for best F1 measure [was RE: cTakes Annotation Comparison

Sean (or others), Of the various configuration options described below, which values/choices would you recommend for best F1 measure for something like the shared clef 2013 task? https://sites.google.com/site/shareclefehealth/ I'm looking for something that doesn't have to be the best speed-wise, but that is the recommended for optimizing F1 measure.

Regards, James

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Friday, December 19, 2014 11:55 AM
To: dev@ctakes.apache.org; kim.eb...@imatsolutions.com
Subject: RE: cTakes Annotation Comparison

Well, I guess that it is time for me to speak up … I must say that I’m happy that people are showing interest in the fast lookup. I am also happy (sort of) that some concerns are being raised – and that there is now community participation in my little toy. I have some concerns about what people are reporting. This does not coincide with what I have seen at all. Yesterday I started (without knowing this thread existed
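To illustrate the first suggestion in Sean's reply above: keeping only the longest span means that "colon cancer" survives and the nested "cancer" is dropped. The sketch below shows that idea on plain begin/end offsets. It is not the cTAKES PrecisionTermConsumer code, it ignores semantic groups, and it only drops spans that lie fully inside a longer one; the LongestSpanFilter and Span names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; not the cTAKES implementation.
final class LongestSpanFilter {

    static final class Span {
        final int begin;
        final int end;   // exclusive
        final String text;

        Span(final int begin, final int end, final String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }

        int length() {
            return end - begin;
        }

        // True if this span lies entirely inside the other span and is shorter than it.
        boolean isNestedIn(final Span other) {
            return other.begin <= begin && end <= other.end && length() < other.length();
        }
    }

    // Keeps every span that is not nested inside a longer span; input order is preserved.
    static List<Span> keepLongest(final List<Span> spans) {
        final List<Span> kept = new ArrayList<>();
        for (final Span candidate : spans) {
            boolean nested = false;
            for (final Span other : spans) {
                if (candidate.isNestedIn(other)) {
                    nested = true;
                    break;
                }
            }
            if (!nested) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(final String... args) {
        final List<Span> spans = new ArrayList<>();
        spans.add(new Span(0, 12, "colon cancer"));
        spans.add(new Span(6, 12, "cancer"));
        for (final Span span : keepLongest(spans)) {
            System.out.println(span.begin + "," + span.end + " " + span.text);
        }
    }
}
```

The nested loop is quadratic in the number of spans, which seems acceptable for the handful of candidate terms in one sentence-sized lookup window.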
RE: Question about the pipeline
Hi Maite, If you can run the cpe gui using the script in bin/ , try specifying the descriptor for that: runctakesCPE -desc pathToXml If that runs then try copying the runctakesCPE to something like runctakesCLI and change the last line of the file to call CmdLineCpeRunner instead of CpmFrame. Sean p.s. check the last line of runctakesCPE script that you are using and make sure that it passes arguments: %* for Windows or $@ for unix/linux -Original Message- From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] Sent: Thursday, February 05, 2015 9:42 AM To: dev@ctakes.apache.org Subject: Re: Question about the pipeline Yes, it does but only in Eclipse, not in command line even though I am in the good directory. I have to look at the classpath more in details probably. Thanks for your replies. On Thu, Feb 5, 2015 at 8:08 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Maite, Without more information I can't venture a guess as to a cause of the error. If RunCPE works then why not use that? They are practically identical. Sean From: Maite Meseure Hugues [meseure.ma...@gmail.com] Sent: Thursday, February 05, 2015 8:51 AM To: dev@ctakes.apache.org Subject: Re: Question about the pipeline I see. In my case, I am using the CPE descriptor saved from the GUI for CmdLineCpeRunner as said Sean. I've selected AggregatePlaintextProcessor.xml as AE but I have this error: Couldn't initialize processing engine. Initialization of CAS Processor with name AggregatePlaintextProcessor failed. Meanwhile, RunCPE.java works properly with the same descriptor in Eclipse. Does anyone have an idea? On Wed, Feb 4, 2015 at 12:56 PM, Lingren, Todd todd.ling...@cchmc.org wrote: Hi Maite, For each patient in my list, I create a new FilesToFiles CPE xml using some sed commands on the template original. Specifically, here's the command line argument (I'm on linux). CTAKES_HOME=... 
    java -cp $CTAKES_HOME/lib/*:$CTAKES_HOME/desc/:$CTAKES_HOME/resources/ -Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms512M -Xmx2048M CmdLineCpeRunner FilesToFiles_patient_cui.xml outputfile.txt

I don't think it matters, but I'm using the cTAKES 3.1.0 version.

Todd Lingren
Biomedical Informatics
Cincinnati Children’s Hospital
todd.ling...@cchmc.org
513-803-9032

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com]
Sent: Wednesday, February 04, 2015 12:59 PM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

Interesting, Todd thank you and how do you use CmdLineCpeRunner basically? Because I tested in cmd line with:

    java org.apache.ctakes.core.cpe.CmdLineCpeRunner [path-to-my-cpe.xml]

but here is what I've got:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/uima/util/InvalidXMLException
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
        at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
        at java.lang.Class.getMethod0(Class.java:3010)
        at java.lang.Class.getMethod(Class.java:1776)
        at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
        ...

On Wed, Feb 4, 2015 at 8:32 AM, Lingren, Todd todd.ling...@cchmc.org wrote:

Sean and Maite, FWIW, I use CmdLineCpeRunner frequently. I employ it with a bash script to automatically create a new xml file based on the subfolder names contained in the target directory. So in our HPC, it spawns a new job for each subfolder (which may have between 5 and 2500 notes).

Todd Lingren
Biomedical Informatics
Cincinnati Children’s Hospital
todd.ling...@cchmc.org
513-803-9032

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Tuesday, February 03, 2015 2:47 PM
To: dev@ctakes.apache.org
Subject: RE: Question about the pipeline

Hi Maite, RunCPE is a good find, and if it fits your bill then you should use it. But it (if you mean the yTex class) doesn't take input and output directories from the command line. It does take the path to a CPE.xml file. There is a cTakes (non-yTex) equivalent named CmdLineCpeRunner. Either one of them should print a usage if you run it without arguments.

As the CmdLineCpeRunner indicates, you can create a cpe .xml file with the cpe gui. Basically, start the cpe gui, select your input (reader), output (writer) and pipeline (ae) in the gui and then save the cpe descriptor (via the menubar). You can exit the gui and run either one of the cmd line utilities with the path to that cpe .xml descriptor as the argument.

Please note: sometimes you have
RE: Negex
I don't know. I'm comparing what I think is the 2009 negex trigger set https://code.google.com/p/negex/source/browse/trunk/GeneralNegEx.Java.v.1.2.05092009/negex_triggers.txt with the cTakes trigger set in org.apache.ctakes.core.fsm.machine.NegationFSM.java and it looks like the cTakes set is missing some 2009 negex trigger words, such as exhibit. Anyway, you can read https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+-+NE+Contexts for info on adding triggers to the cTakes version. Sean -Original Message- From: John Green [mailto:john.travis.gr...@gmail.com] Sent: Monday, January 05, 2015 2:03 PM To: dev@ctakes.apache.org Cc: dev@ctakes.apache.org Subject: Re: Negex Thanks Ma'am for the input! So to clarify: ctakes added additional trigger words to the list published originally? (This is an unrelated question to the negex vs ml thread last month). Best, John — Sent from Mailbox On Mon, Jan 5, 2015 at 12:58 PM, Green, John john.gr...@usuhs.edu wrote: Hi all - Does anyone know off the top of their head if the negex trigger rules included in the original 2009 python script were added to when it was implemented in ctakes? Thanks, John
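The comparison described above (which negex triggers are absent from the cTakes set) amounts to a set difference. Here is a small sketch of that check, assuming each trigger list has already been loaded into a collection of strings. The class and method names are made up, and apart from "exhibit", which is the example given above, the words in main are placeholders rather than verified entries from either list.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of cTAKES or negex.
final class TriggerDiff {

    // Lower-cases and trims each trigger so the comparison ignores case and stray whitespace.
    static Set<String> normalize(final Collection<String> triggers) {
        final Set<String> normalized = new TreeSet<>();
        for (final String trigger : triggers) {
            final String cleaned = trigger.trim().toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                normalized.add(cleaned);
            }
        }
        return normalized;
    }

    // Triggers present in the reference list but absent from the candidate list, sorted.
    static Set<String> missingFrom(final Collection<String> reference,
                                   final Collection<String> candidate) {
        final Set<String> missing = normalize(reference);
        missing.removeAll(normalize(candidate));
        return missing;
    }

    public static void main(final String... args) {
        // Placeholder lists; load the real trigger sets from their sources instead.
        final Collection<String> negex2009 = Arrays.asList("no", "without", "exhibit");
        final Collection<String> ctakes = Arrays.asList("no", "without");
        System.out.println("Missing from cTakes set: " + missingFrom(negex2009, ctakes));
    }
}
```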
RE: Question about CPE/ descriptor and xml file.
Go through the error that you got, and look for a message like: Failed to initilize. Invalid UMLS License and Error: Invalid UMLS License. A UMLS License is required to use the UMLS dictionary lookup. Error: You may request one at: https://uts.nlm.nih.gov/license.html Please verify your UMLS license settings in the DictionaryLookupAnnotatorUMLS.xml configuration. If you see that message, you see a possible solution. If you have a umls username and password, make sure that they are set correctly for the cTakes run. If you don't see that message, check resources/org/apache/ctakes/dictionary/lookup/umls2011ab/umls and see if it contains a rather large .data file. If not, then go through the process detailed at http://ctakes.apache.org/downloads.cgi in the section entitled Resources. If you have the .data file, then let us know and we'll try to push forward. Sean -Original Message- From: Maite Meseure Hugues [mailto:mmhug...@medmergent.com] Sent: Monday, January 05, 2015 9:33 AM To: dev@ctakes.apache.org Subject: Question about CPE/ descriptor and xml file. Hello everyone, I am a new user of cTakes and I would like to integrate it in my code to run it programmatically. I followed the example in the cTakes package but I have an error message regarding the descriptor: [...] 03 Jan 2015 13:39:33 INFO UmlsDictionaryLookupAnnotator - Using ctakes.umlsaddr: https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser: maitemeseure Exception in thread main org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator failed. (Descriptor: unknown) Do you know how I can fix that?? My goal is to get in output the same XML file than the CPE. Thanks a lot for your time. Best regards, Maite Meseure
RE: Is it necessary to put UMLS login into files when passing them with -D to the JVM?
Hi Tom, I am passing my UMLS login and password on startup as arguments ... -Dctakes.umlsuser=myusername -Dctakes.umlspw=mypassword That is fine. If I understand correctly you are already running this way without problem. The comments in the .xml files should probably be extended to include mention of the cmd parameters. [I] downloaded [AggregatePlaintextFastUmlsProcessor.xml] from the svn and replaced the old cTAKES 3.2.1 ... I think that this should be fine. Java code for each annotator may have changed, but I don't think that any class names (by which annotators are called) have changed. The best way to know for certain is to run it, and if you haven't seen any problems then I think that you are in good shape. Sean -Original Message- From: Tom Devel [mailto:deve...@gmail.com] Sent: Friday, March 06, 2015 3:20 PM To: dev@ctakes.apache.org Subject: Is it necessary to put UMLS login into files when passing them with -D to the JVM? Hi, in AggregatePlaintextFastUMLSProcessor.xml of cTAKES it states that: [...] Please update DictionaryLookupAnnotatorUMLS.xml file with your UMLS username and password. Similarly, in AggregatePlaintextFastUMLSProcessor.xml from https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CTAKES-2D344d=BQIBaQc=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFUr=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTaom=ZOef73O4fpDF9CZPAZHmVyDZDQDa6jKWyTTU1kikj9os=7C1osQzBp5-aSIXPeqWPXcafrLDGCeEkR3sfbiJMRDQe= [...] Please update resources/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml file with your UMLS username and password I am passing my UMLS login and password on startup as arguments, when starting the either CVD/CPE or org.apache.uima.examples.cpe.SimpleRunCPE argumets such as: -Dctakes.umlsuser=myusername -Dctakes.umlspw=mypassword In such a case, it is still necessary to modify the file(s) above? 
Additional question: It seems that the AggregatePlaintextFastUMLSProcessor.xml from https://issues.apache.org/jira/browse/CTAKES-344 has some nice improvements (using DrugNER and default fast pipeline). I just downloaded it from the svn and replaced the old cTAKES 3.2.1 file with this one, and it seems to run just fine and cTAKES does annotations. Can somebody from the devs or users tell me if this manual replacement step is OK and does not break anything that I am not aware of? Many thanks for answers on any of my questions, Tom
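For readers wondering how -D arguments reach Java code at all: they become JVM system properties. The sketch below reads the two property names quoted in this thread (ctakes.umlsuser and ctakes.umlspw) and falls back to a value taken from a descriptor file when the property is absent. The class, the method and the fallback order are assumptions made for illustration; the thread only confirms that passing the -D arguments works, not how cTAKES resolves them internally.

```java
// Hypothetical helper, not part of cTAKES.
final class UmlsCredentials {

    // Property names as given on the command line in the thread.
    static final String USER_KEY = "ctakes.umlsuser";
    static final String PASS_KEY = "ctakes.umlspw";

    // Returns the -D value when it is set and non-blank, otherwise the fallback
    // (for example a value read from an XML descriptor).
    static String resolve(String key, String fallbackFromXml) {
        String value = System.getProperty(key);
        return (value != null && !value.trim().isEmpty()) ? value : fallbackFromXml;
    }

    public static void main(String[] args) {
        String user = resolve(USER_KEY, null);
        String pass = resolve(PASS_KEY, null);
        // Never print the password itself; report only whether it is present.
        System.out.println("user set: " + (user != null) + ", password set: " + (pass != null));
    }
}
```

Started with java -Dctakes.umlsuser=myusername -Dctakes.umlspw=mypassword UmlsCredentials, both values resolve from the command line and no file needs to be consulted in this sketch.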
RE: Hello cTAKES Mailing List
The CHV is a good resource for some things, but before going through the motions of porting it to a ctakes format, take a look inside. -Original Message- From: Pei Chen [mailto:chen...@apache.org] Sent: Monday, February 23, 2015 1:52 PM To: dev@ctakes.apache.org Subject: Re: Hello cTAKES Mailing List Raymond, Probably a combination of UMLS Consumer Health Vocabulary + Custom Dictionary (as Sean described) may work for the use case: "OAC CHV connects informal, common words and phrases about health to technical terms used by health care professionals. It includes jargon, slang, ambiguous, and misspelled words as used by consumers and health care professionals. Due to its nature, OAC CHV includes concepts that are not represented by other source vocabularies within the Metathesaurus." [1] http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CHV/ On Sun, Feb 22, 2015 at 10:37 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Raymond, If you use the dictionary-fast module there exists an entry "feeling bad" with cui 557911 and cui 231218. There is also "feel bad" and "feeling bad emotionally". You will find "horrible present pain" but no other entry with "horrible". You will not find any terms with "awful" and probably many other desired words. If you are really interested in slang ("crappy", "lousy", etc.) then they are definitely not present. What you can do is create a second dictionary. There are example custom dictionaries in -dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dictionary/lookup/fast/example/bsv/ You should look at custom_cui_bsv.bsv if you want to specify term unique id codes and term text alone.
If you want to add tui/group codes then look at custom_cui_tui_bsv.bsv - you will probably want to model your dictionary after this so that you can tag your terms with tuis for symptoms. You will want to imitate sections from the corresponding .xml file in that directory. Make a copy of cTakesHsql.xml (two dirs up) and add lines:
<dictionary>
  <name>CustomCuiRareWord</name>
  <implementationName>org.apache.ctakes.dictionary.lookup2.BsvRareWordDictionary</implementationName>
  <properties>
    <property key="bsvPath" value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
  </properties>
</dictionary>
And
<conceptFactory>
  <name>CustomCuiConcept</name>
  <implementationName>org.apache.ctakes.dictionary.lookup2.concept.BsvConceptFactory</implementationName>
  <properties>
    <property key="bsvPath" value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
  </properties>
</conceptFactory>
And
<dictionaryConceptPair>
  <name>CustomPair</name>
  <dictionaryName>CustomCuiRareWord</dictionaryName>
  <conceptFactoryName>CustomCuiConcept</conceptFactoryName>
</dictionaryConceptPair>
Then make sure that you point to your custom cTakesHsql.xml in dictionary-fast/desc/analysis_engine/UmlsLookupAnnotator.xml (or Overlap depending upon your use):
<name>DictionaryDescriptorFile</name>
<description/>
<fileResourceSpecifier>
  <fileUrl>file:org/apache/ctakes/dictionary/lookup/fast/cTakesHsqlYourCopy.xml</fileUrl>
</fileResourceSpecifier>
You can also skip the UMLS dictionary altogether and just use your custom dictionary. If you do give this a try then let me know how it goes. If you need additional assistance let me know and I will help the best I can. Sean -Original Message- From: Raymond Li [mailto:ray...@bu.edu] Sent: Saturday, February 21, 2015 1:26 PM To: dev@ctakes.apache.org Subject: Hello cTAKES Mailing List Hello, my name is Raymond Li and I am currently working on a team project involving cTAKES.
The goal of our project would be to use cTAKES to analyze posts on social media (such as tweets, forum posts, public available data) in order to catch in real-time any adverse effects of prescribed drugs and do a public service of protecting people from harmful drugs. Aside from this introduction, I do have only one question to ask to proceed with this project: Is cTAKES capable of understanding slang words as symptoms. An example is if I were to say I took Crestor and feeling bad is there a way for cTAKES to recognize that Crestor had a negative effect? My team has not been able to isolate 'bad' as a negative effect as it is not a defined medical symptom, but it would be nice to figure out if such a solution exists, or if we would need to develop our own solution and how we could go around doing it. My team and I would appreciate any comments or assistance regarding
URGENT! RE: New Website
Hi all, It looks like a few people (myself included) are interested in having information on people, projects, papers, and applications that use cTAKES on the web page. I have created a form on google that might help us collect this and other information. Please visit https://docs.google.com/forms/d/10ryw42aqkIf2ygjNTa_To1OgGDZzDqHizVg__Jxyuws/viewform?usp=send_form Most of the form is multiple choice, so it only takes a minute or two to complete it. The more information we have the better we can develop and promote cTAKES, so this is very important. Thank you, Sean -Original Message- From: Mohammad Alodadi [mailto:mso1...@gmail.com] Sent: Wednesday, February 25, 2015 2:09 AM To: dev@ctakes.apache.org Subject: Re: New Website I like the look of the new website. I was thinking, if someone could collect references of all the research papers, that use cTakes in their methodology, in a page and include the link in the use cases page, that would be a very great idea to see the different uses of cTakes. Sincerely, Mohammad Alodadi On Feb 24, 2015, at 8:46 PM, taposh.d@kp.org wrote: Hi Michelle - The site looks nice. Would it be possible to add link to source via svn or github. Also, case studies would help potential people. Regards, Taposh D. Roy Health Data Lead Decision Support Team Kaiser Permanente Program Office 1950 Franklin Street, 17th Floor Oakland, California 94588 510-987-4121 (Office) 510-206-1633 (cell) NOTICE TO RECIPIENT: If you are not the intended recipient of this e-mail, you are prohibited from sharing, copying, or otherwise using or disclosing its contents. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and permanently delete this e-mail and any attachments without reading, forwarding or saving them. Thank you. From: Michelle Chen miche...@apache.org To: dev@ctakes.apache.org Date: 02/24/2015 04:30 PM Subject:New Website Hello everyone, We are planning on publishing the new website on March 2, 2015. 
Here is the link to the proposed site: http://svn.apache.org/repos/asf/ctakes/site/new/index.html . (Note: Not all of the pages are fully functional yet, but we figured that this new look is exciting news and wanted feedback.) Some ideas for feedback:
1. Succinct quotations from users and devs about "How has cTakes helped you?" so that we can populate the "Why cTAKES?" page. (with permission to use information of your name, position, employer, and/or product/project)
2. Use cases (with potential screenshots) of cTAKES to populate the Examples page of GUI or other use cases. The examples page is in the process of being revamped.
3. Mobile feedback: This has not been tested on devices, but what would be needed/useful?
4. What is missing from the web page? E.g. FAQs, useful tips. Where are there broken links?
6. Anything! We welcome any suggestions or code contributions directly to the website itself.
Look forward to hearing from everyone. Have a great day. Sincerely, Michelle Chen
RE: Hello cTAKES Mailing List
Hi Raymond, If you use the dictionary-fast module there exists an entry "feeling bad" with cui 557911 and cui 231218. There is also "feel bad" and "feeling bad emotionally". You will find "horrible present pain" but no other entry with "horrible". You will not find any terms with "awful" and probably many other desired words. If you are really interested in slang ("crappy", "lousy", etc.) then they are definitely not present. What you can do is create a second dictionary. There are example custom dictionaries in -dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dictionary/lookup/fast/example/bsv/ You should look at custom_cui_bsv.bsv if you want to specify term unique id codes and term text alone. If you want to add tui/group codes then look at custom_cui_tui_bsv.bsv - you will probably want to model your dictionary after this so that you can tag your terms with tuis for symptoms. You will want to imitate sections from the corresponding .xml file in that directory. Make a copy of cTakesHsql.xml (two dirs up) and add lines:
<dictionary>
  <name>CustomCuiRareWord</name>
  <implementationName>org.apache.ctakes.dictionary.lookup2.BsvRareWordDictionary</implementationName>
  <properties>
    <property key="bsvPath" value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
  </properties>
</dictionary>
And
<conceptFactory>
  <name>CustomCuiConcept</name>
  <implementationName>org.apache.ctakes.dictionary.lookup2.concept.BsvConceptFactory</implementationName>
  <properties>
    <property key="bsvPath" value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
  </properties>
</conceptFactory>
And
<dictionaryConceptPair>
  <name>CustomPair</name>
  <dictionaryName>CustomCuiRareWord</dictionaryName>
  <conceptFactoryName>CustomCuiConcept</conceptFactoryName>
</dictionaryConceptPair>
Then make sure that you point to your custom cTakesHsql.xml in dictionary-fast/desc/analysis_engine/UmlsLookupAnnotator.xml (or Overlap depending upon your use):
<name>DictionaryDescriptorFile</name>
<description/>
<fileResourceSpecifier>
  <fileUrl>file:org/apache/ctakes/dictionary/lookup/fast/cTakesHsqlYourCopy.xml</fileUrl>
</fileResourceSpecifier>
You can also skip the UMLS dictionary altogether and just use your custom dictionary. If you do give this a try then let me know how it goes. If you need additional assistance let me know and I will help the best I can. Sean -Original Message- From: Raymond Li [mailto:ray...@bu.edu] Sent: Saturday, February 21, 2015 1:26 PM To: dev@ctakes.apache.org Subject: Hello cTAKES Mailing List Hello, my name is Raymond Li and I am currently working on a team project involving cTAKES. The goal of our project would be to use cTAKES to analyze posts on social media (such as tweets, forum posts, public available data) in order to catch in real-time any adverse effects of prescribed drugs and do a public service of protecting people from harmful drugs. Aside from this introduction, I do have only one question to ask to proceed with this project: Is cTAKES capable of understanding slang words as symptoms? An example is if I were to say "I took Crestor and feeling bad", is there a way for cTAKES to recognize that Crestor had a negative effect? My team has not been able to isolate 'bad' as a negative effect as it is not a defined medical symptom, but it would be nice to figure out if such a solution exists, or if we would need to develop our own solution and how we could go around doing it. My team and I would appreciate any comments or assistance regarding our project and this current issue. Thank you and have a nice day! -- Sincerely, Raymond Li
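To make the custom dictionary idea concrete, here is a rough sketch of reading one line of a bar-separated-value (BSV) file. The column layouts (cui|text and cui|tui|text), the lower-casing of term text, and the "//" comment prefix are assumptions for illustration; check the example .bsv files named above for the real layout. The class below is hypothetical and is not the cTAKES BsvRareWordDictionary.

```java
import java.util.Locale;

// Hypothetical parser for illustration only.
final class BsvSketch {

    // One dictionary row: concept id, optional semantic type id, and term text.
    static final class Entry {
        final String cui;
        final String tui;
        final String text;

        Entry(String cui, String tui, String text) {
            this.cui = cui;
            this.tui = tui;
            this.text = text;
        }
    }

    // Parses "cui|text" or "cui|tui|text"; returns null for blank, comment or malformed lines.
    static Entry parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("//")) { // "//" comments are an assumption
            return null;
        }
        String[] cols = trimmed.split("\\|");
        if (cols.length == 2) {
            return new Entry(cols[0].trim(), "", cols[1].trim().toLowerCase(Locale.ROOT));
        }
        if (cols.length == 3) {
            return new Entry(cols[0].trim(), cols[1].trim(), cols[2].trim().toLowerCase(Locale.ROOT));
        }
        return null;
    }

    public static void main(String[] args) {
        // Placeholder ids; T184 is the UMLS semantic type "Sign or Symptom".
        Entry entry = parseLine("C9000001|T184|Feeling Crappy");
        System.out.println(entry.cui + " " + entry.tui + " " + entry.text);
    }
}
```

A slang dictionary for the social media use case would then be a plain text file of such rows, one term per line, tagged with a symptom tui as Sean suggests.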
RE: ctakessorx for AggregatePlaintextFastUMLSProcessor.xml
Maite, You already have a thread going with me offline. If you have a question please ask it on that thread to refrain from spamming the devlist. Until I have a chance to create decent documentation you are stuck with me. Sean From: Maite Meseure Hugues [meseure.ma...@gmail.com] Sent: Friday, March 27, 2015 3:59 PM To: dev@ctakes.apache.org Subject: ctakessorx for AggregatePlaintextFastUMLSProcessor.xml Hi everyone, I am currently using AggregatePlaintextFastUMLSProcessor.xml and trying to use my own dictionary. I would like to understand ctakessnorx script file, how it's made etc, I didn't find any info. Thank you. -- -- Maïté Meseure Hugues
RE: Prep for upcoming cTAKES 3.2.2 Patch Release
+1 for pushing forward I may have been one of the voices commenting on memory bloat, but I agree with Pei re: improving the new. The more use, the more attention and more improvement (hopefully). I can't speak of the accuracy old v. new as I haven't actually comparatively tested them. And there is always the option of manually selecting another component. -Original Message- From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] Sent: Thursday, April 30, 2015 10:25 AM To: dev@ctakes.apache.org Subject: RE: Prep for upcoming cTAKES 3.2.2 Patch Release My vote would be to push forward. The old assertion module also had it's share of bugs/issues and gives an incentive to improve the new models. And there's currently always the option for a user to easily revert back to the old since it's not removed yet... --Pei -Original Message- From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] Sent: Thursday, April 30, 2015 9:14 AM To: dev@ctakes.apache.org Subject: Re: Prep for upcoming cTAKES 3.2.2 Patch Release A question about the default pipelines. There has been some concern about the new assertion modules (the machine learning ones that I worked on), partially due to some less intuitive error modes than negex and partially due to its reliance on the dependency parser which increases the memory footprint substantially. Should we consider reverting to the rule-based negation for the default pipeline (thus also removing the dependency parser from the default pipeline)? I'm not sure what that would mean for the other assertion modules (uncertainty, generic, subject, hypothetical) -- but I think it means they would not exist. I can see arguments both ways. I also think if we revert we would want to have some way for people to access all the machine learning assertion modules if they want them. 
Tim On 04/29/2015 06:04 PM, Chen, Pei wrote: FYI- I will plan to create a 3.2.2 branch from trunk this week in prep for the 3.2.2 release so others can continue their work in trunk. Feel free to put any changes in trunk now if you want to have it included in the 3.2.2 patch release. The main changes are: 1) Improved temporal models 2) Minor bug fixes reported in Jira From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] Sent: Thursday, March 12, 2015 12:55 PM To: dev@ctakes.apache.org Subject: Prep for upcoming cTAKES 3.2.2 Patch Release I was thinking of creating a 3.2.2 release for Mar (it's long past the original Jan date?) I can volunteer to be the RM again. There are still plenty of unresolved items... If you plan to have anything you would like included in the upcoming release, please mark it in Jira and plan the commits accordingly... Jira Items: https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%203.2.2%20AND%20project%20%3D%20CTAKES
1-25 of 25
T | Patch Info | Key | Summary | Assignee | Reporter | P | Status | Resolution | Created | Updated | Due
Bug | | CTAKES-349 (https://issues.apache.org/jira/browse/CTAKES-349) | JdbcWriterTemplate does not store rows if there are fewer than 100 per note | Unassigned | Sean Finan | Major | OPEN | Unresolved | 12/Mar/15 | 12/Mar/15 |
Bug | | CTAKES-347 (https://issues.apache.org/jira/browse/CTAKES-347)
RE: build tool suggestion
Your IDE should have settings that allow custom warnings. Also check out findbugs -- http://en.wikipedia.org/wiki/FindBugs There might be a configurable maven plugin. It is a process ... -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Tuesday, May 05, 2015 8:01 PM To: dev@ctakes.apache.org Subject: build tool suggestion Do you know offhand, would it be easy to have something run at build time that flags uses of FileReader? Related - do we have anything at build time that produces warnings that are looked at? When I check in a change, I just check whether the next build is successful or not. I don't look for warnings other than what I see when I try a compile of my own on my own system. Ideally I think it would be good to have the use of FileReader cause a meaningful warning. But if there's no relatively easy way to do that, might we consider having it cause a build failure? I think the benefits would outweigh the drawbacks. -- James From: Chen, Pei [pei.c...@childrens.harvard.edu] Sent: Tuesday, May 05, 2015 5:55 PM To: dev@ctakes.apache.org Subject: RE: svn commit: r1677903 - in /ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2: concept/BsvConceptFactory.java dictionary/BsvRareWordDictionary.java util/JdbcConnectionFactory.java Can we use InputStreamReader instead of FileReader? That way the resource can also be read from within a jar (potentially from maven central, etc.) and doesn't have to be fixed to a physical file... i.e. 
Instead of new BufferedReader(new FileReader(path)), use new BufferedReader(new InputStreamReader(FileLocator.getAsStream(path))) --Pei -Original Message- From: seanfi...@apache.org [mailto:seanfi...@apache.org] Sent: Tuesday, May 05, 2015 6:42 PM To: comm...@ctakes.apache.org Subject: svn commit: r1677903 - in /ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2: concept/BsvConceptFactory.java dictionary/BsvRareWordDictionary.java util/JdbcConnectionFactory.java
Author: seanfinan
Date: Tue May 5 22:41:26 2015
New Revision: 1677903
URL: http://svn.apache.org/r1677903
Log: Use FileLocator to find BSV dictionaries
Modified:
    ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java
    ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/dictionary/BsvRareWordDictionary.java
    ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/util/JdbcConnectionFactory.java
Modified: ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java
URL: http://svn.apache.org/viewvc/ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java?rev=1677903&r1=1677902&r2=1677903&view=diff
==============================================================================
--- ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java (original)
+++ ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java Tue May 5 22:41:26 2015
@@ -1,5 +1,6 @@
 package org.apache.ctakes.dictionary.lookup2.concept;
 
+import org.apache.ctakes.core.resource.FileLocator;
 import org.apache.ctakes.dictionary.lookup2.util.CuiCodeUtil;
 import org.apache.ctakes.dictionary.lookup2.util.LookupUtil;
 import org.apache.ctakes.dictionary.lookup2.util.TuiCodeUtil;
@@ -34,11 +35,12 @@ final public class BsvConceptFactory imp
    }
 
    public BsvConceptFactory( final String name, final String bsvFilePath ) {
-      this( name, new File( bsvFilePath ) );
-   }
-
-   public BsvConceptFactory( final String name, final File bsvFile ) {
-      final Collection<CuiTuiTerm> cuiTuiTerms = parseBsvFile( bsvFile );
+//      this( name, new File( bsvFilePath ) );
+//   }
+//
+//   public BsvConceptFactory( final String name, final File bsvFile ) {
+//      final Collection<CuiTuiTerm> cuiTuiTerms = parseBsvFile( bsvFile );
+      final Collection<CuiTuiTerm> cuiTuiTerms = parseBsvFile( bsvFilePath );
       final Map<Long, Concept> conceptMap = new HashMap( cuiTuiTerms.size() );
       for ( CuiTuiTerm cuiTuiTerm : cuiTuiTerms ) {
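Pei's point about InputStreamReader can be illustrated without any cTAKES classes. The sketch below opens a resource from the classpath first (so it also works for files packed inside a jar) and falls back to the file system. It uses only the JDK and is a stand-in for the idea, not the implementation of FileLocator.getAsStream; the class name, the lookup order and the UTF-8 charset are assumptions.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a resource locator, JDK only.
final class ResourceReaderSketch {

    // Tries the classpath first, then the file system (assumed order).
    static InputStream open(String path) throws IOException {
        InputStream in = ResourceReaderSketch.class.getClassLoader().getResourceAsStream(path);
        if (in != null) {
            return in;
        }
        Path file = Paths.get(path);
        if (Files.isRegularFile(file)) {
            return Files.newInputStream(file);
        }
        throw new FileNotFoundException("Not on classpath or file system: " + path);
    }

    // Reads all lines through a stream-based reader instead of a FileReader.
    static List<String> readLines(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(path), StandardCharsets.UTF_8))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        for (String line : readLines(args[0])) {
            System.out.println(line);
        }
    }
}
```

The design choice is the one argued for in the thread: a FileReader can only ever see a physical file, while a stream obtained from the class loader works the same whether the dictionary sits in a directory or inside a dependency jar.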
RE: UMLS Authentication failing despite correct username and password
Hi Pedro, Check the cTakesHsql.xml and make sure that the line matches: <property key="umlsUrl" value="https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser"/> In an older version of cTAKES, one that prints an output message like the one you have (11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Default - Loading dictionary into memory. Initial run may take few mins to load. Please be patient...), that line got corrupted. Sean -Original Message- From: Pedro Teixeira [mailto:teixeir...@gmail.com] Sent: Monday, May 11, 2015 5:30 PM To: dev@ctakes.apache.org Subject: UMLS Authentication failing despite correct username and password So I've checked the Dictionary lookup XML file and that password works to log in via the website. This was also working last week but stopped at some point over the last week. I've got cTAKES running on a linux system so I can index batches of documents via a script. The exact error is as follows (with the username/password blocked out). 11 May 2015 15:59:26 INFO LvgCmdApiResourceImpl - cwd = /home/PT/cTAKES/apache-ctakes-3.2.1 11 May 2015 15:59:26 INFO LvgCmdApiResourceImpl - cd /home/PT/cTAKES/apache-ctakes-3.2.1/resources/org/apache/ctakes/lvg/ 11 May 2015 15:59:27 INFO LvgCmdApiResourceImpl - cd /home/PT/cTAKES/apache-ctakes-3.2.1 11 May 2015 15:59:27 INFO ClearNLPDependencyParserAE - using Morphy analysis? true Loading configuration. Loading feature templates. Loading lexica. Loading model: 11 May 2015 15:59:42 INFO Chunker - Chunker model file: org/apache/ctakes/chunker/models/chunker-model.zip 11 May 2015 15:59:44 INFO ContextDependentTokenizerAnnotator - Finite state machines loaded. 11 May 2015 15:59:44 INFO ConstituencyParser - Initializing parser...
11 May 2015 15:59:46 INFO ContextAnnotator - SCOPE ORDER: [1, 3] 11 May 2015 15:59:46 INFO NegationContextAnalyzer - initBoundaryData() called for ContextInitializer 11 May 2015 15:59:47 INFO POSTagger - POS tagger model file: org/apache/ctakes/postagger/models/mayo-pos.zip 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Default - Loading dictionary into memory. Initial run may take few mins to load. Please be patient... 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Using dictionary lookup window type: org.apache.ctakes.typesystem.type.textspan.Sentence 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Exclusion tagset loaded: CC CD DT EX IN LS MD PDT POS PP PP$ PRP PRP$ RP TO VB VBD VBG VBN VBP VBZ WDT WP WPS WRB 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Using minimum term text span: 3 11 May 2015 15:59:47 INFO DictionaryDescriptorParser - Parsing dictionary specifications: /home/PT/cTAKES/apache-ctakes-3.2.1/resources/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml 11 May 2015 15:59:48 ERROR UmlsUserApprover - UMLS Account at https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser is not valid for user # with ## Couldn't initialize processing engine. Initialization of CAS Processor with name AggregatePlaintextFastUMLSProcessor failed. I also have a test implementation on a local Windows 8 laptop that also fails now due to the same error, so it seems like it's a UMLS-related issue, but I haven't heard back from them yet and was hoping perhaps someone with cTAKES has previously experienced and resolved the issue. Thanks!
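Since the suspected cause here is a corrupted umlsUrl value, a quick sanity check on the configured string can save a debugging round trip. The sketch below accepts a value only if it has no whitespace, parses as a URI, uses https, and points at the host named in Sean's reply. The class and method are hypothetical and nothing like them is claimed to exist in cTAKES; the expected URL is the one quoted in this thread.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sanity check, not part of cTAKES.
final class UmlsUrlCheck {

    // Value quoted in the reply above.
    static final String EXPECTED = "https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser";
    static final String EXPECTED_HOST = "uts-ws.nlm.nih.gov";

    // True when the configured value looks like an untouched https URL on the expected host.
    static boolean looksIntact(String configured) {
        if (configured == null || configured.isEmpty()) {
            return false;
        }
        for (int i = 0; i < configured.length(); i++) {
            if (Character.isWhitespace(configured.charAt(i))) {
                return false; // stray spaces, e.g. from line wrapping or manual edits
            }
        }
        try {
            URI uri = new URI(configured);
            return "https".equalsIgnoreCase(uri.getScheme())
                    && EXPECTED_HOST.equalsIgnoreCase(uri.getHost());
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String value = args.length > 0 ? args[0] : EXPECTED;
        System.out.println(value + " -> " + (looksIntact(value) ? "looks intact" : "looks corrupted"));
    }
}
```

A value that was rewritten by a mail gateway or pasted with spaces fails the check, which is the kind of damage discussed in this thread.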
RE: UMLS Authentication failing despite correct username and password
Argh. Our email server may have mucked with the url that I pasted: H t t p s : / / uts - ws . nlm . nih . gov / restful / isValidUMLSUser property key=umlsUrl value= INSERT URL HERE, NO SPACES / -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Monday, May 11, 2015 5:38 PM To: dev@ctakes.apache.org Subject: RE: UMLS Authentication failing despite correct username and password Hi Pedro, Check the cTakesHsql.xml and make sure that the line matches: property key=umlsUrl value=https://urldefense.proofpoint.com/v2/url?u=https-3A__uts-2Dws.nlm.nih.gov_restful_isValidUMLSUserd=BQIGaQc=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFUr=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTaom=bSJDuEveKkCQoYKfh2CwhxDx8I92siVZvxm45BoxGtEs=A5wwcyQgQrPQ_dWwnaF-QHqZb0ttus_rzS-A6UDh-S8e= / In an older version of cTAKES with an output message as you have: 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Default - Loading dictionary into memory. Initial run may take few mins to load. Please be patient... That line got corrupted. Sean -Original Message- From: Pedro Teixeira [mailto:teixeir...@gmail.com] Sent: Monday, May 11, 2015 5:30 PM To: dev@ctakes.apache.org Subject: UMLS Authentication failing despite correct username and password So I've checked the Dictionary lookup XML file and that password works to log in via the website. This was also working last week but stopped at some point over the last week. I've got cTAKES running on a linux system so I can index batches of documents via a script. The exact error is as follows (with the username/password blocked out). 11 May 2015 15:59:26 INFO LvgCmdApiResourceImpl - cwd = /home/PT/cTAKES/apache-ctakes-3.2.1 11 May 2015 15:59:26 INFO LvgCmdApiResourceImpl - cd /home/PT/cTAKES/apache-ctakes-3.2.1/resources/org/apache/ctakes/lvg/ 11 May 2015 15:59:27 INFO LvgCmdApiResourceImpl - cd /home/PT/cTAKES/apache-ctakes-3.2.1 11 May 2015 15:59:27 INFO ClearNLPDependencyParserAE - using Morphy analysis? 
true Loading configuration. Loading feature templates. Loading lexica. Loading model: 11 May 2015 15:59:42 INFO Chunker - Chunker model file: org/apache/ctakes/chunker/models/chunker-model.zip 11 May 2015 15:59:44 INFO ContextDependentTokenizerAnnotator - Finite state machines loaded. 11 May 2015 15:59:44 INFO ConstituencyParser - Initializing parser... 11 May 2015 15:59:46 INFO ContextAnnotator - SCOPE ORDER: [1, 3] 11 May 2015 15:59:46 INFO NegationContextAnalyzer - initBoundaryData() called for ContextInitializer 11 May 2015 15:59:47 INFO POSTagger - POS tagger model file: org/apache/ctakes/postagger/models/mayo-pos.zip 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Default - Loading dictionary into memory. Initial run may take few mins to load. Please be patient... 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Using dictionary lookup window type: org.apache.ctakes.typesystem.type.textspan.Sentence 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Exclusion tagset loaded: CC CD DT EX IN LS MD PDT POS PP PP$ PRP PRP$ RP TO VB VBD VBG VBN VBP VBZ WDT WP WPS WRB 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Using minimum term text span: 3 11 May 2015 15:59:47 INFO DictionaryDescriptorParser - Parsing dictionary specifications: /home/PT/cTAKES/apache-ctakes-3.2.1/resources/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml 11 May 2015 15:59:48 ERROR UmlsUserApprover - UMLS Account at https://urldefense.proofpoint.com/v2/url?u=https-3A__uts-2Dws.nlm.nih.gov_restful_isValidUMLSUserd=BQIBaQc=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFUr=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTaom=oVzYGAl69NhMu6lQpKeatJrIGk2o_z2AZvjq7Z5J69gs=_JNevHgYhyKm5PjIyFlYxIS1UWuR7J-n5V551hou2dMe= is not valid for user # with ## Couldn't initialize processing engine. Initialization of CAS Processor with name AggregatePlaintextFastUMLSProcessor failed. 
I also have a test implementation on a local windows 8 laptop that also fails now due to the same error so it seems like it's UMLS related issue but I haven't heard back from them yet and was hoping perhaps someone with cTAKES has previously experienced and resolved the issue. Thanks!
RE: build tool suggestion
I understood that. I check warnings before checkin. You can do a search for something like https://wiki.jenkins-ci.org/display/JENKINS/Warnings+Plugin -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Wednesday, May 06, 2015 10:58 AM To: 'dev@ctakes.apache.org' Subject: RE: build tool suggestion Sorry, I wasn't clear, when I said at build time, I meant the Jenkins automated build. -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Wednesday, May 06, 2015 9:52 AM To: dev@ctakes.apache.org Subject: RE: build tool suggestion Your IDE should have settings that allow custom warnings. Also check out findbugs -- http://en.wikipedia.org/wiki/FindBugs There might be a configurable maven plugin. It is a process ... -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Tuesday, May 05, 2015 8:01 PM To: dev@ctakes.apache.org Subject: build tool suggestion Do you know offhand, would it be easy to have something run at build time that flags uses of FileReader? Related - do we have anything at build time that produces warnings that are looked at? When I check in a change, I just check whether the next build is successful or not. I don't look for warnings other than what I see when I try a compile of my own on my own system. Ideally I think it would be good to have the use of FileReader cause a meaningful warning. But if there's no relatively easy way to do that, might we consider having it cause a build failure? I think the benefits would outweigh the drawbacks.
-- James From: Chen, Pei [pei.c...@childrens.harvard.edu] Sent: Tuesday, May 05, 2015 5:55 PM To: dev@ctakes.apache.org Subject: RE: svn commit: r1677903 - in /ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2: concept/BsvConceptFactory.java dictionary/BsvRareWordDictionary.java util/JdbcConnectionFactory.java Can we use InputStreamReader instead of FileReader? That way the resource can also be read from within a jar (potentially from maven central, etc.) and doesn't have to be fixed to a physical file... i.e. Instead of new BufferedReader(new FileReader(path)) new BufferedReader(new InputStreamReader(FileLocator.getAsStream(path))) --Pei -Original Message- From: seanfi...@apache.org [mailto:seanfi...@apache.org] Sent: Tuesday, May 05, 2015 6:42 PM To: comm...@ctakes.apache.org Subject: svn commit: r1677903 - in /ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2: concept/BsvConceptFactory.java dictionary/BsvRareWordDictionary.java util/JdbcConnectionFactory.java Author: seanfinan Date: Tue May 5 22:41:26 2015 New Revision: 1677903 URL: https://urldefense.proofpoint.com/v2/url?u=http-3A__svn.apache.org_r1677903d=BQICaQc=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFUr=huK2MFkj300qccT8OSuuoYhy_xEYujfPwiAxhPVz5WYm=9sLhiql1kiKYdaC8Nx3dTASt89nXQA3uy4kwesnHIags=wuwFl1DxU-yGWdGewROupvowHfYFay_u5LYKJUJF2VAe= Log: Use FileLocator to find BSV dictionaries Modified: ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/dictionary/BsvRareWordDictionary.java ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/util/JdbcConnectionFactory.java Modified: ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java URL: 
http://svn.apache.org/viewvc/ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java?rev=1677903&r1=1677902&r2=1677903&view=diff
==
--- ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java (original)
+++ ctakes/trunk/ctakes-dictionary-lookup-fast/src/main/java/org/apache/ctakes/dictionary/lookup2/concept/BsvConceptFactory.java Tue May 5 22:41:26 2015
@@ -1,5 +1,6 @@
 package org.apache.ctakes.dictionary.lookup2.concept;
+import org.apache.ctakes.core.resource.FileLocator;
 import org.apache.ctakes.dictionary.lookup2.util.CuiCodeUtil;
 import
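Pei's suggestion in this thread, reading through an InputStreamReader so that the dictionary can come from inside a jar rather than only from a physical file, can be sketched roughly as below. This is a minimal illustration, not the committed cTAKES code: the class StreamLineReader and both of its methods are hypothetical names, and openFromClasspath is only a plain-JDK stand-in for FileLocator.getAsStream(path), whose actual lookup rules are not shown in the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of cTAKES.
class StreamLineReader {

    // Reads every line from any InputStream (file, jar entry, in-memory bytes).
    // Unlike FileReader, the source does not have to be a file on disk.
    static List<String> readLines(InputStream stream) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Assumed stand-in for FileLocator.getAsStream(path): looks on the classpath only.
    static InputStream openFromClasspath(String path) {
        InputStream stream = StreamLineReader.class.getClassLoader().getResourceAsStream(path);
        if (stream == null) {
            throw new IllegalArgumentException("Resource not found on classpath: " + path);
        }
        return stream;
    }
}
```

A caller would combine the two, for example readLines(openFromClasspath(somePath)), where somePath is whatever BSV resource path the project actually uses. Passing an explicit charset also avoids the platform-default encoding that older FileReader constructors rely on.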
RE: UMLS Authentication failing despite correct username and password
Hi Pedro, B). If the user has already downloaded the UMLS isn't that already indicative that they had a valid account? As I understand it (I wasn't around at the time) this per-user licensing with a jit check was the deal that was worked out with the NLM. I think that repackaging and redistributing any form of the UMLS was not (legally) done before ctakes worked out the current arrangement. I think have heard ytex had an initial check upon installation, and we have talked about (would like to) use this model. The only drawback is a single download, multiple install site distribution possibility - which NLM didn't like. My information could be woefully outdated or just plain wrong. If anybody out there knows better then please chip in. Sean P.S. If anybody would like to try to advocate a different arrangement with the NLM then that would be great. -Original Message- From: Pedro [mailto:teixeir...@gmail.com] Sent: Thursday, May 14, 2015 9:43 AM To: dev@ctakes.apache.org Subject: Re: UMLS Authentication failing despite correct username and password Agreed. Doing a direct string comparison seems like it will just break at the very next update. A). A check to parse the XML result looking for a result tag and that the contents are True seems better B). I'm not familiar with the history of that particular check but it seems overly restrictive to require a valid UMLS account check for every single run. If the user has already downloaded the UMLS isn't that already indicative that they had a valid account? I realize there are more ways around it in that case but requiring an internet connection just to run one of the UMLS analysis engines seems... suboptimal. Thanks for all the help sorting this out!
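Pedro's point A) above, parsing the XML reply and checking the Result element instead of comparing the whole line as a string, might look something like the sketch below. It uses only the JDK's built-in DOM parser. The class and method names (UmlsResponseCheck, isResultTrue) are made up for illustration, and this is not the fix that was checked in for CTAKES-359.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not the cTAKES implementation.
class UmlsResponseCheck {

    // Returns true only if the body parses as XML whose root element is
    // <Result> with text content "true". An XML declaration in front of the
    // element, or its absence, makes no difference to the outcome.
    static boolean isResultTrue(String responseBody) {
        if (responseBody == null || responseBody.trim().isEmpty()) {
            return false;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations; a validation reply should never need one.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Element root = builder
                    .parse(new InputSource(new StringReader(responseBody.trim())))
                    .getDocumentElement();
            return "Result".equalsIgnoreCase(root.getTagName())
                    && "true".equalsIgnoreCase(root.getTextContent().trim());
        } catch (Exception e) {
            // Anything that is not well-formed XML counts as "not validated".
            return false;
        }
    }
}
```

With this approach both reply formats reported in the thread would be accepted, so a change in the service's XML prolog alone would not break authentication.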
RE: UMLS Authentication failing despite correct username and password
Hi Michal, Thank you very much for pinpointing the problem. Pei created Jira CTAKES-359. I checked in a fix for both the -old- and -fast- dictionary lookups. I also reported the problem to the UMLS people and forwarded your discovery to their mailing list. Unfortunately, all ctakes users need to upgrade to today's trunk version - or at least incorporate the required changes. Pei is making sure that it gets moved into the release candidate. Cheers, Sean -Original Message- From: michal.iglew...@uqo.ca [mailto:michal.iglew...@uqo.ca] Sent: Monday, May 11, 2015 11:27 PM To: dev@ctakes.apache.org Subject: RE: UMLS Authentication failing despite correct username and password Hi Pedro and Sean, It seems to me that the service https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser now returns <?xml version='1.0' encoding='UTF-8'?><Result>true</Result> instead of <Result>true</Result>. It means that the line result = line.trim().equalsIgnoreCase("<Result>true</Result>"); in isValidUMLSUser() should be replaced with result = line.trim().equalsIgnoreCase("<?xml version='1.0' encoding='UTF-8'?><Result>true</Result>"); Michal -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: May-11-15 5:41 PM To: dev@ctakes.apache.org Subject: RE: UMLS Authentication failing despite correct username and password Argh. Our email server may have mucked with the url that I pasted: H t t p s : / / uts - ws . nlm . nih . 
gov / restful / isValidUMLSUser <property key="umlsUrl" value=" INSERT URL HERE, NO SPACES "/> -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Monday, May 11, 2015 5:38 PM To: dev@ctakes.apache.org Subject: RE: UMLS Authentication failing despite correct username and password Hi Pedro, Check the cTakesHsql.xml and make sure that the line matches: <property key="umlsUrl" value="https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser"/> In an older version of cTAKES with an output message as you have: 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Default - Loading dictionary into memory. Initial run may take few mins to load. Please be patient... That line got corrupted. Sean -Original Message- From: Pedro Teixeira [mailto:teixeir...@gmail.com] Sent: Monday, May 11, 2015 5:30 PM To: dev@ctakes.apache.org Subject: UMLS Authentication failing despite correct username and password So I've checked the Dictionary lookup XML file and that password works to log in via the website. This was also working last week but stopped at some point over the last week. I've got cTAKES running on a linux system so I can index batches of documents via a script. The exact error is as follows (with the username/password blocked out). 11 May 2015 15:59:26 INFO LvgCmdApiResourceImpl - cwd = /home/PT/cTAKES/apache-ctakes-3.2.1 11 May 2015 15:59:26 INFO LvgCmdApiResourceImpl - cd /home/PT/cTAKES/apache-ctakes-3.2.1/resources/org/apache/ctakes/lvg/ 11 May 2015 15:59:27 INFO LvgCmdApiResourceImpl - cd /home/PT/cTAKES/apache-ctakes-3.2.1 11 May 2015 15:59:27 INFO ClearNLPDependencyParserAE - using Morphy analysis? true Loading configuration. Loading feature templates. Loading lexica. 
Loading model: 11 May 2015 15:59:42 INFO Chunker - Chunker model file: org/apache/ctakes/chunker/models/chunker-model.zip 11 May 2015 15:59:44 INFO ContextDependentTokenizerAnnotator - Finite state machines loaded. 11 May 2015 15:59:44 INFO ConstituencyParser - Initializing parser... 11 May 2015 15:59:46 INFO ContextAnnotator - SCOPE ORDER: [1, 3] 11 May 2015 15:59:46 INFO NegationContextAnalyzer - initBoundaryData() called for ContextInitializer 11 May 2015 15:59:47 INFO POSTagger - POS tagger model file: org/apache/ctakes/postagger/models/mayo-pos.zip 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Default - Loading dictionary into memory. Initial run may take few mins to load. Please be patient... 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Using dictionary lookup window type: org.apache.ctakes.typesystem.type.textspan.Sentence 11 May 2015 15:59:47 INFO AbstractJCasTermAnnotator - Exclusion tagset loaded: CC CD DT EX IN LS MD PDT POS PP PP$ PRP PRP$ RP TO VB VBD VBG VBN VBP VBZ WDT WP WPS WRB 11 May 2015 15
RE: DB DictionaryLookupAnnotator sqlserver exception
Hi Alex, This is some pretty odd behavior. Obviously, it is indicating that the resource type loaded or specified is not the correct class. Specification is (for the standard UMLS pipeline) in ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml lines #226 and #289. Both should be <implementationName>org.apache.ctakes.core.resource.JdbcConnectionResourceImpl</implementationName> There is an identical specification on line #352, but that is for Orangebook which (I'm pretty sure) is no longer used and I think that this is one of a couple sections that was missed during refactoring, so you can ignore it. If you are running from source then you could try editing org.apache.ctakes.dictionary.lookup.ae.LookupParseUtilities.java lines #140, #141 and add to the exception message something like + " instead of " + (extResrc == null ? "NULL" : extResrc.getClass().getName() ) To find out what it thinks that it has underfoot. Sean From: Milinovich, Alex [mailto:mili...@ccf.org] Sent: Wednesday, April 15, 2015 12:50 PM To: dev@ctakes.apache.org Subject: DB DictionaryLookupAnnotator sqlserver exception Attempting to use the sqlserver jdbc connection for the DictionaryLookupAnnotator. When loading the aggregate engine, the connection is established fine, but then it gives the error - java.lang.Exception: Expected external resource to be:interface org.apache.ctakes.core.resource.JdbcConnectionResource at org.apache.ctakes.dictionary.lookup.ae.LookupParseUtilities.parseDictionaryXml(LookupParseUtilities.java:140) at org.apache.ctakes.dictionary.lookup.ae.LookupParseUtilities.parseDictionaries(LookupParseUtilities.java:94) at org.apache.ctakes.dictionary.lookup.ae.LookupParseUtilities.parseDescriptor(LookupParseUtilities.java:80) at org.apache.ctakes.dictionary.lookup.ae.DictionaryLookupAnnotator.configInit(DictionaryLookupAnnotator.java:88) ... 26 more Any ideas as to why this isn't working? 
Alex Milinovich | System Analyst III | Quantitative Health Sciences 9500 Euclid Ave. - JJN3 | Cleveland, OH 44195 | p: (216) 444-9931 | m: (216) 245-7655
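Sean's debugging suggestion in this thread, extending the exception message so that it names the class that was actually found, can be illustrated with a small stand-alone sketch. The class ResourceTypeCheck and its methods are hypothetical and exist only to show the idea; the real check lives in LookupParseUtilities and its exact wording differs.

```java
// Hypothetical helper for illustration only; not the LookupParseUtilities code.
class ResourceTypeCheck {

    // Builds a message that names both the expected type and what was
    // actually supplied, so a mismatch can be diagnosed from the log alone.
    static String describeMismatch(Class<?> expected, Object actual) {
        return "Expected external resource to be: " + expected.getName()
                + " instead of "
                + (actual == null ? "NULL" : actual.getClass().getName());
    }

    // Returns the resource cast to the expected type, or fails with the
    // descriptive message above.
    static <T> T requireType(Class<T> expected, Object actual) {
        if (!expected.isInstance(actual)) {
            throw new IllegalStateException(describeMismatch(expected, actual));
        }
        return expected.cast(actual);
    }
}
```

If the reported class turns out to be a different implementation, or the same class name loaded from a second jar, that would point toward a descriptor or classpath problem rather than a database one.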
RE: TimeLanes
Hi Maashu, TimeLanes is currently a prototype gui under development and there is probably no information about it on the web. It is in sandbox because it isn't part of the ctakes release and is missing much needed functionality. For instance, it should display basic information about the patient and note (name, birth date, note date), but such things are often in structured data or some custom header of the note. Right now TimeLanes does not fetch them at all (it will require custom readers) and just displays "Dan Testing". If you want to run it, the main class is org.chboston.cnlp.timeline.gui.main.TimelineMain . Upon startup it will display "open a note". You can use the Open button or drag a file into the box. Unfortunately, it does not yet run ctakes (coming soon), so you need to give it an annotated (protégé or Anafora) note or .xmi . Using an .xmi would probably be easiest as you can create it with ctakes. You can watch an outdated video here: https://www.youtube.com/watch?v=Kp9YE0o3urU&feature=youtu.be Sean -Original Message- From: maa...@gmail.com [mailto:maa...@gmail.com] Sent: Friday, June 12, 2015 1:18 PM To: dev@ctakes.apache.org Subject: TimeLanes Hi All, I've just started working with cTAKES and was curious about TimeLanes. I found it in the sandbox here: https://svn.apache.org/repos/asf/ctakes/sandbox/timelanes/ But I'm lost on how to actually use it. I've googled around but there seems to be very little information on it. Can anyone point me in the right direction? Thanks in advance! Cheers, -Maashu -- If you are immune to boredom, there is literally nothing you cannot accomplish. -David Foster Wallace
RE: RareWord term
Hi Maite, I hope to have a paper out on this soon, so I am keeping things kind of quiet about it - though one can always look at the database and code to get an idea of what it means. For anything else in the module, you can look at the wiki page: https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+-+Fast+Dictionary+Lookup Sean -Original Message- From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] Sent: Thursday, June 18, 2015 12:02 PM To: dev@ctakes.apache.org Subject: RareWord term Hi everyone, I am currently using UmlsJdbcRareWordDictionary and I would like to better understand how is chosen the rare word term. I found this comment ' Dictionary used to lookup terms by the most rare word within them' but no more explanation, does anyone have any pointers? Thank you in advance. Maite
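The only description available in this thread is the quoted code comment, "Dictionary used to lookup terms by the most rare word within them". One plausible reading of that sentence, offered purely as an illustration and not as a description of how UmlsJdbcRareWordDictionary actually picks its word, is to index every term under whichever of its tokens occurs least often across the whole dictionary. Everything in the sketch below (class name, tokenization, tie-breaking by first occurrence) is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; the real cTAKES selection rule may differ.
class RareWordIndexSketch {

    // Assumed tokenization: lower-case, split on whitespace.
    static String[] tokenize(String term) {
        return term.trim().toLowerCase().split("\\s+");
    }

    // Counts how often each token occurs across all dictionary terms.
    static Map<String, Integer> countTokens(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            for (String token : tokenize(term)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Picks the token with the lowest count; ties go to the earliest token.
    static String rarestToken(String term, Map<String, Integer> counts) {
        String rarest = null;
        int best = Integer.MAX_VALUE;
        for (String token : tokenize(term)) {
            int count = counts.getOrDefault(token, 0);
            if (count < best) {
                best = count;
                rarest = token;
            }
        }
        return rarest;
    }

    // Maps each rarest token to the terms that are filed under it.
    static Map<String, List<String>> buildIndex(List<String> terms) {
        Map<String, Integer> counts = countTokens(terms);
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (String term : terms) {
            index.computeIfAbsent(rarestToken(term, counts), k -> new ArrayList<>()).add(term);
        }
        return index;
    }
}
```

Under this reading, a lookup only has to consider a term when its index word appears in the text, which keeps the candidate list short for common words. Whether the module works this way is best confirmed against the database and code, as Sean suggests.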
RE: TimeLanes
Just for clarification, TimeLanes does consume ctakes output (.xmi), but it does not produce it. In other words, you cannot hand it a plain text file and expect automatic processing. Yet. -Original Message- From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu] Sent: Monday, June 22, 2015 3:02 PM To: dev@ctakes.apache.org Subject: RE: TimeLanes The cTAKES temporal component is in the main release. You can get the system output, but as Sean said TimeLanes does not consume it yet. A demo of the cTAKES temporal component can be found in Getting Started - Demos. Pei just put it up there, thank you very much, Pei! --Guergana -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Monday, June 22, 2015 11:36 AM To: dev@ctakes.apache.org Subject: RE: TimeLanes Hi Maashu, TimeLanes is currently a prototype gui under development and there is probably no information about it on the web. It is in sandbox because it isn't part of the ctakes release and is missing much needed functionality. For instance, It should display basic information about the patient and note (name, birth date, note date), but such things are often in structured data or some custom header of the note. Right now TimeLanes does not fetch them at all (it will require custom readers) and just displays Dan Testing. If you want to run it, the main class is org.chboston.cnlp.timeline.gui.main.TimelineMain . Upon startup it will display open a note. You can use the Open button or drag a file into the box. Unfortunately, it does not yet run ctakes (coming soon), so you need to give it an annotated (protégé or Anafora) note or .xmi . Using an .xmi would probably be easiest as you can create it with ctakes. 
You can watch an outdated video here: https://www.youtube.com/watch?v=Kp9YE0o3urU&feature=youtu.be Sean -Original Message- From: maa...@gmail.com [mailto:maa...@gmail.com] Sent: Friday, June 12, 2015 1:18 PM To: dev@ctakes.apache.org Subject: TimeLanes Hi All, I've just started working with cTAKES and was curious about TimeLanes. I found it in the sandbox here: https://svn.apache.org/repos/asf/ctakes/sandbox/timelanes/ But I'm lost on how to actually use it. I've googled around but there seems to be very little information on it. Can anyone point me in the right direction? Thanks in advance! Cheers, -Maashu -- If you are immune to boredom, there is literally nothing you cannot accomplish. -David Foster Wallace
RE: cTakes - hsqldb connection problem
Hi Pankaj, I haven't seen this exact error before. I guess that my first steps toward a possible remedy would be:
- check for existence of /org/apache/ctakes/dictionary/lookup/umls2011ab/umls.properties
- make sure that it (resources/) is in your classpath
- see if it looks like any of the umls2011ab/ files were not fully downloaded (ls -l : 99069136, 410610240, 1295, 705)
I looked at the hsql source a little bit and can't really make heads or tails of why you'd get a 452 error (file input/output) associated with a null pointer exception (NPE) with the file path actually listed. I didn't look too far into the tree but it doesn't look like it is thrown by any of the main entry points. Do you have more than one version of hsql installed? I only ask because the single report of a similar error message (452: NPE) that I found on the web reported it solved when they equalized all the versions. It doesn't make sense to me, but it is something to check. Sean -Original Message- From: Pankaj Shinde [mailto:pankaj.shi...@krixi.com] Sent: Tuesday, June 02, 2015 2:46 AM To: dev@ctakes.apache.org Subject: cTakes - hsqldb connection problem Hi, I have done the following to get cTakes working.
1. Created a Java project
2. Created a Java class
3. Instantiated the BagOfCUIsGenerator class with two arguments, input folder and output folder.
4. Added all required files to this Java project.
When I try to run the application I get the following error. I ran the application in 'Debug' mode and traced the exception. I found out that the exception is raised in JdbcConnectionResourceImpl.java at line number 109, where iv_conn is null. It seems that the application is not properly connecting to the hsqldb database. 
Error is as follows
Loading model:
.
Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:
Loading model:
.
Exception in thread "main" org.apache.uima.resource.ResourceInitializationException
	at org.apache.ctakes.core.resource.JdbcConnectionResourceImpl.load(JdbcConnectionResourceImpl.java:130)
	at org.apache.uima.resource.impl.ResourceManager_impl.registerResource(ResourceManager_impl.java:603)
	at org.apache.uima.resource.impl.ResourceManager_impl.initializeExternalResources(ResourceManager_impl.java:442)
	at org.apache.uima.resource.Resource_ImplBase.initialize(Resource_ImplBase.java:153)
	at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.initialize(AnalysisEngineImplBase.java:157)
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:123)
	at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
	at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
	at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
	at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:387)
	at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:254)
	at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:431)
	at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:375)
	at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:185)
	at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
	at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
	at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
	at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:314)
	at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:425)
	at org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineFromPath(AnalysisEngineFactory.java:773)
	at org.apache.ctakes.clinicalpipeline.runtime.BagOfAnnotationsGenerator.<init>(BagOfAnnotationsGenerator.java:60)
	at org.apache.ctakes.clinicalpipeline.runtime.BagOfAnnotationsGenerator.<init>(BagOfAnnotationsGenerator.java:54)
	at org.apache.ctakes.clinicalpipeline.runtime.BagOfCUIsGenerator.<init>(BagOfCUIsGenerator.java:34)
	at com.krixi.cTakesDemo.main(cTakesDemo.java:12)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.sql.SQLException: File input/output error /org/apache/ctakes/dictionary/lookup/umls2011ab/umls.properties java.lang.NullPointerException
	at
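The first two checks suggested in the reply above (does umls.properties exist, and is resources/ on the classpath) can be probed from inside the same project with a few lines of plain Java. The sketch below is only a diagnostic aid under the assumption that the dictionary is expected to be resolved as a classpath resource; ClasspathProbe and locate are hypothetical names and nothing here comes from cTAKES or HSQLDB.

```java
import java.net.URL;

// Hypothetical diagnostic helper; not part of cTAKES or HSQLDB.
class ClasspathProbe {

    // Returns the URL the class loader resolves for the resource,
    // or null when the resource is not visible on the classpath.
    static String locate(String resourcePath) {
        // ClassLoader.getResource expects no leading slash.
        String normalized = resourcePath.startsWith("/")
                ? resourcePath.substring(1)
                : resourcePath;
        URL url = ClasspathProbe.class.getClassLoader().getResource(normalized);
        return url == null ? null : url.toString();
    }

    public static void main(String[] args) {
        // Path taken from the error message in this thread.
        String path = "/org/apache/ctakes/dictionary/lookup/umls2011ab/umls.properties";
        String found = locate(path);
        System.out.println(found == null
                ? "NOT on classpath: " + path
                : "Found at: " + found);
    }
}
```

Running main from the same IDE run configuration as the failing program shows whether the resources directory is really visible there, and the printed location also reveals which copy is picked up if more than one is present.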
RE: The fast dictionary pipeline vs. the regular one
Hi Oranit, "Each is the Preferred Term in at least one of the 150 sources in the Metathesaurus. Neither is from a WHO vocabulary source. The terms are related in that Glioblastoma is the Broader term (RB) of the 2 and Glioblastoma Multiforme is the Narrower term (RN)." Hmmm, I'm not sure why they assigned narrower and broader ... The two are from different source dictionaries and not related in such a manner. Again, the WHO term is from the MeSH and NCI sources, while the full GBM spell-out is from CSP. None are from the source named WHO (for adverse drugs). See http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html The WHO classification scheme does not have glioblastoma multiforme at all, just glioblastoma. Hence there cannot be a hierarchical relationship in that ontology. Check the paper on the latest WHO classification of brain tumours: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1929165/ Or check the definition from the American Brain Tumor Association's Tumor Types page: http://www.abta.org/brain-tumor-information/types-of-tumors/ "Astrocytoma Grade IV (also called Glioblastoma, previously named “Glioblastoma Multiforme,” “Grade IV Glioblastoma,” and “GBM”) - There are two types of astrocytoma grade IV: primary, or de novo, and secondary. Primary tumors are very aggressive and the most common form of astrocytoma grade IV. The secondary tumors are those which originate as a lower-grade tumor and evolve into a grade IV tumor." Keep in mind that the umls is a living document and corrections are made all the time - it is not flawless and this might be a case that should be reported. "In the regular pipeline, the concept array of gbm contains the CUI of Glioblastoma only, while in the fast pipeline, the concept array of GBM contains the CUIs of both Glioblastoma and glioblastoma Multiforme." Another thing to keep in mind is that the regular pipeline does not always provide the best discoveries. 
In this case, if it is not giving you glioblastoma multiforme for GBM then it is providing incomplete information - as glioblastoma multiforme is exactly what GBM stands for and that cui should be provided when gbm is discovered. Otherwise, if a researcher (possibly more inclined to use ...multiforme than a clinician) is searching for the ...multiforme cui then they will not find what they are looking for and may think that a gbm does not exist. I hope that this clears the air, Sean -Original Message- From: Oranit Dror [mailto:ora...@algotec.co.il] Sent: Monday, June 29, 2015 4:44 AM To: dev@ctakes.apache.org Subject: RE: The fast dictionary pipeline vs. the regular one Hi, Thank you all for the detailed replies. Per the Glioblastoma and Glioblastoma Multiforme terms, I have contacted NLM with my question and their answer was as follows: Each is the Preferred Term in at least one of the 150 sources in the Metathesaurus. Neither is from a WHO vocabulary source. The terms are related in that Glioblastoma is the Broader term (RB) of the 2 and Glioblastoma Multiforme is the Narrower term (RN). In the regular pipeline, the concept array of gbm contains the CUI of Glioblastoma only, while in the fast pipeline, the concept array of GBM contains the CUIs of both Glioblastoma and glioblastoma Multiforme. Best, Oranit. -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Monday, June 22, 2015 5:13 PM To: dev@ctakes.apache.org Subject: RE: The fast dictionary pipeline vs. the regular one Hi all, I’m glad that there continues to be interest in the fast alternative to the dictionary lookup and I welcome all testing. GBM actually is Glioblastoma Multiforme – hence the “M”. The WHO name is the abbreviated “Glioblastoma”, but they are actually not (as far as I can discern) different things. If you check the metathesaurus 2011ab, GBM brings up both Glioblastoma C0017636 and Glioblastoma Multiforme C1621958. 
The first comes from MeSH and NCI, the second from CSP. If you look at the definitions they are synonymous: “malignant form of astrocytoma histologically characterized by pleomorphism of cells, nuclear atypia, microhemorrhage and necrosis; may arise in any region of the central nervous system, with a predilection for the cerebral hemispheres, basal ganglia, and commissural pathways.” Mapping to a different CUI in the UMLS does not always mean that they are truly different concepts. It often means that they came from 2 different source dictionaries (such as in this case). Also check https://en.wikipedia.org/wiki/Glioblastoma_multiforme But I am a little confused: are you saying that you got
RE: how to run i2b2 data
Hi Justin, If you check out the source code, you should be able to find that class in the ctakes-core component. Sean -Original Message- From: Justin Zhang [mailto:justinzhang...@gmail.com] Sent: Friday, August 07, 2015 10:45 AM To: dev@ctakes.apache.org Subject: Re: how to run i2b2 data Thanks Sean for your understanding, and I am in hope now. Where is the best place to start looking in order to create a collection reader that works similarly to org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader? Justin On Wed, Aug 5, 2015 at 7:24 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: Hi Justin, A shot in the dark: You could create a collection reader that works similarly to org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader , but instead of grabbing all of the files in a directory it grabs all the records parsed from a single .xml and runs a pipeline per record. Basically, swap a directory for an .xml, a text file for an xml element containing a record. Somebody out there might have something that already does as much. Sean -Original Message- From: Justin Zhang [mailto:justinzhang...@gmail.com] Sent: Wednesday, August 05, 2015 6:40 PM To: u...@ctakes.apache.org; dev@ctakes.apache.org Subject: how to run i2b2 data Hello everyone, I am running ctakes with i2b2 data https://www.i2b2.org/NLP/DataSets/Main.php In each xml file, there are multiple patient records. I am able to separate each patient into single files and process them with runCPE.sh. Is there a way to convert this single xml file into a format ctakes accepts, process it as a single input file, and generate a single output file (results labelled by patient id)? For example, each patient id has a smoking status. Thanks, -- Justin -- Justin
RE: how to run i2b2 data
Hi Justin, A shot in the dark: You could create a collection reader that works similarly to org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader , but instead of grabbing all of the files in a directory it grabs all the records parsed from a single .xml and runs a pipeline per record. Basically, swap a directory for an .xml, a text file for an xml element containing a record. Somebody out there might have something that already does as much. Sean -Original Message- From: Justin Zhang [mailto:justinzhang...@gmail.com] Sent: Wednesday, August 05, 2015 6:40 PM To: u...@ctakes.apache.org; dev@ctakes.apache.org Subject: how to run i2b2 data Hello everyone, I am running ctakes with i2b2 data https://www.i2b2.org/NLP/DataSets/Main.php In each xml file, there are multiple patient records. I am able to separate each patient into single files and process them with runCPE.sh. Is there a way to convert this single xml file into a format ctakes accepts, process it as a single input file, and generate a single output file (results labelled by patient id)? For example, each patient id has a smoking status. Thanks, -- Justin
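The core of Sean's "shot in the dark", treating each record element of a single .xml the way FilesInDirectoryCollectionReader treats each file, is the step that splits the XML into (patient id, note text) pairs. A rough sketch of just that step follows, using the JDK's DOM parser. The element and attribute names are passed in as parameters because the real i2b2 layout is not given in the thread; RecordSplitter is a hypothetical class, and wiring the records into a UIMA CollectionReader is deliberately left out.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not a cTAKES class.
class RecordSplitter {

    // Returns record id -> note text, in document order.
    // recordTag, idAttribute and textTag are assumptions supplied by the caller.
    static Map<String, String> splitRecords(String xml, String recordTag,
                                            String idAttribute, String textTag) {
        Map<String, String> records = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(recordTag);
            for (int i = 0; i < nodes.getLength(); i++) {
                Element record = (Element) nodes.item(i);
                NodeList texts = record.getElementsByTagName(textTag);
                String text = texts.getLength() == 0
                        ? ""
                        : texts.item(0).getTextContent().trim();
                records.put(record.getAttribute(idAttribute), text);
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse records", e);
        }
        return records;
    }
}
```

A custom collection reader could iterate over this map, put each text into a fresh CAS, and carry the id along as the document id so that output stays labelled by patient. For very large files a streaming parser would be a better fit than loading the whole document into memory.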
RE: Cannot resolve lookup descriptor files for UmlsDictionaryLookupAnnotator
Hi Jakob, The LookupDesc.xml file is supposed to be editable by the user in order to enter umls username and password information. If the file was in a resource .jar that would be pretty difficult. Umls user information can also be specified on the command line, so perhaps the whole .xml scenario should be rethought. It could easily be changed as long as users all agree to stick to the command-line umls user specification only. Do you feel like submitting a JIRA item? Sean -Original Message- From: Jakob Rogstadius [mailto:jakob.rogstad...@who-umc.org] Sent: Monday, July 20, 2015 4:41 AM To: dev@ctakes.apache.org Subject: RE: Cannot resolve lookup descriptor files for UmlsDictionaryLookupAnnotator Hi Sean, Thanks for your response. I had to work on something else for a couple of days, but now I'm back at it. As you say, I get UmlsDictionaryLookupAnnotator to work when I manually copy the files from the subversion repository to my local project. What I have now looks like this: project-name project-name/src/main/java/... project-name/data/... project-name/resources/... project-name/org/apache/ctakes/dictionary/lookup/... (this folder was copied from cTakes svn and is where LookupDesc.xml and the others files are located) However, this doesn't seem like the right approach at all. The other cTakes components that I have tried using have all imported neatly as jars from Maven central, together with their -res jars which contain the descriptor files and other resources that they reference. At no point have I previously downloaded the source project from the SVN server, and everything except the UMLS dictionary lookups have worked this way. I am confused. You say that the -res jars are not supposed to contain these files, but then what are they supposed to contain? As I mentioned below, the current -res jar for UmlsDictionaryLookupAnnotator has no content, except for META-INF. And is this really the only way I can get the components to work? What am I missing? 
In case it matters, I instantiate the annotator as follows using uimaFit:

    AggregateBuilder aggregate = new AggregateBuilder();
    ...
    aggregate.add(UmlsDictionaryLookupAnnotator.createAnnotatorDescription());
    ...
    AnalysisEngine aggregateEngine = aggregate.createAggregate();
    ...
    SimplePipeline.runPipeline(reader, aggregateEngine, writer, evaluator);

Best regards,
Jakob

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: 10 July 2015 18:29
To: dev@ctakes.apache.org
Subject: RE: Cannot resolve lookup descriptor files for UmlsDictionaryLookupAnnotator

Hi Jakob,

The -res jars aren't supposed to contain those files. The files should be placed in the resources/ directory under the ctakes root, parallel to lib/.

Can you take me through your checkout / installation and build / run steps? A list of your svn and maven commands might help me figure out what step is failing you.

Sean

-Original Message-
From: Jakob Rogstadius [mailto:jakob.rogstad...@who-umc.org]
Sent: Friday, July 10, 2015 3:04 AM
To: dev@ctakes.apache.org
Subject: RE: Cannot resolve lookup descriptor files for UmlsDictionaryLookupAnnotator

Hi Sean,

Many thanks for your reply. Like you say, I see both the lookup descriptors and all other resources in the projects on the svn server (https://svn.apache.org/repos/asf/ctakes/trunk/). However, the -res jars that I get through maven are completely empty, except for their META-INF folders. For other components, their -res jars do contain their resources as expected. Could something have gone wrong while publishing recent versions of these two?
These are my relevant maven imports:

    <dependency>
      <groupId>org.apache.ctakes</groupId>
      <artifactId>ctakes-dictionary-lookup</artifactId>
      <version>3.2.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.ctakes</groupId>
      <artifactId>ctakes-dictionary-lookup-res</artifactId>
      <version>3.2.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.ctakes</groupId>
      <artifactId>ctakes-dictionary-lookup-fast</artifactId>
      <version>3.2.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.ctakes</groupId>
      <artifactId>ctakes-dictionary-lookup-fast-res</artifactId>
      <version>3.2.2</version>
    </dependency>

Jar content: http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.ctakes/ctakes-dictionary-lookup-res/3.2.1/
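
One way to confirm what a downloaded -res jar actually contains, as discussed in this thread, is to list its entries and ignore everything under META-INF. The following is only a minimal sketch using the standard java.util.jar API; the class name ResJarCheck and its methods are hypothetical helpers, not part of cTAKES, and the jar path is whatever location your local Maven repository uses.

```java
import java.io.IOException;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical helper (not part of cTAKES): lists what a jar holds besides META-INF.
class ResJarCheck {

    // True for file entries outside META-INF/; directory entries end with "/".
    static boolean isResource(String entryName) {
        return !entryName.endsWith("/") && !entryName.startsWith("META-INF/");
    }

    // Returns the names of all resource entries found in the jar at jarPath.
    static List<String> resourceEntries(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream()
                      .map(JarEntry::getName)
                      .filter(ResJarCheck::isResource)
                      .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ResJarCheck <path-to-jar>");
            return;
        }
        List<String> names = resourceEntries(args[0]);
        if (names.isEmpty()) {
            System.out.println("No resources outside META-INF/ in " + args[0]);
        } else {
            names.forEach(System.out::println);
        }
    }
}
```

Running it against each of the -res jars from the dependency list above would show whether the descriptor files are packaged in them or only META-INF is present.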
RE: Invalid UMLS License
Hi Justin,

The UMLS licensing issue has been resolved:
https://issues.apache.org/jira/browse/CTAKES-359

Any version built after May 12th 2015 should have the fix.

Sean

-Original Message-
From: Justin Zhang [mailto:justinzhang...@gmail.com]
Sent: Sunday, July 26, 2015 9:21 AM
To: u...@ctakes.apache.org; dev@ctakes.apache.org
Subject: Invalid UMLS License

Hello Everyone and Sir Miller, Timothy

Has the UMLS license issue discussed in the following link been resolved?
http://mail-archives.apache.org/mod_mbox/ctakes-user/201505.mbox/%3CE084D8EFE2B03A408B324458C5212E945305DD21@CHEXMBX3B.CHBOSTON.ORG%3E

--
Thanks,
Justin
RE: Invalid UMLS License
VBG VBN VBP VBZ WDT WP WPS WRB
27 Jul 2015 10:39:20 INFO AbstractJCasTermAnnotator - Using minimum term text span: 3
27 Jul 2015 10:39:21 INFO DictionaryDescriptorParser - Parsing dictionary specifications: /Users/justin/App/eclipse_mars/workspace_eclipse_mars/ctakes/ctakes-dictionary-lookup-fast-res/target/classes/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml
27 Jul 2015 10:39:21 INFO UmlsUserApprover - Checking UMLS Account at https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser for user zhangjustin -Dctakes.umlspw=20aug10! -Djava.util.logging.config.file=/Logger.properties: .
27 Jul 2015 10:39:21 ERROR UmlsUserApprover - UMLS Account at https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser is not valid for user myuseraccount -Dctakes.umlspw= -Djava.util.logging.config.file=/Logger.properties with CHANGEME

On Mon, Jul 27, 2015 at 8:32 AM, Finan, Sean sean.fi...@childrens.harvard.edu wrote:

Hi Justin,

The UMLS licensing issue has been resolved:
https://issues.apache.org/jira/browse/CTAKES-359

Any version built after May 12th 2015 should have the fix.
Sean

-Original Message-
From: Justin Zhang [mailto:justinzhang...@gmail.com]
Sent: Sunday, July 26, 2015 9:21 AM
To: u...@ctakes.apache.org; dev@ctakes.apache.org
Subject: Invalid UMLS License

Hello Everyone and Sir Miller, Timothy

Has the UMLS license issue discussed in the following link been resolved?
http://mail-archives.apache.org/mod_mbox/ctakes-user/201505.mbox/%3CE084D8EFE2B03A408B324458C5212E945305DD21@CHEXMBX3B.CHBOSTON.ORG%3E

--
Thanks,
Justin

--
Justin
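
In the log above, the reported user name includes the text of the following -D arguments, which suggests (though the thread does not confirm it) that several VM arguments were entered as a single value. A quick sanity check on the values the JVM actually received can help rule this out. This is only a sketch: the property name ctakes.umlspw appears in the log, ctakes.umlsuser is assumed to be its counterpart, and the class UmlsPropsCheck is a hypothetical helper, not part of cTAKES.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of cTAKES): flags suspicious credential property values.
class UmlsPropsCheck {

    // Returns a human-readable list of problems; an empty list means nothing looked wrong.
    static List<String> problems(String user, String password) {
        List<String> out = new ArrayList<>();
        check("user", user, out);
        check("password", password, out);
        return out;
    }

    private static void check(String label, String value, List<String> out) {
        if (value == null || value.trim().isEmpty()) {
            out.add(label + " is not set");
            return;
        }
        // A value containing " -D" usually means two VM arguments were merged into one.
        if (value.contains(" -D")) {
            out.add(label + " appears to contain another -D argument");
        }
    }

    public static void main(String[] args) {
        // ctakes.umlsuser is an assumed property name; ctakes.umlspw is taken from the log.
        List<String> found = problems(System.getProperty("ctakes.umlsuser"),
                                      System.getProperty("ctakes.umlspw"));
        if (found.isEmpty()) {
            System.out.println("UMLS properties look well-formed");
        } else {
            found.forEach(System.out::println);
        }
    }
}
```

If the check reports a merged argument, entering each -D option as a separate VM argument in the run configuration is the first thing to try.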