I am starting to think that the pipelines don't quite work the way I thought, but I have not yet had a chance to run down what is going on. Will keep you posted, or happy to work together at your convenience.
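For anyone skimming the quoted thread below: the multi-pipeline experiments under discussion amount to giving each CPE processing thread its own annotator instance (what the `processingUnitThreadCount` setting is meant to arrange). A minimal, self-contained sketch of that one-instance-per-thread pattern, with entirely hypothetical class names (this is not actual cTAKES or UIMA code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a pipeline component. The property that matters
// here: all of its working state is per-instance, and each worker thread
// gets its OWN instance, so instances are never shared across threads.
class TinyAnnotator {
    private final StringBuilder scratch = new StringBuilder(); // instance state, not static

    String annotate(String doc) {
        scratch.setLength(0);
        scratch.append(doc.toUpperCase());
        return scratch.toString();
    }
}

public class MultiPipelineSketch {
    public static List<String> run(List<String> docs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // One annotator per worker thread; no instance is ever shared.
        ThreadLocal<TinyAnnotator> perThread = ThreadLocal.withInitial(TinyAnnotator::new);
        List<Future<String>> futures = new ArrayList<>();
        for (String doc : docs) {
            futures.add(pool.submit(() -> perThread.get().annotate(doc)));
        }
        List<String> out = new ArrayList<>();
        for (Future<String> f : futures) {
            out.add(f.get()); // collect in submission order
        }
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(List.of("note one", "note two"), 2));
    }
}
```

The design point, echoed throughout the thread: this pattern is only safe if every component reached from a worker thread keeps its state per-instance; a single static mutable field anywhere in the stack (as reported for LVG below) breaks the isolation no matter how the threads are arranged.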
On Wednesday, January 23, 2013, Kim Ebert wrote:

> Karthik,
>
> I was wondering if you have had any success in combining the patches? Was
> the output equivalent?
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 01/17/2013 01:15 PM, Karthik Sarma wrote:
>
>> Hi,
>>
>> Thanks Kim! I've been working on something similar myself, so I'll just
>> go ahead and combine the patches today and do some preliminary testing on
>> one of my datasets to see if the output is equivalent.
>>
>> Vijay -- that's quite interesting. I'm pretty sure I'm not actually using
>> any Lucene... I'm using DictionaryLookupAnnotatorDB configured for a
>> local tokenized UMLS install (with a SNOMED map table), and I've even
>> gone so far as to comment out everything related to the Lucene
>> RxNorm/Orange Book dictionaries in both that file as well as
>> LookupDesc_Db, even though I'm pretty sure that those dictionaries are
>> tiny. Even so, my footprint is above 2GB. I'll have to take a look to see
>> if Pei is right about the models chewing up all the memory.
>>
>> I suppose that one possibility is that for some reason using the "UMLS"
>> pipeline (with the web API) instead of the DB pipeline (with a local
>> install) has a much smaller memory footprint. I've found that using the
>> UMLS pipeline slows things down considerably for me, presumably because
>> the limiting factor becomes the web API throughput. Running a bunch of
>> them at once would certainly mitigate this factor, but I would think that
>> running a bunch against a local DB would be faster still.
>>
>> Karthik
>>
>> --
>> Karthik Sarma
>> UCLA Medical Scientist Training Program Class of 20??
>> Member, UCLA Medical Imaging & Informatics Lab
>> Member, CA Delegation to the House of Delegates of the American Medical
>> Association
>> [email protected]
>> gchat: [email protected]
>> linkedin: www.linkedin.com/in/ksarma
>>
>> On Thu, Jan 17, 2013 at 9:43 AM, Kim Ebert <[email protected]> wrote:
>>
>>> Hi Sarma and Pei,
>>>
>>> It appears LVG is using static variables for basic string functions.
>>>
>>> I've attached a patch that may allow multiple instances to be run in
>>> parallel; however, the library is still not thread-safe, i.e. you can't
>>> have multiple threads using the same instance.
>>>
>>> I haven't done adequate testing to see if this solves the entire
>>> problem, so use at your own risk.
>>>
>>> The source code this patch applies to is available here:
>>>
>>> http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2010/release/lvg2010.tgz
>>>
>>> Let me know how this works for you.
>>>
>>> Thanks,
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 01/16/2013 06:50 PM, Chen, Pei wrote:
>>>
>>>> Hi Sarma,
>>>> I encountered the same issue(s) with LVG with multiple threads in the
>>>> same JVM process. We've been scaling out by spawning off multiple
>>>> pipelines in different processes.
>>>> However, it would be interesting to identify which components are not
>>>> thread-safe and take advantage of spawning multiple components in the
>>>> same process.
>>>> Another area for optimization, as you pointed out, is the mem footprint.
>>>> It would be good if someone has a chance to profile the mem usage and
>>>> see if we could lower the footprint -- my initial hunch is that all of
>>>> the models are loaded into memory as a cache.
>>>> If you're interested, feel free to open a Jira so it could be tracked
>>>> and you could get credit for the contributions.
>>>> -Pei
>>>>
>>>> On Jan 16, 2013, at 5:49 PM, "Karthik Sarma" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I know that the official position is that cTAKES is not thread-safe.
>>>>> I'm wondering, however, if anyone has looked into using multiple
>>>>> processing pipelines (via the processingUnitThreadCount directive in a
>>>>> CPE descriptor) and documenting where the thread safety problems lie.
>>>>>
>>>>> I've given it a bit of a try, and on first glance the biggest issue
>>>>> seems to be in the LVG API, which isn't at all thread-safe (they seem
>>>>> to claim that it would be thread-safe so long as API instances are not
>>>>> shared, but that doesn't seem prima facie true, since it throws errors
>>>>> when multiple pipelines are used, which *should* be creating multiple
>>>>> LVG API instances).
>>>>>
>>>>> I haven't found any other serious issues, but perhaps you folks might
>>>>> be familiar with some.
>>>>>
>>>>> There is, of course, the memory issue -- cTAKES' memory footprint
>>>>> alone on my machine with a single pipeline and using a MySQL UMLS
>>>>> database is over 2GB; this is presumably the cost of each pipeline,
>>>>> though I can't actually figure out what all that memory is being used
>>>>> for, since none of the in-memory DBs and indexes used seem to be
>>>>> anywhere near that size.
>>>>> It is, of course, possible to split datasets and simply run multiple
>>>>> processes, but my feeling is that there must be a lot of unnecessary
>>>>> overhead there, since all the operations we actually do (other than
>>>>> the CAS consumers) are read-only. It seems to me that cTAKES ought to
>>>>> be limited only by disk/memory throughput and total CPU capacity
>>>>> because of the nature of the load...
>>>>>
>>>>> Anyway, if anyone else has thoughts, I'd be interested. This is
>>>>> something I'd be interested in taking a stab at resolving, since I've
>>>>> been poking around in this direction behind the scenes for some time
>>>>> now. My group has access to huge databases but limited computational
>>>>> resources, and I'd like to make the most of what we've got!
>>>>>
>>>>> Karthik
>>>>>
>>>>> --
>>>>> Karthik Sarma
>>>>> UCLA Medical Scientist Training Program Class of 20??
>>>>> Member, UCLA Medical Imaging & Informatics Lab
>>>>> Member, CA Delegation to the House of Delegates of the American
>>>>> Medical Association
>>>>> [email protected]
>>>>> gchat: [email protected]
>>>>> linkedin: www.linkedin.com/in/ksarma

-- Sent from Gmail Mobile
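Kim's diagnosis earlier in the thread (LVG keeping state in static variables, so even separate instances interfere) and the caveat that the patched library is still unsafe when one instance is shared across threads can both be illustrated with a short, self-contained sketch. The class names are hypothetical; this is not actual LVG source:

```java
// The failure mode: "static" means ONE buffer shared by every instance of the
// class, so two pipelines with their own instances still trample each other.
class BrokenNormalizer {
    private static final StringBuilder shared = new StringBuilder();

    void begin(String term) {
        shared.setLength(0);
        shared.append(term.trim());
    }

    String finish() {
        return shared.toString().toLowerCase();
    }
}

// The shape of the fix in the attached patch: make the state per-instance.
// Separate pipelines, each holding its own instance, no longer interfere.
// A single instance is still not safe to share across threads, matching
// the caveat in the patch email.
class FixedNormalizer {
    private final StringBuilder buf = new StringBuilder();

    void begin(String term) {
        buf.setLength(0);
        buf.append(term.trim());
    }

    String finish() {
        return buf.toString().toLowerCase();
    }
}

public class StaticStateDemo {
    public static void main(String[] args) {
        // Two "pipelines", each with its own BrokenNormalizer instance:
        BrokenNormalizer b1 = new BrokenNormalizer();
        BrokenNormalizer b2 = new BrokenNormalizer();
        b1.begin("Aspirin ");
        b2.begin("Warfarin");            // clobbers b1's work via the static buffer
        System.out.println(b1.finish()); // "warfarin" -- b1 expected "aspirin"

        // Same interleaving with per-instance state:
        FixedNormalizer f1 = new FixedNormalizer();
        FixedNormalizer f2 = new FixedNormalizer();
        f1.begin("Aspirin ");
        f2.begin("Warfarin");
        System.out.println(f1.finish()); // "aspirin" -- instances are independent
    }
}
```

Note the corruption above is shown with a deterministic single-threaded interleaving; under real concurrency the same shared buffer produces the intermittent errors described in the thread, which are far harder to reproduce.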
