Hi Sarma,

I launch multiple CPE processes in parallel to process large document collections (this is cTAKES 2.5, pre-Apache). Memory for the 'default' pipeline (AggregatePlaintextUMLSProcessor) is not an issue as long as you do *not* use Lucene-based dictionaries. I run 12 CPEs in parallel on a 16-core machine, each with a 1 GB heap, with no memory issues (I could bump that up to 16 CPEs, but it's nice to let the machine have a little breathing space).
That being said, I don't know how memory-intensive the models for coreference resolution and deep parsing are.

-vj

On Wed, Jan 16, 2013 at 8:50 PM, Chen, Pei <[email protected]> wrote:

> Hi Sarma,
> I encountered the same issue(s) with LVG with multiple threads in the
> same JVM process. We've been scaling out by spawning off multiple
> pipelines in different processes.
> However, it would be interesting to identify which components are not
> thread-safe and take advantage of spawning multiple components in the
> same process.
> Another area for optimization, as you pointed out, is the memory
> footprint. It would be good if someone has a chance to profile the
> memory usage and see if we could lower the footprint; my initial hunch
> is that all of the models are loaded into memory as a cache.
> If you're interested, feel free to open a Jira so it could be tracked
> and you could get credit for the contributions.
> -Pei
>
> On Jan 16, 2013, at 5:49 PM, "Karthik Sarma" <[email protected]> wrote:
>
> > Hi folks,
> >
> > I know that the official position is that cTAKES is not thread-safe.
> > I'm wondering, however, if anyone has looked into using multiple
> > processing pipelines (via the processingUnitThreadCount directive in
> > a CPE descriptor) and documenting where the thread safety problems
> > lie.
> >
> > I've given it a bit of a try, and at first glance the biggest issue
> > seems to be in the LVG API, which isn't at all thread-safe (they seem
> > to claim that it would be thread-safe so long as API instances are
> > not shared, but that doesn't seem prima facie true, since it throws
> > errors when multiple pipelines are used, which *should* be creating
> > multiple LVG API instances).
> >
> > I haven't found any other serious issues, but perhaps you folks might
> > be familiar with some.
> >
> > There is, of course, the memory issue -- cTAKES' memory footprint
> > alone on my machine with a single pipeline and using a MySQL UMLS
> > database is over 2 GB; this is presumably the cost of each pipeline,
> > though I can't really figure out what all that memory is being used
> > for, since none of the in-memory DBs and indexes used seem to be
> > anywhere near that size.
> >
> > It is, of course, possible to split datasets and simply run multiple
> > processes, but my feeling is that there must be a lot of unnecessary
> > overhead there, since all the operations we actually do (other than
> > the CAS consumers) are read-only. It seems to me that cTAKES ought to
> > be limited only by disk/memory throughput and total CPU capacity
> > because of the nature of the load...
> >
> > Anyway, if anyone else has thoughts, I'd be interested. This is
> > something I'd be interested in taking a stab at resolving, since I've
> > been poking around in this direction behind the scenes for some time
> > now. My group has access to huge databases but limited computational
> > resources, and I'd like to make the most of what we've got!
> >
> > Karthik
> >
> > --
> > Karthik Sarma
> > UCLA Medical Scientist Training Program Class of 20??
> > Member, UCLA Medical Imaging & Informatics Lab
> > Member, CA Delegation to the House of Delegates of the American
> > Medical Association
> > [email protected]
> > gchat: [email protected]
> > linkedin: www.linkedin.com/in/ksarma
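
The multi-process approach described at the top of this thread (split the collection, then run one CPE per JVM with its own heap) might be sketched roughly as follows. This is only an illustration under stated assumptions, not anyone's actual launcher: the main class `example.RunCpe`, the classpath string, and the per-worker descriptor files `cpe_0.xml`, `cpe_1.xml`, ... are hypothetical placeholders, and each descriptor is assumed to point at its own shard of the input.

```python
import shlex
import subprocess
from pathlib import Path
from typing import List, Sequence


def shard_round_robin(items: Sequence[Path], n_shards: int) -> List[List[Path]]:
    """Deal documents into n_shards lists so each worker gets a similar share."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    shards: List[List[Path]] = [[] for _ in range(n_shards)]
    for i, item in enumerate(items):
        shards[i % n_shards].append(item)
    return shards


def build_command(descriptor: Path, heap: str = "1g") -> List[str]:
    """Build the JVM command line for one CPE worker process."""
    # HYPOTHETICAL: "example.RunCpe" and the classpath string stand in for
    # whatever main class and classpath launch a CPE descriptor locally.
    return [
        "java",
        f"-Xmx{heap}",  # per-process heap cap, e.g. 1 GB as in the message
        "-cp", "ctakes-classpath-placeholder",
        "example.RunCpe",
        str(descriptor),
    ]


def launch_all(commands: Sequence[List[str]], dry_run: bool = True) -> list:
    """Start every worker, then wait for all of them.

    With dry_run=True nothing is executed; the shell-quoted command
    lines are returned instead so they can be inspected first.
    """
    if dry_run:
        return [shlex.join(cmd) for cmd in commands]
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]  # one exit code per worker


if __name__ == "__main__":
    n_workers = 12  # leaves a few cores free on a 16-core machine
    # Assumed file names: one hand-written descriptor per worker.
    descriptors = [Path(f"cpe_{k}.xml") for k in range(n_workers)]
    for line in launch_all([build_command(d) for d in descriptors]):
        print(line)
```

`shard_round_robin` would be used beforehand to decide which documents go into each worker's input directory. Separate processes sidestep the thread-safety question entirely, at the cost of loading the models once per JVM.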
