Re: Multiple processing pipelines for cTAKES

vijay garla Thu, 17 Jan 2013 07:31:07 -0800

Hi Sarma,

I launch multiple CPE processes in parallel to process large document
collections (this is cTAKES 2.5, pre-Apache).  Memory for the 'default'
pipeline (AggregatePlaintextUMLSProcessor) is not an issue as long as you
do *not* use Lucene-based dictionaries.  I run 12 CPEs in parallel on a
16-core machine, each with a 1 GB heap, with no memory issues (I could
bump that up to 16 CPEs, but it's nice to leave the machine a little
breathing space).
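
A minimal sketch of this kind of multi-process setup might look like the
following. It is not the script used above. The function names and the
per-shard descriptor file names are hypothetical, and it assumes that each
shard of the collection has its own CPE descriptor and that the UIMA example
runner class org.apache.uima.examples.cpe.SimpleRunCPE and the cTAKES jars
are reachable through the CLASSPATH environment variable. Substitute
whatever launcher you actually use.

```python
import subprocess


def shard_round_robin(items, n_shards):
    """Deal items into n_shards lists so each process gets a similar load."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    shards = [[] for _ in range(n_shards)]
    for i, item in enumerate(items):
        shards[i % n_shards].append(item)
    return shards


def build_commands(descriptors, heap="1g",
                   runner="org.apache.uima.examples.cpe.SimpleRunCPE"):
    """Build one JVM command line per CPE descriptor.

    Assumption: the runner class and the cTAKES jars are found via the
    CLASSPATH environment variable; the runner is passed as a parameter
    so a different launcher class can be swapped in.
    """
    return [["java", "-Xmx" + heap, runner, str(d)] for d in descriptors]


def run_all(commands):
    """Start every process first, then wait for all; return exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


# Example use (hypothetical descriptor names, one per shard of documents):
#   descriptors = ["cpe_shard_%02d.xml" % i for i in range(12)]
#   exit_codes = run_all(build_commands(descriptors, heap="1g"))
```

shard_round_robin can be used to divide the list of input files among the
shards before writing each shard's descriptor. Keeping the number of
processes a little below the core count, as described above, leaves
headroom for the OS and for I/O.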


That being said, I don't know how memory-intensive the models for
coreference resolution and deep parsing are.

-vj


On Wed, Jan 16, 2013 at 8:50 PM, Chen, Pei
<[email protected]> wrote:

> Hi Sarma,
> I encountered the same issue(s) with LVG with multiple threads in the same
> JVM process. We've been scaling out by spawning off multiple pipelines in
> different processes.
> However, it would be interesting to identify which components are not
> thread-safe and take advantage of spawning multiple components in the
> same process.
> Another area for optimization, as you pointed out, is the memory
> footprint. It would be good if someone has a chance to profile the memory
> usage and see if we could lower the footprint. My initial hunch is that
> all of the models are loaded into memory as a cache.
> If you're interested, feel free to open a Jira so it can be tracked and
> you can get credit for the contributions.
> -Pei
>
>
> On Jan 16, 2013, at 5:49 PM, "Karthik Sarma" <[email protected]> wrote:
>
> > Hi folks,
> >
> > I know that the official position is that cTAKES is not thread-safe. I'm
> > wondering, however, if anyone has looked into using multiple processing
> > pipelines (via the processingUnitThreadCount directive in a CPE
> > descriptor) and documenting where the thread safety problems lie.
> >
> > I've given it a bit of a try, and at first glance the biggest issue seems
> > to be in the LVG API, which isn't at all thread-safe (they seem to claim
> > that it would be thread-safe so long as API instances are not shared, but
> > that doesn't seem prima facie true, since it throws errors when multiple
> > pipelines are used, which *should* be creating multiple LVG API
> > instances).
> >
> > I haven't found any other serious issues, but perhaps you folks might be
> > familiar with some.
> >
> > There is, of course, the memory issue: cTAKES' memory footprint alone on
> > my machine with a single pipeline and a MySQL UMLS database is over
> > 2GB. This is presumably the cost of each pipeline, though I can't
> > actually figure out what all that memory is being used for, since none
> > of the in-memory DBs and indexes used seem to be anywhere near that size.
> >
> > It is, of course, possible to split datasets and simply run multiple
> > processes, but my feeling is that there must be a lot of unnecessary
> > overhead there since all the operations we actually do (other than the
> > CAS consumers) are read-only. It seems to me that cTAKES ought to be
> > limited only by disk/memory throughput and total CPU capacity because of
> > the nature of the load...
> >
> > Anyway, if anyone else has thoughts, I'd be interested. This is something
> > I'd be interested in taking a stab at resolving, since I've been poking
> > around in this direction behind the scenes for some time now. My group
> > has access to huge databases but limited computational resources, and
> > I'd like to make the most of what we've got!
> >
> > Karthik
> >
> >
> > --
> > Karthik Sarma
> > UCLA Medical Scientist Training Program Class of 20??
> > Member, UCLA Medical Imaging & Informatics Lab
> > Member, CA Delegation to the House of Delegates of the American Medical
> > Association
> > [email protected]
> > gchat: [email protected]
> > linkedin: www.linkedin.com/in/ksarma
>
