Hi,

Thanks Kim! I've been working on something similar myself, so I'll just go
ahead and combine the patches today and do some preliminary testing on one
of my datasets to see if the output is equivalent.

Vijay -- that's quite interesting. I'm pretty sure I'm not actually using
any Lucene... I'm using DictionaryLookupAnnotatorDB configured for a local
tokenized UMLS install (with a SNOMED map table), and I've even gone so far
as to comment out everything related to the Lucene RxNorm/Orange Book
dictionaries in both that descriptor and LookupDesc_Db, even though I'm
pretty sure those dictionaries are tiny. Even so, my footprint is
above 2GB. I'll have to take a look to see whether Pei is right about the
models chewing up all the memory.
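
In case it's useful to anyone else, something along these lines should be
enough for that check -- just a rough sketch using plain JDK calls, nothing
cTAKES-specific, and initializePipeline() is a placeholder for whatever
builds your analysis engine from its descriptor:

    // Rough heap measurement around pipeline initialization.
    public class HeapCheck {

        public static void main(String[] args) throws Exception {
            Runtime rt = Runtime.getRuntime();

            System.gc();
            long before = rt.totalMemory() - rt.freeMemory();

            initializePipeline(); // placeholder: descriptor parsing + AE creation

            System.gc();
            long after = rt.totalMemory() - rt.freeMemory();

            System.out.printf("Heap retained by initialization: ~%.1f MB%n",
                    (after - before) / (1024.0 * 1024.0));
        }

        private static void initializePipeline() {
            // stub -- replace with real pipeline construction
        }
    }

Running jmap -histo:live <pid> against a live pipeline should also give a
per-class breakdown, which would show whether the dictionary/model caches
really dominate.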

I suppose that one possibility is that for some reason using the "UMLS"
pipeline (with the web API) instead of the DB pipeline (with a local
install) has a much smaller memory footprint. I've found that using the UMLS
pipeline slows things down considerably for me, presumably because the
limiting factor becomes the web API throughput. Running a bunch of them at
once would certainly mitigate this factor, but I would think that running a
bunch against a local DB would be faster still.
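
For anyone who wants to experiment with the multi-pipeline route in the
meantime, the rough shape of it with the stock UIMA CPE API is below -- just
a sketch, with a made-up descriptor path, and it will of course keep hitting
the LVG thread-safety errors until Kim's patch is applied:

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.collection.CollectionProcessingEngine;
    import org.apache.uima.collection.metadata.CpeDescription;
    import org.apache.uima.util.XMLInputSource;

    public class MultiPipelineRunner {

        public static void main(String[] args) throws Exception {
            // "desc/cpe/MyCpeDescriptor.xml" is a placeholder path.
            CpeDescription cpeDesc = UIMAFramework.getXMLParser()
                    .parseCpeDescription(new XMLInputSource("desc/cpe/MyCpeDescriptor.xml"));

            // Same effect as processingUnitThreadCount="4" in the descriptor:
            // four processing threads, each with its own AE instances.
            cpeDesc.setProcessingUnitThreadCount(4);

            CollectionProcessingEngine cpe =
                    UIMAFramework.produceCollectionProcessingEngine(cpeDesc);
            cpe.process(); // asynchronous; add a StatusCallbackListener to track completion
        }
    }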

Karthik

--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
[email protected]
gchat: [email protected]
linkedin: www.linkedin.com/in/ksarma


On Thu, Jan 17, 2013 at 9:43 AM, Kim Ebert
<[email protected]> wrote:

> Hi Sarma and Pei,
>
> It appears LVG is using static variables for basic string functions.
>
> I've attached a patch that may allow multiple instances to be run in
> parallel; however, the library is still not thread-safe, i.e., you can't have
> multiple threads using the same instance.
>
> I haven't done adequate testing to see if this solves the entire problem,
> so use at your own risk.
>
> The source code this patch applies to is available here:
>
> http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2010/release/lvg2010.tgz
>
> Let me know how this works for you.
>
> Thanks,
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
>
>
> On 01/16/2013 06:50 PM, Chen, Pei wrote:
>
>> Hi Sarma,
>> I encountered the same issue(s) with LVG with multiple threads in the
>> same JVM process. We've been scaling out by spawning off multiple pipelines
>> in different processes.
>> However, it would be interesting to identify which components are
>> not thread safe and take advantage of spawning multiple components in the
>> same process.
>> Another area for optimization, as you pointed out, is the memory footprint.
>> It would be good if someone had a chance to profile the memory usage and see
>> if we could lower the footprint -- my initial hunch is that all of the models
>> are loaded into memory as a cache.
>> If you're interested, feel free to open a Jira so it can be tracked and you
>> can get credit for the contributions.
>> -Pei
>>
>>
>> On Jan 16, 2013, at 5:49 PM, "Karthik Sarma" <[email protected]> wrote:
>>
>>> Hi folks,
>>>
>>> I know that the official position is that cTAKES is not thread-safe. I'm
>>> wondering, however, if anyone has looked into using multiple processing
>>> pipelines (via the processingUnitThreadCount directive in a CPE descriptor)
>>> and documenting where the thread-safety problems lie.
>>>
>>> I've given it a bit of a try, and at first glance the biggest issue seems
>>> to be in the LVG API, which isn't at all thread-safe (they seem to claim
>>> that it would be thread-safe so long as API instances are not shared, but
>>> that doesn't seem prima facie true, since it throws errors when multiple
>>> pipelines are used, which *should* be creating multiple LVG API
>>> instances).
>>>
>>> I haven't found any other serious issues, but perhaps you folks might be
>>> familiar with some.
>>>
>>> There is, of course, the memory issue -- cTAKES' memory footprint alone on
>>> my machine with a single pipeline and a MySQL UMLS database is over 2GB;
>>> this is presumably the cost of each pipeline, though I can't really figure
>>> out what all that memory is being used for, since none of the in-memory DBs
>>> and indexes used seem to be anywhere near that size.
>>>
>>> It is, of course, possible to split datasets and simply run multiple
>>> processes, but my feeling is that there must be a lot of unnecessary
>>> overhead there, since all the operations we actually do (other than the CAS
>>> consumers) are read-only. It seems to me that cTAKES ought to be limited
>>> only by disk/memory throughput and total CPU capacity because of the nature
>>> of the load...
>>>
>>> Anyway, if anyone else has thoughts, I'd be interested. This is something
>>> I'd like to take a stab at resolving, since I've been poking around in this
>>> direction behind the scenes for some time now. My group has access to huge
>>> databases but limited computational resources, and I'd like to make the
>>> most of what we've got!
>>>
>>> Karthik
>>>
>>>
>>> --
>>> Karthik Sarma
>>> UCLA Medical Scientist Training Program Class of 20??
>>> Member, UCLA Medical Imaging & Informatics Lab
>>>
>>> Member, CA Delegation to the House of Delegates of the American Medical
>>> Association
>>> [email protected]
>>> gchat: [email protected]
>>> linkedin: www.linkedin.com/in/ksarma
>>>
>>
