Karthik,
I was wondering if you have had any success in combining the patches?
Was the output equivalent?
Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/
On 01/17/2013 01:15 PM, Karthik Sarma wrote:
Hi,
Thanks Kim! I've been working on something similar myself, so I'll just go
ahead and combine the patches today and do some preliminary testing on one
of my datasets to see if the output is equivalent.
Vijay -- that's quite interesting. I'm pretty sure I'm not actually using
any Lucene... I'm using DictionaryLookupAnnotatorDB configured for a local
tokenized UMLS install (with a SNOMED map table), and I've even gone so far
as to comment out everything related to the Lucene RXNORM/Orange Book
dictionaries in both that file and LookupDesc_Db, even though I'm
pretty sure that those dictionaries are tiny. Even so, my footprint is
above 2GB. I'll have to take a look to see if Pei is right about the models
chewing up all the memory.
I suppose that one possibility is that for some reason using the "UMLS"
pipeline (with the web API) instead of the DB pipeline (with a local
install) has a much smaller memory footprint. I've found that using the UMLS
pipeline slows things down considerably for me, presumably because the
limiting factor becomes the web API throughput. Running a bunch of them at
once would certainly mitigate this factor, but I would think that running a
bunch against a local DB would be faster still.
Karthik
--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
[email protected]
gchat: [email protected]
linkedin: www.linkedin.com/in/ksarma
On Thu, Jan 17, 2013 at 9:43 AM, Kim Ebert <[email protected]> wrote:
Hi Sarma and Pei,
It appears LVG is using static variables for basic string functions.
I've attached a patch that may allow multiple instances to be run in
parallel; however, the library is still not thread safe, i.e., you can't have
multiple threads using the same instance.
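As a rough illustration of the one-instance-per-thread rule, here is a minimal Java sketch. FakeNormalizer and PerThreadNormalizer are hypothetical names invented for this example, not LVG classes, and the pattern only helps once a component keeps its state in instance fields rather than static ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a non-thread-safe component (NOT the real LVG API).
// Its scratch buffer is an instance field, so separate instances do not interfere.
class FakeNormalizer {
    private final StringBuilder scratch = new StringBuilder();

    String normalize(String term) {
        scratch.setLength(0);
        scratch.append(term.trim().toLowerCase(Locale.ROOT));
        return scratch.toString();
    }
}

class PerThreadNormalizer {
    // Each worker thread lazily creates its own instance; none is ever shared.
    private static final ThreadLocal<FakeNormalizer> LOCAL =
            ThreadLocal.withInitial(FakeNormalizer::new);

    static String normalize(String term) {
        return LOCAL.get().normalize(term);
    }

    // Normalizes all terms on a fixed pool, returning results in input order.
    static List<String> normalizeAll(List<String> terms, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String term : terms) {
                Callable<String> task = () -> normalize(term);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

If the wrapped class still touched static fields, the per-thread instances would collide exactly as before, which is the situation the patch is meant to address.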
I haven't done adequate testing to see if this solves the entire problem,
so use at your own risk.
The source code this patch applies to is available here:
http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2010/release/lvg2010.tgz
Let me know how this works for you.
Thanks,
Kim Ebert
On 01/16/2013 06:50 PM, Chen, Pei wrote:
Hi Sarma,
I encountered the same issue(s) with LVG with multiple threads in the
same JVM process. We've been scaling out by spawning off multiple pipelines
in different processes.
However, it would be interesting to identify which components are not
thread safe, so that we could take advantage of spawning multiple
components in the same process.
Another area for optimization, as you pointed out, is the memory footprint.
It would be good if someone has a chance to profile the memory usage and see
if we could lower the footprint; my initial hunch is that all of the models
are loaded into memory as a cache.
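Short of attaching a full profiler, a first pass at the measurement could look something like the Java sketch below, which reads heap usage before and after a loading step. HeapProbe and Measured are hypothetical helper names made up for this example, and the loader passed in stands for whatever model or dictionary load is under suspicion. System.gc() is only a hint to the JVM, so the delta is a coarse estimate at best.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.function.Supplier;

// Hypothetical holder for a loaded object plus the observed heap change.
class Measured<T> {
    final T value;
    final long deltaBytes;

    Measured(T value, long deltaBytes) {
        this.value = value;
        this.deltaBytes = deltaBytes;
    }
}

class HeapProbe {
    // Currently used heap, as reported by the JVM's memory MXBean.
    static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage().getUsed();
    }

    // Runs the loader and reports how much the used heap changed around it.
    // The result is retained in Measured so it is not collected before the
    // second reading is taken.
    static <T> Measured<T> measure(Supplier<T> loader) {
        System.gc(); // a request only; the JVM may ignore it
        long before = usedHeapBytes();
        T value = loader.get();
        System.gc();
        long after = usedHeapBytes();
        return new Measured<>(value, after - before);
    }
}
```

Wrapping each suspected loading step in HeapProbe.measure and printing deltaBytes would give a rough ranking of which components hold the most memory; a heap dump remains the more reliable tool.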
If you're interested, feel free to open a Jira so it can be tracked and you
can get credit for the contributions.
-Pei
On Jan 16, 2013, at 5:49 PM, "Karthik Sarma" <[email protected]> wrote:
Hi folks,
I know that the official position is that cTAKES is not thread-safe. I'm
wondering, however, if anyone has looked into using multiple processing
pipelines (via the processingUnitThreadCount directive in a CPE descriptor)
and documenting where the thread safety problems lie.
I've given it a bit of a try, and at first glance the biggest issue seems
to be in the LVG API, which isn't at all thread-safe (they seem to claim
that it would be thread-safe so long as API instances are not shared, but
that doesn't seem prima facie true since it throws errors when multiple
pipelines are used, which *should* be creating multiple LVG API instances).
I haven't found any other serious issues, but perhaps you folks might be
familiar with some.
There is, of course, the memory issue -- cTAKES' memory footprint alone on
my machine with a single pipeline and using a MySQL UMLS database is over
2GB; this is presumably the cost of each pipeline, though I can't actually
figure out what all that memory is being used for, since none of the
in-memory DBs and indexes used seem to be anywhere near that size.
It is, of course, possible to split datasets and simply run multiple
processes, but my feeling is that there must be a lot of unnecessary
overhead there since all the operations we actually do (other than the CAS
consumers) are read-only. It seems to me that cTAKES ought to be limited
only by disk/memory throughput and total CPU capacity because of the nature
of the load...
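For reference, the dataset-splitting workaround mentioned above can be sketched in a few lines of Java. ShardPlanner is a hypothetical helper, not part of cTAKES or UIMA; it only decides which documents go to which process, and launching the processes is left out.

```java
import java.util.ArrayList;
import java.util.List;

class ShardPlanner {
    // Distributes items round-robin across shardCount lists so that each
    // separately launched process receives a similar number of documents.
    static <T> List<List<T>> shard(List<T> items, int shardCount) {
        if (shardCount < 1) {
            throw new IllegalArgumentException("shardCount must be >= 1");
        }
        List<List<T>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            shards.get(i % shardCount).add(items.get(i));
        }
        return shards;
    }
}
```

Each resulting list would be handed to its own pipeline process, which is where the per-process memory cost discussed in this thread comes from.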
Anyway, if anyone else has thoughts, I'd be interested. This is something
I'd be interested in taking a stab at resolving, since I've been poking
around in this direction behind the scenes for some time now. My group has
access to huge databases but limited computational resources, and I'd like
to make the most of what we've got!
Karthik