Greetings ctakes-dev! I have been polishing MedGen (UMLS) dictionaries for over a year now, and I am confident in saying "this is solid". As a reminder, the medgen-mysql package contains a large subset of the UMLS that can be downloaded without a UMLS login, greatly simplifying the creation of an example dictionary.
QUESTION: Would you like me to integrate this into ctakes to simplify installations for new users, and if so, what would be your preferred method?

Source Vocabularies (SAB)
+-------------+--------+
| SourceVocab | cnt    |
+-------------+--------+
| MSH         | 245435 | Medical Subject Headings
| SNOMEDCT_US | 156105 | SNOMED Clinical Terms
| NCI         | 136888 | NCI Cancer Terms
| ...         | ...    |
+-------------+--------+

Semantic Types (STY)
+-------------------------------------------+--------+
| SemanticType                              | cnt    |
+-------------------------------------------+--------+
| Pharmacologic Substance                   | 102511 |
| Finding                                   |  90413 |
| Organic Chemical                          |  81329 |
| Disease or Syndrome                       |  47223 |
| Neoplastic Process                        |  16151 |
| Amino Acid, Peptide, or Protein           |   9383 |
| Congenital Abnormality                    |   6536 |
| Pathologic Function                       |   5655 |
| Steroid                                   |   3919 |
| Sign or Symptom                           |   2909 |
| ...                                       | ...    |
+-------------------------------------------+--------+

What would you like to see?
[email protected]

On Nov 12, 2014, at 6:14 AM, "Dligach, Dmitriy" <[email protected]> wrote:

> Andy, thank you for this resource!
>
> Do you have an estimate of what percentage of UMLS concepts were left out?
>
> Dima
>
>
> On Nov 11, 2014, at 16:02, andy mcmurry <[email protected]> wrote:
>
>> Hello!
>>
>> https://bitbucket.org/invitae/medgen-mysql (Apache Licensed ASL2)
>>
>> We just released a new library containing a huge chunk of UMLS concepts
>> which are available without registering accounts/username/passwords.
>> LEGALLY. Yes, really!
>>
>> The subset is from NCBI and it contains *thousands of concepts from SNOMED
>> and other vocabularies*.
>>
>> The code is essentially
>> 1. a list of WGET targets to various NCBI FTP site mirrors
>> 2. Makefile for building the databases of interest
>>
>> Our legal team has approved distribution for Open Access work, ASL2
>> LICENSE.
>>
>> I recommend we use this opportunity to make this the default distribution
>> for CTAKES UMLS connections, because it obviates the need for so much
>> painful credentialing and back and forth agreements with the US National
>> Library of Medicine.
>>
>> Cheers!
>> --Andy
>>
>>
>> On Wed, Sep 10, 2014 at 12:13 PM, Masanz, James J. <[email protected]>
>> wrote:
>>
>>>
>>> I would love to see the install be as simple as apt-get install to end up
>>> with some working dictionary that have more than a handful of entries to
>>> get them started.
>>>
>>> Regards,
>>> James Masanz
>>>
>>> -----Original Message-----
>>> From: andy mcmurry [mailto:[email protected]]
>>> Sent: Tuesday, September 09, 2014 4:32 PM
>>> To: [email protected]
>>> Subject: Recommendation for ctakes default (UMLS) dictionaries
>>>
>>> Greetings ctakes-dev:
>>>
>>> *UMLS license restrictions have been getting more lax over the years --
>>> *much of the UMLS can be downloaded directly from the NCBI official FTP
>>> site.
>>>
>>> In fact, the NIH (and implicitly the NLM) *have already made the standard
>>> terms public for some medical specialities*.
>>>
>>> For example: Here is the UMLS subset specific to Medical Genetics (MedGen)
>>> and Genetic Testing (GTR) complete with SNOMED-CT concept CUI(s) and names,
>>> etc :
>>>
>>> [ ftp://ftp.ncbi.nlm.nih.gov/pub/medgen/README.html ]
>>>
>>> My team has developed a JVM based wrapper for MetaMap 2013AB which I
>>> intend to open source soon (Clojure). It includes REST support for
>>> invoking MetaMap with any or all of the command line arguments.
>>> We do not integrate with UIMA, we are basically a wrapper around the
>>> binary installation of MetaMap. The emphasis is on publication text not
>>> clinical text, still, some services are common (such as LVG).
>>>
>>> Strangely, the NLM still requires UMLS licenses to download MetaMap
>>> execution binaries.
>>> The MetaMap binary install is better but customizing
>>> dictionaries (DataFileBuilder) is not as easy to use as CTAKES with YTEX
>>>
>>> [ https://cwiki.apache.org/confluence/display/CTAKES/YTEX+Installation ]
>>>
>>> *** Hence, there is a real opportunity here to enable Apache cTAKES to
>>> have a stronger default dictionary. ***
>>>
>>> Imagine if we could
>>> *$ apt-get install apache-ctakes*
>>>
>>> and instantly have a working package for SOME problem domain.
>>> In my case (Medical Genetics) the UMLS definitions are already available
>>> and the UMLS license problem becomes a non issue, at least for many first
>>> time users
>>>
>>> Your thoughts?
>>> AndyMC
>>>
>
