Thanks Dan - this is very helpful. We're going to start with a subset (200K encounters w/ related flowsheets) to test the process and performance. More questions may follow as we make our way down the rabbit hole.
Regards, Justin Dale Manager - Research Development and Support University of Minnesota Academic Health Center - Office of Information Systems Phone: 612.624.9747 Fax: 612.624.7458 [email protected] On Wed, Aug 5, 2015 at 11:10 AM, Dan Connolly <[email protected]> wrote: > 1. yes, multiselect was a pain. the relevant HERON code is > epic_flowsheets_multiselect.sql > <https://informatics.kumc.edu/work/browser/heron_load/epic_flowsheets_multiselect.sql> > > We mentioned this in our AMIA 2011 paper > <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3243191/>: > > Measures documented as structured choices allowing multiple responses > (e.g. “Dressing Status” is “clean dry and intact” and “reinforced”) are > stored as a concatenated string that must be separated into individual > observations within i2b2. > > 2. We started with discrete data only as well. > > We're now working on adding de-identified flowsheet notes to HERON > utilizing Medical College of Wisconsin's Unstructured Notes > Deidentification > <https://bitbucket.org/MCW_BMI/unstructured-notes-deidentification/overview> > methodology. Wish us luck... > > 3. Flowsheet observations dominate our i2b2 observation_fact table. 889M > in the McMillan Marsh release > <https://informatics.kumc.edu/work/blog/heron-mcmillan-marsh-update>. Our > typical ETL timing breakdown > <https://informatics.kumc.edu/work/wiki/HeronLoad#heron_load> shows > epic_flowsheets takes 15+ hours (out of something like 40 or 60). > > Our latest release has 939M flowsheet facts; the main flowsheet ETL task > took quite a bit more than 15 hrs: > > 2015-07-18 00:52:16 for 1 day, 11:34:00.379441 1 > load_epic_flowsheets > > Query time performance is pretty good with ~1.5B facts, though we do use > solid state storage and bitmapped indexes (i2b2_facts_index.sql > <https://informatics.kumc.edu/work/browser/heron_load/i2b2_facts_index.sql>). > My off-hand estimate is: 6 to 60seconds is typical, with more complex > queries taking longer. > > We have a real-time "how busy is heron?" page that shows the run-time of > recent queries as well as a performance regression suite that we run at > release time and weekly. I could get details from those if it's important. > > -- > Dan > > ------------------------------ > *From:* [email protected] [[email protected]] on behalf of Justin Dale [ > [email protected]] > *Sent:* Tuesday, August 04, 2015 10:10 PM > *To:* Dan Connolly > *Cc:* Tim Meyer > *Subject:* flowsheets in i2b2 > > Dan, > > Wondering if you had any words of wisdom to share for loading flowsheet > data in i2b2? We've drawn up plans for our environment but wanted to check > with you to see if there were any gotcha moments or valuable > lessons-learned. Couple of the items we need to address: > 1. custom list values for measure values: stored as semi-colon separated > list in the Clarity table when multiple values have been selected. we'll > need to parse each of these values and load as individual facts > 2. we're going to exclude any non-numeric and non custom list measure > values because being able to run a query to just check if the fact exists > or not does not seem valuable. > 3. not sure what to expect from i2b2 performance once we add nearly 3 > billion more rows to the database. considering starting with a subset of > 500K to both test our i2b2 loader and performance. > > Appreciate any information you can provide. Thanks Dan. > > > Regards, > > Justin Dale > Manager - Research Development and Support > University of Minnesota > Academic Health Center - Office of Information Systems > Phone: 612.624.9747 > Fax: 612.624.7458 > [email protected] >
_______________________________________________ Gpc-dev mailing list [email protected] http://listserv.kumc.edu/mailman/listinfo/gpc-dev
