#56: shareable synthetic test data sets: Epic Clarity, i2b2, NAACCR, ...
--------------------------+----------------------------
 Reporter:  dconnolly     |       Owner:  bos
     Type:  enhancement   |      Status:  assigned
 Priority:  minor         |   Milestone:  data-domains3
Component:  data-sharing  |  Resolution:
 Keywords:                |  Blocked By:
 Blocking:                |
--------------------------+----------------------------
Comment (by dconnolly):
While testing NAACCR ETL refinements for #258 and related work, I'm
reminded that we don't have much test data. I'd like to pursue the
idea of characterizing existing data and synthesizing data based on those
characteristics:
 - For nominal data, calculate frequencies and use them to pick random
   values.
 - For numeric measures, assume a normal distribution.
 - For dates, convert each to a number of days since the date of
   diagnosis and treat it as a numeric measure.
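
A minimal sketch of those three steps, using only the Python standard
library. The function names and the toy values are made up for
illustration; none of this is existing ETL code, and fitting a normal
distribution to date offsets is an assumption carried over from the
numeric case.

```python
import random
import statistics
from collections import Counter
from datetime import date, timedelta


def nominal_sampler(values, rng):
    """Return a function that draws values in proportion to observed frequency."""
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return lambda: rng.choices(categories, weights=weights, k=1)[0]


def normal_sampler(values, rng):
    """Return a function that draws from a normal fit (mean, sample stdev).

    Needs at least two observations, since statistics.stdev requires them.
    """
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return lambda: rng.gauss(mu, sigma)


def date_sampler(event_dates, dx_dates, rng):
    """Model a date as a day offset from the date of diagnosis."""
    offsets = [(e - d).days for e, d in zip(event_dates, dx_dates)]
    draw_offset = normal_sampler(offsets, rng)
    return lambda dx: dx + timedelta(days=round(draw_offset()))


# Toy inputs, invented for this example (not real registry data).
rng = random.Random(56)
draw_sex = nominal_sampler(["F", "F", "M", "F"], rng)
draw_size = normal_sampler([12.0, 18.5, 25.0, 30.5], rng)
draw_followup = date_sampler(
    [date(2012, 3, 1), date(2013, 7, 15)],  # observed event dates
    [date(2012, 1, 1), date(2013, 6, 1)],   # matching diagnosis dates
    rng,
)

synthetic = {
    "sex": draw_sex(),
    "size": draw_size(),
    "followup": draw_followup(date(2014, 1, 1)),
}
```

Only the fitted summaries (category counts, mean, standard deviation)
are needed at generation time, so in principle those could be shared
without sharing the source rows. A normal fit can also produce
impossible values, such as negative sizes, so real use would likely
need clamping or a different distribution.
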
For bonus points:
 - Use primary site, stage, age, and sex to influence the probabilities
   of other data.
 - For text, use trigrams and learn about hidden Markov models.
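
A rough sketch of both bonus ideas, again standard library only. All
names, field keys, and sample rows are hypothetical. Note that the text
part is a plain Markov chain over word trigrams, not a hidden Markov
model, and the fallback to overall frequencies for unseen strata is an
assumption of this sketch.

```python
import random
from collections import Counter, defaultdict


def conditional_sampler(rows, given, target, rng):
    """Draw `target` using the frequencies seen within each `given` stratum."""
    by_stratum = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        key = tuple(row[g] for g in given)
        by_stratum[key][row[target]] += 1
        overall[row[target]] += 1

    def draw(context):
        key = tuple(context[g] for g in given)
        # Assumption: fall back to overall frequencies for unseen strata.
        counts = by_stratum.get(key) or overall
        values = list(counts)
        return rng.choices(values, weights=[counts[v] for v in values], k=1)[0]

    return draw


def trigram_model(texts):
    """Map each pair of consecutive words to the words seen following it."""
    model = defaultdict(list)
    for text in texts:
        words = ["<s>", "<s>"] + text.split() + ["</s>"]
        for a, b, c in zip(words, words[1:], words[2:]):
            model[(a, b)].append(c)
    return model


def generate_text(model, rng, max_words=30):
    """Walk the chain from the start markers until an end marker or the cap."""
    a, b = "<s>", "<s>"
    out = []
    while len(out) < max_words:
        c = rng.choice(model[(a, b)])  # requires a model built from >= 1 text
        if c == "</s>":
            break
        out.append(c)
        a, b = b, c
    return " ".join(out)


# Toy rows and notes, invented for this example.
rng = random.Random(56)
rows = [
    {"site": "C50", "sex": "F", "stage": "II"},
    {"site": "C50", "sex": "F", "stage": "II"},
    {"site": "C61", "sex": "M", "stage": "I"},
]
draw_stage = conditional_sampler(rows, ["site", "sex"], "stage", rng)
stage = draw_stage({"site": "C61", "sex": "M"})

model = trigram_model(["no evidence of disease"])
note = generate_text(model, rng)
```

With many conditioning fields the strata get sparse quickly, so binning
age or conditioning on fewer fields at a time is probably necessary. A
trigram model trained on few notes can also reproduce them nearly
verbatim, which matters if the output is meant to be shareable.
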
--
Ticket URL:
<http://informatics.gpcnetwork.org/trac/Project/ticket/56#comment:28>
gpc-informatics <http://informatics.gpcnetwork.org/>
Greater Plains Network - Informatics