Alex, Russ, Sravani, Maren, These are my notes from a November 1 prototyping episode: Use case: count HERON patients by income group
We can now count, for example, how many patients of each sex are in various income groups: INCOME_GROUP SEX_CD COUNT(PATIENT_NUM) Over $0 f 11xxxx Over $0 m 11xxxx Over $35,000 f 13xxxx Over $35,000 m 11xxxx Over $42,000 f 18xxxx Over $42,000 m 15xxxx Over $52,000 f 10xxxx Over $52,000 m 80xxx Over $59,000 f 14xxxx Over $59,000 m 11xxxx Over $74,000 f 14xxxx Over $74,000 m 11xxxx Those groups come from a figure under CBO income growth study<https://en.wikipedia.org/wiki/Household_income_in_the_United_States#CBO_income_growth_study> in wikipedia and I coded them up as: create materialized view household_income as select zcta5, UHD001 dollars , case when UHD001 < 35000 then 0 when UHD001 < 42000 then 35000 when UHD001 < 52000 then 42000 when UHD001 < 59000 then 52000 when UHD001 < 74000 then 59000 else 74000 end dollar_group from acs_zcta_2 ; UHD001 is the ACS variable code for Median household income in the past 12 months (in 2013 inflation-adjusted dollars). ZCTA5 is how they spell zip code. acs_zcta_2 is an external table directly on top of the 137M compressed data file ge.00_file.dat.gz: CREATE DIRECTORY GEO_CENSUS_STAGE AS '/d1/geo-census'; create directory staging_tools as '/d1/geo-census/tools'; create table acs_zcta_2 ( FILEID VARCHAR2(6), ... ZCTA5 VARCHAR2(5), ... UHC017 INTEGER, UHD001 INTEGER, UHE001 INTEGER, ... ) organization external ( type oracle_loader default directory geo_census_stage access parameters ( records delimited by newline preprocessor staging_tools:'zcat.sh' fields lrtrim ( FILEID position (1-6) char(6), ... ZCTA5 position (131-5) char(5), ... UHC017 position (16029-9) char(9) NULLIF UHC017 = '.', UHD001 position (16038-9) char(9) NULLIF UHD001 = '.', UHE001 position (16047-9) char(9) NULLIF UHE001 = '.', ... ) ) location ('ge.00_file.dat.gz') ) ; It's the 3rd of 6 because Oracle is limited to 1000 columns per table and the ACS has some 4000 columns. (4 was probably enough, but MU used 6 so I followed suit. I don't have pentaho fired up to review the details of their code yet.) IOU a copy (or commit) of the code that generates the create table ... statement from /d1/geo-census/mn-census-data/acs_20135a/index_of_data_fields__acs_20135a.csv. I attached the resulting geo_acs_tables.sql<https://informatics.gpcnetwork.org/trac/Project/attachment/ticket/509/geo_acs_tables.sql> to GPC:ticket:509<http://informatics.gpcnetwork.org/trac/Project//intertrac/ticket%3A509>, though I haven't sent any notice. Then the query that generates the table above (from NHeronA1) is pretty straightforward: select 'Over' || to_char(hi.dollar_group, '$999,999') income_group, sex_cd, count(patient_num) from nightherondata.patient_dimension pat join household_income hi on substr(zip_cd, 1, 5) = hi.zcta5 group by sex_cd, hi.dollar_group having count(patient_num) > 100 order by dollar_group, sex_cd ; Regarding the IOU, see: https://github.com/dckc/i2b2-geo/blob/master/geo-rk2.ipynb 6289b45<https://github.com/dckc/i2b2-geo/commit/6289b453fcb43566b2d402d873a9bbd8091bee9f> https://github.com/dckc/i2b2-geo/blob/master/geo-rk2.py linked (indirectly) from https://informatics.gpcnetwork.org/trac/Project/ticket/509 -- Dan
_______________________________________________ Gpc-dev mailing list [email protected] http://listserv.kumc.edu/mailman/listinfo/gpc-dev
