Alex, Russ, Sravani, Maren,

These are my notes from a November 1 prototyping episode:
Use case: count HERON patients by income group

We can now count, for example, how many patients of each sex are in various 
income groups:

INCOME_GROUP    SEX_CD  COUNT(PATIENT_NUM)
Over $0 f       11xxxx
Over $0 m       11xxxx
Over $35,000    f       13xxxx
Over $35,000    m       11xxxx
Over $42,000    f       18xxxx
Over $42,000    m       15xxxx
Over $52,000    f       10xxxx
Over $52,000    m       80xxx
Over $59,000    f       14xxxx
Over $59,000    m       11xxxx
Over $74,000    f       14xxxx
Over $74,000    m       11xxxx

Those groups come from a figure under ​CBO income growth 
study<https://en.wikipedia.org/wiki/Household_income_in_the_United_States#CBO_income_growth_study>
 in wikipedia and I coded them up as:

create materialized view household_income as
select zcta5, UHD001 dollars
     , case
       when UHD001 < 35000 then 0
       when UHD001 < 42000 then 35000
       when UHD001 < 52000 then 42000
       when UHD001 < 59000 then 52000
       when UHD001 < 74000 then 59000
       else                     74000
       end dollar_group
from acs_zcta_2
;


UHD001 is the ACS variable code for Median household income in the past 12 
months (in 2013 inflation-adjusted dollars). ZCTA5 is how they spell zip code.

acs_zcta_2 is an external table directly on top of the 137M compressed data 
file ge.00_file.dat.gz:

CREATE DIRECTORY GEO_CENSUS_STAGE AS '/d1/geo-census';
create directory staging_tools as '/d1/geo-census/tools';

create table acs_zcta_2 (
  FILEID VARCHAR2(6),
...
  ZCTA5 VARCHAR2(5),
...
  UHC017 INTEGER,
  UHD001 INTEGER,
  UHE001 INTEGER,
...
) organization external (
  type oracle_loader
  default directory geo_census_stage
  access parameters (
    records delimited by newline
    preprocessor staging_tools:'zcat.sh'
    fields lrtrim
    (
      FILEID position (1-6) char(6),
...
      ZCTA5 position (131-5) char(5),
...
      UHC017 position (16029-9) char(9) NULLIF UHC017 = '.',
      UHD001 position (16038-9) char(9) NULLIF UHD001 = '.',
      UHE001 position (16047-9) char(9) NULLIF UHE001 = '.',
...
    )
  )
  location ('ge.00_file.dat.gz')
)
;


It's the 3rd of 6 because Oracle is limited to 1000 columns per table and the 
ACS has some 4000 columns. (4 was probably enough, but MU used 6 so I followed 
suit. I don't have pentaho fired up to review the details of their code yet.)

IOU a copy (or commit) of the code that generates the create table ... 
statement from 
/d1/geo-census/mn-census-data/acs_20135a/index_of_data_fields__acs_20135a.csv. 
I attached the resulting 
​geo_acs_tables.sql<https://informatics.gpcnetwork.org/trac/Project/attachment/ticket/509/geo_acs_tables.sql>
 to 
​GPC:ticket:509<http://informatics.gpcnetwork.org/trac/Project//intertrac/ticket%3A509>,
 though I haven't sent any notice.

Then the query that generates the table above (from NHeronA1) is pretty 
straightforward:

select 'Over' || to_char(hi.dollar_group, '$999,999') income_group, sex_cd, 
count(patient_num)
from nightherondata.patient_dimension pat
join household_income hi on substr(zip_cd, 1, 5) = hi.zcta5
group by sex_cd, hi.dollar_group
having count(patient_num) > 100
order by dollar_group, sex_cd
;


Regarding the IOU, see:

https://github.com/dckc/i2b2-geo/blob/master/geo-rk2.ipynb 
6289b45<https://github.com/dckc/i2b2-geo/commit/6289b453fcb43566b2d402d873a9bbd8091bee9f>
https://github.com/dckc/i2b2-geo/blob/master/geo-rk2.py
linked (indirectly) from 
https://informatics.gpcnetwork.org/trac/Project/ticket/509

--
Dan

_______________________________________________
Gpc-dev mailing list
[email protected]
http://listserv.kumc.edu/mailman/listinfo/gpc-dev

Reply via email to