OK, I got the answer to my question (“Do we exclude free text fields?”), which
is “Yes.” Thx.
Now, your questions: Our basic ETL to get UW NAACCR data into the database is
in Informatica, and mostly resolves differences between UW field names and
official NAACCR field names. I also do the de-identification step in
Informatica. From then on I largely follow the naaccr_txform script. I used
your code to include/exclude NAACCR sections and items based on numerous
criteria, but I don’t believe I see item format of “free text” as one of the
criteria. So my fields included 310 and 320, which I can easily fix.
I did vary somewhat in an attempt to build in some flexibility. Results are
essentially the same as your system, so this may be largely academic, but you
asked, so:
Where I varied was an attempt to set up my NAACCR metadata in such a way that I
wouldn’t have to hard code which NAACCR sections and items were included in the
EAV. To do so I added a logical “ONTOLOGY” field to the section and item
tables like so (apologies for formatting):
DROP TABLE naaccr_tblSection ;
CREATE TABLE naaccr_tblSection
(
Section_ID integer not null primary key
, Section varchar(46)
, SectionPathName varchar(212)
, SECTION_ONTOLOGY varchar(1)
, SectionBaseCode varchar(46)
);
DROP TABLE naaccr_tblItem ;
CREATE TABLE naaccr_tblItem
(
ItemID integer not null primary key,
ItemNbr integer,
ItemName varchar(512),
AllowValue
varchar(204),
ItemPathName varchar(212),
FieldLength integer,
SectionID integer,
ItemFormat varchar(125),
Item_Ontology varchar(1)
, ItemBaseCode
varchar(212)
);
DROP TABLE naaccr_tblCode ;
CREATE TABLE naaccr_tblCode
(
CodeID integer not null primary key,
ItemID integer,
CodeNbr varchar(212),
CodePath varchar(212),
CodeDcrp varchar(198)
);
/* To see what NAACCR items are included in our extract, run: */
create or replace view vw_i2b2_naaccr_active_items as
(select s.section_id, s.section, i.itemid, i.itemnbr, i.itemname,
i.itempathname, i.ITEMFORMAT
from naaccr_tblitem i
join naaccr_tblsection s
on i.SECTIONID = s.SECTION_ID
where i.ITEM_ONTOLOGY = 'Y'
AND s.SECTION_ONTOLOGY = 'Y') ;
After I load the NAACCR tables, I set the initial value for the ONTOLOGY fields
according to the criteria in naaccr_txform.sql.
So when creating my extract_eav equivalent, instead of the hard coded section
selection
“and ns.SectionID in (
1 -- Cancer Identification
, 2 – Demographic
, etc…”
I select sections and items like so:
“ from
naaccr_tblitem i,
naaccr_tblsection s
where i.sectionID = s.section_ID
and s.section_id is not null
and s.section_ontology = 'Y'
and i.item_ontology = 'Y'
and i.FIELDLENGTH is not null
and i.ITEMBASECODE is not null -- another way of excluding a field/item
order by s.section_id, i.ITEMNBR “
So, the main advantage is I can select a full section while excluding selected
items in the section.
From: Dan Connolly [mailto:[email protected]]
Sent: Monday, March 09, 2015 9:25 AM
To: Lenon Patrick; [email protected]
Subject: RE: NAACCR - Free text fields
Perhaps you could back up and explain your overall approach to the NAACCR ETL?
What code are you using? Where can I look at it? Perhaps you've discussed this
before, but I don't see any pointers to context in this message. Did you try
the HERON code? If not, why not? If so, what happened when you tried?
As to this specific question, it's documented on the
TumorRegistry<https://informatics.kumc.edu/work/wiki/TumorRegistry> page:
We reviewed the data we get by section to eliminate potentially sensitive data,
including free-text; the sections with a -- below are not loaded into HERON:
followed the relevant code excerpt from
source:heron_load/naaccr_txform.sql#L67<https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql#L67><https://informatics.kumc.edu/work/export/HEAD/heron_load/naaccr_txform.sql#L67>
--
Dan
________________________________
From:
[email protected]<mailto:[email protected]>
[[email protected]] on behalf of Lenon Patrick
[[email protected]]
Sent: Monday, March 09, 2015 9:17 AM
To: [email protected]<mailto:[email protected]>
Subject: NAACCR - Free text fields
In trying to ensure all my Tumor Registry fact table items have corresponding
concept codes in the Concept Dimension, I found NAACCR item 310 (Text-Usual
Occupation) which has a format of “Free text.” As you’d expect there are no
entries in the NAACCR metadata for that. However, following the Heron fact
load code, I created a whole bunch of facts with concept codes like
“NAACCR|310:(n)TH GRADE TEACHER – (small Wisconsin town) SCHOOL DISTRICT”
To my semi-trained eye this looks like it would be pretty useless to I2B2. So
I’m wondering what other sites do in similar situations. Possibilities that
have occurred to me already are:
1) Exclude all “free text” format fields from the fact load.
2) Leave them in, hoping for codification someday
Is there any reason NOT to exclude free text fields? Or some criteria to
include some and exclude others?
Patrick Lenon
HIMC Informatics Specialist
608 890 5671
_______________________________________________
Gpc-dev mailing list
[email protected]
http://listserv.kumc.edu/mailman/listinfo/gpc-dev