Hi George,

The two variables you mentioned in your message are the ones I've been using; 
the data set we were provided included a single sequence number that was 
populated by NAACCR Item 380 or 560, depending on the participating site. (Some 
sites use Central while others use Hospital Sequence Number.)
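
As an illustration only, here is a minimal Python sketch of how a single sequence number could be filled from whichever item a site reported. The function name, the "NA" placeholder, and the choice to prefer the central value when both are present are assumptions, not a description of how the data set was actually built.

```python
def single_sequence_number(central, hospital, missing="NA"):
    """Return one sequence number from NAACCR Item 380 (central) or 560 (hospital).

    Assumption: a site reports one item or the other, and the unreported
    item carries the placeholder given by `missing`.
    """
    if central != missing:
        return central  # assumed preference: central wins if both are present
    return hospital     # may itself be `missing` if neither was reported


# Hypothetical values, not taken from the registry data.
print(single_sequence_number("01", "NA"))
print(single_sequence_number("NA", "02"))
```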

I'm not certain what you mean about making each tumor unique. To be clear, each 
tumor in the dataset is not given its very own sequence number. The numbers can 
repeat across patients, but they should not repeat within a given patient. 
Therefore, each tumor can be uniquely identified using Study ID and sequence 
number.
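
To make the composite key concrete, here is a rough Python sketch that checks whether any (Study ID, sequence number) pair occurs more than once. The field names `study_id` and `seq_no` and the sample rows are hypothetical.

```python
from collections import Counter


def repeated_keys(records):
    """Return the (study_id, seq_no) pairs that occur more than once."""
    counts = Counter((r["study_id"], r["seq_no"]) for r in records)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical rows: "01" repeats across patients but not within a patient,
# so no pair is repeated and every tumor is uniquely identified.
records = [
    {"study_id": "A001", "seq_no": "01"},
    {"study_id": "A001", "seq_no": "02"},
    {"study_id": "A002", "seq_no": "01"},
]
assert repeated_keys(records) == []
```

Any pair this returns is a candidate duplicate record rather than a distinct tumor.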

Hope this helps...

Brad

-----Original Message-----
From: Kowalski, George [mailto:[email protected]] 
Sent: Tuesday, February 02, 2016 2:34 PM
To: McDowell, Bradley D
Cc: [email protected]
Subject: Re: data issues

Bradley,

When you say there are “Duplicate records (indicated with equivalent 
sequence number”, what field are you basing this on? 
http://www.naaccr.org/Applications/ContentReader/default.aspx?c=9 shows only 
two sequence numbers, neither with enough room to make each tumor unique.

[inline image not preserved]

and

[inline image not preserved]

George Kowalski
414.805.7318 (office) / [email protected]

From: <[email protected]> on behalf of "McDowell, Bradley D" <[email protected]>
Date: Monday, February 1, 2016 at 11:00 AM
To: "[email protected]" <[email protected]>
Subject: FW: data issues

Dan asked me to forward this message to this group:

From: McDowell, Bradley D
Sent: Tuesday, January 26, 2016 11:00 AM
To: Dan Connolly ([email protected])
Cc: Chrischilles, Elizabeth A; Gryzlak, Brian M
Subject: data issues

Hi Dan,

Betsy asked me to provide a list of the issues that have been uncovered so far 
with respect to the oncology registry data. The hope is that we can establish 
tickets for each problem. There are some problems that are easy to describe, 
and some that are not so easy. Some of the easy ones:


·         Missing MCW patient 
(https://informatics.gpcnetwork.org/trac/Project/ticket/453)

·         Duplicate records (indicated with equivalent sequence number for same 
Study ID), some with updated surgery, class of case variables

·         Some duplicate records have dx dates that appear to have been copied 
from last contact date (dx date not the same across duplicates)

·         UMN switched | and : for NAACCR variables (not “*.Descriptor” 
variables)

For the duplicated records, I have put together a spreadsheet that nicely 
illustrates the problem, and I’m happy to share that. We’ll have to transfer it 
via redcap or some other secure means since it contains patient level data.
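
For a sense of how the dx date discrepancy could be flagged without sharing patient-level data, here is a small Python sketch. It assumes records are dicts with hypothetical keys `study_id`, `seq_no`, and `dx_date`; the sample values are invented.

```python
from collections import defaultdict


def dx_date_conflicts(records):
    """Map each (study_id, seq_no) key to its distinct dx dates,
    keeping only keys whose records disagree on dx date."""
    dates = defaultdict(set)
    for r in records:
        dates[(r["study_id"], r["seq_no"])].add(r["dx_date"])
    return {key: sorted(vals) for key, vals in dates.items() if len(vals) > 1}


# Hypothetical rows: the first two share a key but differ in dx date.
rows = [
    {"study_id": "A001", "seq_no": "00", "dx_date": "2013-03-04"},
    {"study_id": "A001", "seq_no": "00", "dx_date": "2015-11-20"},
    {"study_id": "A002", "seq_no": "00", "dx_date": "2013-06-01"},
]
conflicts = dx_date_conflicts(rows)
```

A key can only show more than one distinct date if it appears on more than one record, so the result is limited to duplicated records by construction.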

Regarding the not so easy issues:

·         One problem concerns inconsistencies in coded values. For example, 
gpc_language has four different values for “English”. In general, UIOWA is not 
using the same descriptor values as other sites, and that accounts for most of 
these. It is not the only offender, however. MCW uses a different convention 
for seer_site_breast (as does UIOWA) and Race descriptors are different for 
UIOWA and WISC. These inconsistencies have percolated through to the derived 
GPC variables. I am writing a mapping program to handle this with the registry 
data we have received so far. I’m certainly willing to share what I have if it 
would help you.
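
As a sketch of the kind of mapping involved (not the actual mapping program), the Python below normalizes raw descriptor values to one canonical label and lists anything the mapping does not cover. The raw variants shown for “English” are invented for illustration; the real values would have to come from the registry data.

```python
# Hypothetical raw-to-canonical mapping; keys are lower-cased raw values.
LANGUAGE_MAP = {
    "english": "English",
    "eng": "English",
    "en": "English",
}


def normalize(value, mapping):
    """Return the canonical label for value, or value unchanged if unmapped."""
    return mapping.get(value.strip().lower(), value)


def unmapped(values, mapping):
    """Distinct values the mapping does not cover, for manual review."""
    return sorted({v for v in values if v.strip().lower() not in mapping})


raw = ["English", " ENG ", "en", "Spanish"]
cleaned = [normalize(v, LANGUAGE_MAP) for v in raw]
needs_review = unmapped(raw, LANGUAGE_MAP)
```

Leaving unmapped values untouched, rather than forcing them to a default, keeps site-specific conventions visible until someone decides how to map them.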

·         Another big problem concerns missing values. I have attached a report 
that provides the percentage of missing values, organized by site and variable. 
This illustrates, for example, that UIOWA has no data for the Race 5 variable 
(i.e., 100% of the values equal “NA”; this does NOT reflect cases where a value 
is assigned for the NAACCR code for ‘missing’). It also illustrates some other 
things that we have discussed; for example, sites that reported data for 
central sequence number did not report data for hospital sequence number (and 
vice versa).

o   UIOWA and MCRF appear to have the biggest problems with missing data.
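
A report like this could be approximated with the Python sketch below, which counts only the literal “NA” placeholder as missing and treats a NAACCR code that means ‘missing’ as a reported value. The key names (`site`, `race5`) and the sample rows, including the use of "99" as such a code, are illustrative assumptions.

```python
from collections import defaultdict


def percent_missing(records, variables, site_key="site", missing="NA"):
    """Percentage of records per (site, variable) whose value is `missing`.

    A variable absent from a record is counted as missing as well.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for r in records:
        for var in variables:
            key = (r[site_key], var)
            totals[key] += 1
            if r.get(var, missing) == missing:
                misses[key] += 1
    return {key: 100.0 * misses[key] / totals[key] for key in totals}


# Hypothetical rows; "99" stands in for a coded 'missing' value and is
# therefore not counted as missing here.
rows = [
    {"site": "UIOWA", "race5": "NA"},
    {"site": "UIOWA", "race5": "NA"},
    {"site": "MCW", "race5": "99"},
    {"site": "MCW", "race5": "NA"},
]
report = percent_missing(rows, ["race5"])
```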

·         We also need to figure out why so many patients in our database do 
not appear to have tumors diagnosed between 01JAN2013 and 01MAY2014.
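
One way to list the affected patients is sketched below in Python. It assumes each tumor record carries hypothetical `study_id` and `dx_date` fields, that dx dates are already parsed into `date` objects (or `None` when unknown), and that both ends of the window are inclusive.

```python
from datetime import date

WINDOW_START = date(2013, 1, 1)  # 01JAN2013
WINDOW_END = date(2014, 5, 1)    # 01MAY2014


def patients_without_dx_in_window(records, start=WINDOW_START, end=WINDOW_END):
    """Study IDs with no tumor diagnosed between start and end (inclusive)."""
    everyone = {r["study_id"] for r in records}
    in_window = {
        r["study_id"]
        for r in records
        if r["dx_date"] is not None and start <= r["dx_date"] <= end
    }
    return sorted(everyone - in_window)


# Hypothetical rows: A002 has only a 2015 diagnosis, A003 has no dx date.
rows = [
    {"study_id": "A001", "dx_date": date(2013, 6, 15)},
    {"study_id": "A002", "dx_date": date(2015, 2, 1)},
    {"study_id": "A003", "dx_date": None},
]
outside = patients_without_dx_in_window(rows)
```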

(General observation: You’ll notice that each of the NAACCR concepts 
corresponds to two variables (e.g., N0670_Surg_Prim_Site and 
N0670_Surg_Prim_Site_D). Vince 
and I settled on this arrangement for the datamart. Since then, though, I’ve 
come to believe that the redundancy makes the database difficult to use. 
Perhaps we could keep that in mind for future data cuts.)

I’m very happy to work on these problems with you. Would you like to schedule a 
phone call to plan out how to approach these issues?

Thanks,

Brad

------------------------------------------------------------
Bradley D. McDowell, Ph.D.
Director, Population Research Core
Holden Comprehensive Cancer Center

5240 MERF | The University of Iowa | Iowa City, IA | 52242
Office: 319-384-1768

_______________________________________________
Gpc-dev mailing list
[email protected]
http://listserv.kumc.edu/mailman/listinfo/gpc-dev
