Progress on the CDM v3 query

Nadkarni, Prakash Fri, 25 Mar 2016 16:52:18 -0700

Based on input from James Topping of Duke, I have come to learn the following:


There are two different mechanisms of data query.

1. Query against a set of SAS tables (a SAS dataset), or a locked, static RDBMS 
(which I understand to mean a database whose contents are temporarily frozen .
   For those folks who are doing a regular CDM v3 build as part of their EHR 
ETL process,  this would imply making a single point-in-time backup copy of the 
CDM tables after the diagnostic query is run - the idea being that if the Duke 
folks detect problems, they can point specifically to the data that caused it, 
with some guarantee that the problem data hasn't changed under them).

2. "Menu-driven query" through PopMedNet to access the RDBMS directly. No 
locking of the data is necessary.
However, option 2 is not yet rolled out.

In either case, it is important to note that one of the major changes between 
CDM v2 and v3 is that in v2, instead of using the intrinsic date/time RDBMS 
data types (which SAS can understand), the tables used VARCHAR(N) fields to 
store text as YYYY-MM-DD or HH:MM:SS. Nathan Graham noted that queries that 
involved date fields ran very slowly because the data had to be converted into 
dates by the DBMS optimizer dynamically, a row at a time.   This design mistake 
was rectified in version 3 where the document now specifies that intrinsic 
RDBMS date types should be used.

(Our ETL generated intrinsic date/time types, and some "hello world" SAS 
queries against these fields worked fine.)

I've been studying the SAS documentation for connectivity to RDBMSs, and they 
specify that if you use ODBC, there are two ways of accessing RDBMS data.
1. The preferred approach (in my opinion) is using "SQL pass-through", where 
the SQL is sent directly to the RDBMS without interception/interpretation by 
SAS. This allows the DBMS optimizer to do its thing (e.g., use server-based 
indexes) , so that queries run at server speeds.
2. The alternative method (using the SAS LIBNAME command) is more flexible, in 
that SAS's dialect of SQL is employed instead (so that certain intrinsic SAS 
functions without an RDBMS counterpart may be leveraged - e.g., for specialized 
statistical analyses), but all processing is done in SAS (i.e., on the client), 
so that queries involving multi-table joins, or accessing large amounts of raw 
data that return concise summaries (e.g, GROUP BY queries) run very slowly 
because of network-bandwidth becomes a rate limiting factor.

Prakash


________________________________
Notice: This UI Health Care e-mail (including attachments) is covered by the 
Electronic Communications Privacy Act, 18 U.S.C. 2510-2521, is confidential and 
may be legally privileged.  If you are not the intended recipient, you are 
hereby notified that any retention, dissemination, distribution, or copying of 
this communication is strictly prohibited.  Please reply to the sender that you 
have received the message in error, then delete it.  Thank you.
________________________________

_______________________________________________
Gpc-dev mailing list
[email protected]
http://listserv.kumc.edu/mailman/listinfo/gpc-dev

Progress on the CDM v3 query

Reply via email to