The database meeting at the end of August in Frederick inspired me to really 
dig in and get out a release of DTP/NCI structure data that is as complete as 
possible. It has taken longer than I hoped, but it is now ready. The processes I 
put in place should make future releases easier and therefore more regular. 

The data can be obtained via:

anonymous ftp 
host = dtpsearch.ncifcrf.gov 
file name = nci_dtp_oct2011.zip
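
For anyone who would rather script the download, here is a minimal Python 
sketch using the standard ftplib module. The host and file name are the ones 
given above; the function name fetch_release, the ftp_factory parameter, and 
the assumption that the file sits in the login directory are illustrative 
guesses, not something the release documents.

```python
from ftplib import FTP

HOST = "dtpsearch.ncifcrf.gov"      # host from the announcement
FILENAME = "nci_dtp_oct2011.zip"    # file name from the announcement


def fetch_release(host=HOST, filename=FILENAME, dest=None, ftp_factory=FTP):
    """Download `filename` from `host` via anonymous FTP; return the local path.

    `ftp_factory` is a hypothetical hook so the function can be exercised
    without a network connection; by default it is ftplib.FTP.
    """
    dest = dest or filename
    with ftp_factory(host) as ftp:
        ftp.login()  # no arguments means an anonymous login
        with open(dest, "wb") as out:
            # Assumption: the zip file is in the directory we land in.
            ftp.retrbinary("RETR " + filename, out.write)
    return dest

# Typical use (needs network access):
#     fetch_release()
```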

Please let me know of any problems and feel free to send any suggestions 
regarding format or other issues. There is a readme.txt that explains the 
release. This release has already been used to update our website 
(dtp.cancer.gov), and I should have the DTP structures in PubChem updated by 
the end of next week.

There are two fields in this release that are new. Registration Date is when 
the compound was registered at DTP. It might be interesting to look at trends 
in structure types submitted over time. The other new field is STXT or 
structure text descriptor. For a time this field had a very specific use. I'm 
still trying to go through old files to find the documentation. I think it was 
when CAS was the input contractor, so someone might recognize the formats and 
be able to point to an alternate documentation source. 

One last note. The molecular formula field has historically been entered 
independently of the structure. One way I have been flagging compounds whose 
stored structure might not truly represent the compound is by calculating the 
molecular formula from the structure and comparing it to the formula in the 
database. I use CDK for this, and it needs to parameterize all the atoms to do 
so. A lot of work has been done lately to add atom parameterization, and here 
are some hard numbers to show that work is paying off: about a 33% reduction 
(11502 to 7745) in the number of compounds with unparameterized atoms, good 
stuff!

Category              using cdk 1.2.5    using cdk 1.4.4
consistent                     257140             259524
inconsistent                     4204               5557
unparameterized                 11502               7745
other no comparison               631                651
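
To make the categories above concrete, here is a rough Python sketch of the 
comparison step. It is not the CDK-based process used for the release: the 
names parse_formula and classify are hypothetical, the computed formula is 
simply passed in as a string (or None when the toolkit could not parameterize 
an atom), and only flat formulas such as C6H12O6 are handled. Anything else, 
for example salts written with a dot, falls into "other no comparison" here, 
which is an assumption about that category.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "C6", "Cl", "H12".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Turn a flat formula such as 'C6H12O6' into a Counter of element counts."""
    text = formula.replace(" ", "")
    if not text:
        raise ValueError("empty formula")
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:  # something unparseable was skipped
            raise ValueError("cannot parse formula: %r" % formula)
        counts[match.group(1)] += int(match.group(2) or 1)
        pos = match.end()
    if pos != len(text):
        raise ValueError("cannot parse formula: %r" % formula)
    return counts


def classify(stored, computed):
    """Compare the database formula with the one computed from the structure.

    `computed` is None when no formula could be calculated (assumed here to
    mean the toolkit failed to parameterize at least one atom).
    """
    if computed is None:
        return "unparameterized"
    try:
        same = parse_formula(stored) == parse_formula(computed)
    except ValueError:
        return "other no comparison"
    return "consistent" if same else "inconsistent"
```

Comparing element counts rather than raw strings means the element order in 
the two formulas does not matter.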



DanZ

/********************************************
 *  Daniel Zaharevitz
 *  Chief, Information Technology Branch
 *  Developmental Therapeutics Program
 *  National Cancer Institute
 *  [email protected]
 *
 ********************************************/




