I was inspired by the database meeting at the end of August in Frederick to really dig in and get out a release of DTP/NCI structure data that is as complete as possible. It has taken longer than I hoped, but it is now ready. The processes I put in place should make future releases easier and therefore more regular.
The data can be obtained via anonymous ftp:

    host      = dtpsearch.ncifcrf.gov
    file name = nci_dtp_oct2011.zip

Please let me know of any problems, and feel free to send suggestions regarding format or other issues. There is a readme.txt that explains the release.

This release has already been used to update our website (dtp.cancer.gov), and I should have the update to the DTP structures in PubChem in by the end of next week.

There are two new fields in this release. Registration Date is when the compound was registered at DTP; it might be interesting to look at trends in structure types submitted over time. The other new field is STXT, or structure text descriptor. For a time this field had a very specific use, and I'm still going through old files to find the documentation. I think it dates from when CAS was the input contractor, so someone might recognize the formats and be able to point to an alternate documentation source.

One last note. The molecular formula field has historically been entered independently of the structure. One way I have been flagging compounds whose stored structure might not really be representative of the compound is by calculating the molecular formula from the structure and comparing it to the formula in the database. I use CDK, and it needs to parameterize all the atoms to do this. A lot of work has been done lately to add atom parameterization, and here are some hard numbers to show that work is paying off: a 33% reduction in the number of compounds with unparameterized atoms (11502 down to 7745). Good stuff!
Category              using cdk 1.2.5   using cdk 1.4.4
consistent                     257140            259524
inconsistent                     4204              5557
unparameterized                 11502              7745
other no comparison               631               651

DanZ

/********************************************
 * Daniel Zaharevitz
 * Chief, Information Technology Branch
 * Developmental Therapeutics Program
 * National Cancer Institute
 * [email protected]
 *
 ********************************************/

_______________________________________________
Blueobelisk-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss
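For anyone who wants to try the formula check described in the message, here is a rough sketch in plain Python. It is not the CDK-based process used for the release. It assumes the formula computed from the structure is already available as a string (or as None when the toolkit could not parameterize every atom), and it only understands flat formulas such as C6H12O6. The function names, the example records, and the reading of the "other no comparison" category as "one of the formulas could not be parsed" are all assumptions made for illustration.

```python
import re
from collections import Counter

# An element symbol (capital letter, optional lowercase letter)
# followed by an optional integer count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Parse a flat formula such as 'C6H12O6' into element counts.

    Returns None when the string holds anything this simple parser
    does not understand (dots, parentheses, charges, stray characters).
    """
    formula = formula.strip()
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            return None  # skipped over characters that are not element tokens
        symbol, digits = match.groups()
        counts[symbol] += int(digits) if digits else 1
        pos = match.end()
    if pos != len(formula) or not counts:
        return None
    return counts


def classify(stored_formula, computed_formula):
    """Assign one of the four category names used in the table.

    Assumption: computed_formula is None when the toolkit could not
    parameterize every atom, so no formula could be calculated.
    """
    if computed_formula is None:
        return "unparameterized"
    stored = parse_formula(stored_formula)
    computed = parse_formula(computed_formula)
    if stored is None or computed is None:
        # Assumed meaning of this category, not taken from the release.
        return "other no comparison"
    return "consistent" if stored == computed else "inconsistent"


# Hypothetical records: (label, formula in database, formula from structure).
records = [
    ("example-1", "H12C6O6", "C6H12O6"),    # same counts, different order
    ("example-2", "C6H12O6", "C6H10O5"),
    ("example-3", "C10H12FeN2", None),
    ("example-4", "C2H6O.ClH", "C2H7ClO"),  # dot notation is not parsed
]
tally = Counter(classify(stored, computed) for _, stored, computed in records)
print(dict(tally))
```

Comparing element counts rather than raw strings means differences in element order do not produce false mismatches. A real run would need to handle salts, charges and isotopes, which this sketch deliberately rejects instead of guessing.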
