Dear All,

A side comment:

The two parallel sets of chain, residue, and atom names are actually necessary for NMR structures - it is not just to keep the depositors happy. The reason is that PDB atom names reflect absolute stereochemistry. In NMR the absolute stereochemistry is not always known, but the atoms can still be distinguished on the basis of chemical shift. To keep the link to the measurements used to define the structure, you need to keep the original names.

No opinion on the IDs, though.

Yours,
Rasmus

On 16/02/2023 20:07, Marcin Wojdyr wrote:
Dear all,

I looked at the atom table (atom_site) in the NextGen archive. It has 5 new columns: pdbx_label_index (another sequence ID) and four columns for SIFTS mapping at atom level. I used SIFTS in the past and having it in the coordinate file can simplify things. But as was stated in the announcement, the mapping is provided at many levels:

> Sequence mappings are provided in _pdbx_sifts_unp_segments and _pdbx_sifts_xref_db_segments categories for each segment, _pdbx_sifts_xref_db at residue level, and _atom_site at atom level.

I wonder, wouldn't it be sufficient to provide the mapping at the residue level? Repeating it for every atom actually makes working with the files harder, for two reasons.

One is that additional identifiers will be confusing. IMO it's been the main problem with mmCIF from the beginning -- too many identifiers. Every atom has two chain names, two sequence IDs, two residue names and two atom names. My guess is that the IUCr committee tried to make both depositors and the PDB happy: the depositor could name the chain or atom as they liked, and if the PDB didn't like the names, it could use the second set. (Please correct me if you were there.) Currently the PDB changes both sets of IDs during deposition, so the "author" IDs are not really the author's, but we still have these two sets. In some older entries the author's chain B is the PDB's chain A, and the author's chain A is the PDB's chain B. I think it's not controversial that having two alternative names for one thing is not particularly helpful here and that, in this respect, the PDB format was designed better. These are old problems and I have no hope that the double naming will be dropped. But the new file looks like this:

[Attached image: new-cif.png]

It has four different sequence IDs.
2 – "author" ID, the same as in the PDB
1 – "label" ID
3 – new one, apparently the same as 2 for polymers and the same as 1 for non-polymers 😳 Am I missing something?
4 – UniProt-compatible
Could we avoid it?
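
On the earlier point that a residue-level mapping would be sufficient, here is a minimal sketch of how a per-atom value could be recovered from a per-residue table with a plain dictionary lookup. The data, the key layout (label_asym_id, label_seq_id) and the function name are made up for illustration; this is not how any particular library or the archive itself does it.

```python
# Hypothetical residue-level mapping:
# (label_asym_id, label_seq_id) -> UniProt residue number.
residue_to_unp = {("A", 1): 27, ("A", 2): 28}

# Hypothetical atoms as (label_asym_id, label_seq_id, atom name) tuples.
atoms = [("A", 1, "N"), ("A", 1, "CA"), ("A", 2, "N"), ("B", 1, "O")]


def unp_number_per_atom(atoms, residue_to_unp):
    """Return the UniProt residue number for each atom, or None where unmapped."""
    return [residue_to_unp.get((asym, seq)) for asym, seq, _name in atoms]


print(unp_number_per_atom(atoms, residue_to_unp))  # [27, 27, 28, None]
```

Under these assumptions the per-atom columns carry nothing that the residue-level table does not already provide.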

The second reason is bloat. I maintain code for working with CIF files (it's even used in PDBe) and just yesterday a user commented that reading 5j7v-assembly1.cif (4.7GB) requires a lot of memory. If we store each word from such a file in the simplest way in C++ (as a string), we use, say, 32 bytes per word. A 5GB file with short words then easily takes up 40GB of memory just for storing the words. I could work on optimizing it, but it'd help if the files were not bloated even more. The NextGen Archive has 24% more words in the atom table, all of them redundant.
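
The back-of-the-envelope numbers above can be reproduced with a short sketch. It rests on two assumptions that are not stated in the message: an average of 4 bytes per word in the file (including the separating whitespace), and 21 columns in the conventional atom table before the 5 new ones are added. The function name is hypothetical.

```python
def estimate_word_memory(file_bytes, avg_word_bytes, bytes_per_string=32):
    """Rough memory in bytes needed to hold every word as a separate string object."""
    n_words = file_bytes / avg_word_bytes
    return n_words * bytes_per_string


# Assumption: about 4 bytes per word on disk, including whitespace.
estimate = estimate_word_memory(5 * 10**9, avg_word_bytes=4)
print(f"{estimate / 10**9:.0f} GB")  # 40 GB

# Assumption: 21 columns in the existing atom table, 5 new ones added.
extra_words_percent = 5 / 21 * 100
print(f"{extra_words_percent:.0f}% more words")  # 24% more words
```

Longer words would lower the estimate, so the 40GB figure only applies to files dominated by short values, as the atom table is.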

Thanks for reading,
Marcin

------------------------------------------------------------------------

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1


--
Rasmus H Fogh                                 Tel.: +44 (0)1223 353033
Global Phasing Ltd.,                          Fax : +44 (0)1223 366899
Sheraton House, Castle Park,
Cambridge CB3 0AX, United Kingdom

########################################################################

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list 
hosted by www.jiscmail.ac.uk, terms & conditions are available at 
https://www.jiscmail.ac.uk/policyandsecurity/
