Re: [ccp4bb] wwPDB news: Prototype of PDB NextGen Archive now available

Rasmus Fogh Fri, 17 Feb 2023 02:36:38 -0800

Dear All,

A side comment:

The two parallel sets of chain, residue, and atom names are actually necessary for NMR structures - it is not just to keep the depositors happy. The reason is that PDB atom names reflect absolute stereochemistry. In NMR the absolute stereochemistry is not always known, but the atoms can still be distinguished on the basis of chemical shift. In order to keep the link to the measurements used to define the structure you need to keep the original names.


No opinion on the IDs, though.

Yours,
Rasmus

On 16/02/2023 20:07, Marcin Wojdyr wrote:

Dear all,
I looked at the atom table (atom_site) in the NextGen archive. It has 5 new columns: pdbx_label_index (another sequence ID) and four columns for SIFTS mapping at atom level.
I used SIFTS in the past and having it in the coordinate file can simplify things. But as was stated in the announcement, the mapping is provided at many levels:
> Sequence mappings are provided in _pdbx_sifts_unp_segments and _pdbx_sifts_xref_db_segments categories for each segment, _pdbx_sifts_xref_db at residue level, and _atom_site at atom level.
I wonder, wouldn't it be sufficient to provide the mapping at the residue level? Repeating it for every atom actually makes working with the files harder.
For two reasons.
One is that additional identifiers will be confusing. IMO it's the main problem with mmCIF from the beginning -- too many identifiers. Every atom has two chain names, two sequence IDs, two residue names and two atom names. My guess is the IUCr committee tried to make both depositors and PDB happy. The depositor could name the chain or atom as they like, and if the PDB doesn't like the names, they can use the second set of names. (please correct me if you were there). Currently the PDB changes both sets of ids during deposition, so the "author" IDs are not really author's, but we still have these two sets. In some older entries author's chain B is PDB's chain A, and author's chain A is PDB's chain B. I think it's not controversial that having two alternative names for one thing is not particularly helpful here and, in this respect, the PDB format was designed better.
These are old problems and I don't have hope that the double naming will be dropped. But the new file looks like this:
[image: new-cif.png]

It has four different sequence IDs.
2 – "author" ID, the same as in the PDB
1 – "label" ID
3 – new one, apparently the same as 2 for polymers and the same as 1 for non-polymers 😳 Am I missing something?
4 – UniProt-compatible
Could we avoid it?
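
To illustrate the question of whether a residue-level mapping would be enough, here is a minimal Python sketch. It assumes that every atom of a residue shares the same mapping, so the per-atom values can be recovered with one lookup. The data, the accession "P00000", the dictionary layout and the function name are all hypothetical and are not taken from the NextGen files.

```python
# Hypothetical residue-level table: (chain, sequence ID) -> (accession, UniProt residue number).
# "P00000" is a made-up placeholder accession, not a real entry.
residue_map = {
    ("A", 1): ("P00000", 25),
    ("A", 2): ("P00000", 26),
}

# Hypothetical atom records carrying only the residue key, with no per-atom mapping columns.
atoms = [
    {"chain": "A", "seq_id": 1, "atom": "N"},
    {"chain": "A", "seq_id": 1, "atom": "CA"},
    {"chain": "A", "seq_id": 2, "atom": "N"},
    {"chain": "B", "seq_id": 1, "atom": "O"},  # e.g. a water with no mapping
]

def uniprot_for_atom(atom, residue_map):
    """Return the residue-level mapping for an atom, or None if its residue is unmapped."""
    return residue_map.get((atom["chain"], atom["seq_id"]))

for atom in atoms:
    print(atom["chain"], atom["seq_id"], atom["atom"], uniprot_for_atom(atom, residue_map))
```

Under that assumption the mapping is stored once per residue and looked up on demand, instead of being repeated on every atom row.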
The second reason is bloat. I maintain code for working with CIF files (it's even used in PDBe) and just yesterday a user commented that reading 5j7v-assembly1.cif (4.7GB) requires a lot of memory. If we store each word from such a file in the simplest way in C++ (as a string), we use, say, 32 bytes per word. 5GB file with short words easily takes up 40GB of memory just for storing the words. I could work on optimizing it, but it'd help if the files were not bloated even more. The NextGen Archive has 24% more words in the atom table. All redundant.
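
The arithmetic behind the 40GB figure can be sketched in Python as follows. The 32 bytes per word comes from the message; the average of 4 bytes per word (including its separator) is an assumption back-calculated from the 5GB and 40GB figures, and applying the 24% increase to the whole file assumes the atom table dominates it. The function name is hypothetical.

```python
GB = 10**9

def estimate_word_storage(file_bytes, avg_token_bytes=4, bytes_per_word=32):
    """Rough memory needed to hold every word of a file as a separate string object.

    avg_token_bytes is an assumed average word length including its separator;
    bytes_per_word is the per-string cost quoted in the message.
    """
    n_words = file_bytes // avg_token_bytes
    return n_words * bytes_per_word

baseline = estimate_word_storage(5 * GB)
# Assumption: nearly all words are in the atom table, so 24% more words
# there means roughly 24% more words overall.
bloated = baseline * 124 // 100

print(baseline / GB)  # 40.0
print(bloated / GB)   # 49.6
```

With these assumptions the estimate grows from about 40GB to roughly 50GB; real numbers depend on the actual word lengths and on how the strings are stored.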
Thanks for reading,
Marcin

------------------------------------------------------------------------

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1


--
Rasmus H Fogh                                 Tel.: +44 (0)1223 353033
Global Phasing Ltd.,                          Fax : +44 (0)1223 366899
Sheraton House, Castle Park,
Cambridge CB3 0AX, United Kingdom

########################################################################

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list 
hosted by www.jiscmail.ac.uk, terms & conditions are available at 
https://www.jiscmail.ac.uk/policyandsecurity/
