Dear All,
A side comment:
The two parallel sets of chain, residue, and atom names are actually
necessary for NMR structures - it is not just to keep the depositors
happy. The reason is that PDB atom names reflect absolute
stereochemistry. In NMR the absolute stereochemistry is not always
known, but the atoms can still be distinguished on the basis of
chemical shift. In order to keep the link to the measurements used to
define the structure you need to keep the original names.
No opinion on the IDs, though.
Yours,
Rasmus
On 16/02/2023 20:07, Marcin Wojdyr wrote:
Dear all,
I looked at the atom table (atom_site) in the NextGen archive. It has 5
new columns: pdbx_label_index (another sequence ID) and four columns for
SIFTS mapping at atom level.
I used SIFTS in the past and having it in the coordinate file can
simplify things. But as was stated in the announcement, the mapping is
provided at many levels:
> Sequence mappings are provided in _pdbx_sifts_unp_segments and
> _pdbx_sifts_xref_db_segments categories for each segment,
> _pdbx_sifts_xref_db at residue level, and _atom_site at atom level.
I wonder, wouldn't it be sufficient to provide the mapping at the
residue level? Repeating it for every atom actually makes working with
the files harder, for two reasons.
One is that additional identifiers will be confusing. IMO this has been
the main problem with mmCIF from the beginning: too many identifiers.
Every atom has two chain names, two sequence IDs, two residue names and
two atom names. My guess is that the IUCr committee tried to make both
depositors and the PDB happy: the depositor could name the chain or atom
as they liked, and if the PDB didn't like the names, it could use the
second set of names. (Please correct me if you were there.) Currently
the PDB changes both sets of IDs during deposition, so the "author" IDs
are not really the author's, but we still have these two sets. In some
older entries the author's chain B is the PDB's chain A, and the
author's chain A is the PDB's chain B. I think it's not controversial
that having two alternative names for one thing is not particularly
helpful here and that, in this respect, the PDB format was designed
better.
These are old problems and I don't have hope that the double naming will
be dropped. But the new file looks like this:
[Attached image: new-cif.png]
It has four different sequence IDs.
2 – "author" ID, the same as in the PDB
1 – "label" ID
3 – new one, apparently the same as 2 for polymers and the same as 1 for
non-polymers 😳 Am I missing something?
4 – UniProt-compatible
Could we avoid it?
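
To make the residue-level alternative concrete, here is a minimal Python
sketch. Everything in it is hypothetical: the data are made up, and the
names residue_to_uniprot, atoms and uniprot_number are illustrative
only, not taken from the NextGen files or from any existing library. It
only shows that a mapping stored once per residue can be looked up for
any atom through the atom's chain and sequence ID.

```python
# Hypothetical, hand-made data; not taken from a real NextGen entry.
# Residue-level mapping: (chain, label sequence ID) -> UniProt residue number.
residue_to_uniprot = {
    ("A", 1): 24,
    ("A", 2): 25,
}

# Atom records carry only the residue key, not the mapping itself.
atoms = [
    ("A", 1, "N"),
    ("A", 1, "CA"),
    ("A", 2, "N"),
    ("A", 2, "CA"),
]


def uniprot_number(atom):
    """Return the UniProt residue number for one atom via its residue key."""
    chain, seq_id, _atom_name = atom
    # None signals a residue with no mapping (e.g. a ligand or expression tag).
    return residue_to_uniprot.get((chain, seq_id))


# The per-atom values are derived on demand instead of being stored per atom.
per_atom = [uniprot_number(a) for a in atoms]
```

In this sketch the mapping is stored once per residue, and the per-atom
column is recomputed with a dictionary lookup whenever it is needed.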
The second reason is bloat. I maintain code for working with CIF files
(it's even used in PDBe) and just yesterday a user commented that
reading 5j7v-assembly1.cif (4.7 GB) requires a lot of memory. If we
store each word from such a file in the simplest way in C++ (as a
string), we use, say, 32 bytes per word. A 5 GB file with short words
easily takes up 40 GB of memory just for storing the words. I could work
on optimizing it, but it would help if the files were not bloated even
more. The NextGen Archive has 24% more words in the atom table, all of
them redundant.
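
The 40 GB figure can be reproduced with a back-of-the-envelope
calculation. The Python sketch below is only an illustration under
stated assumptions: the average of 4 bytes per word (including the
separating whitespace) is a guess chosen because it matches the figure
above, and the function name estimate_string_memory is hypothetical.
The last step additionally assumes that nearly all words of the file
sit in the atom table, which is a simplification.

```python
def estimate_string_memory(file_bytes, avg_token_bytes, bytes_per_string=32):
    """Estimate bytes needed to hold every word as a separate string object.

    file_bytes       -- size of the file on disk
    avg_token_bytes  -- assumed average word length, separator included
    bytes_per_string -- assumed fixed cost of one string object
    """
    n_tokens = file_bytes // avg_token_bytes
    return n_tokens * bytes_per_string


file_bytes = 5 * 10**9  # a 5 GB file
avg_token_bytes = 4     # assumption: short words, about 3 characters + space

memory_bytes = estimate_string_memory(file_bytes, avg_token_bytes)
memory_gb = memory_bytes / 10**9  # 40.0

# Assumption: almost every word is in the atom table, so 24% more words
# there means roughly 24% more words (and string objects) overall.
n_tokens = file_bytes // avg_token_bytes
memory_with_extra_columns = (n_tokens * 124 // 100) * 32
```

With these assumptions the string storage is 8 times the file size, and
the extra columns would add roughly another 10 GB on top of the 40 GB.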
Thanks for reading,
Marcin
------------------------------------------------------------------------
To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1
--
Rasmus H Fogh Tel.: +44 (0)1223 353033
Global Phasing Ltd., Fax : +44 (0)1223 366899
Sheraton House, Castle Park,
Cambridge CB3 0AX, United Kingdom