Re: [ccp4bb] wwPDB news: Prototype of PDB NextGen Archive now available

Marcin Wojdyr Thu, 16 Feb 2023 12:08:05 -0800

Dear all,

I looked at the atom table (atom_site) in the NextGen archive. It has 5 new
columns: pdbx_label_index (another sequence ID) and four columns for SIFTS
mapping at atom level.
I have used SIFTS in the past, and having it in the coordinate file can
simplify things. But, as stated in the announcement, the mapping is
provided at many levels:


> Sequence mappings are provided in _pdbx_sifts_unp_segments and
> _pdbx_sifts_xref_db_segments categories for each segment,
> _pdbx_sifts_xref_db at residue level, and _atom_site at atom level.

I wonder, wouldn't it be sufficient to provide the mapping at the residue
level? Repeating it for every atom actually makes working with the files
harder, for two reasons.

One is that the additional identifiers will be confusing. IMO this has
been the main problem with mmCIF from the beginning: too many identifiers.
Every atom has two chain names, two sequence IDs, two residue names and two
atom names. My guess is that the IUCr committee tried to make both
depositors and the PDB happy: the depositor could name the chain or atom as
they liked, and if the PDB didn't like the names, it could use the second
set of names (please correct me if you were there). Currently the PDB
changes both sets of IDs during deposition, so the "author" IDs are not
really the author's, but we still have these two sets. In some older
entries the author's chain B is the PDB's chain A, and the author's chain A
is the PDB's chain B. I think it's not controversial that having two
alternative names for one thing is not particularly helpful here and that,
in this respect, the PDB format was designed better.
These are old problems and I don't have hope that the double naming will be
dropped. But the new file looks like this:

[image: new-cif.png]

It has four different sequence IDs.
2 – "author" ID, the same as in the PDB
1 – "label" ID
3 – new one, apparently the same as 2 for polymers and the same as 1 for
non-polymers 😳 Am I missing something?
4 – UniProt-compatible
Could we avoid it?
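
As a rough illustration of why a residue-level table could be enough, here
is a minimal Python sketch that recovers the UniProt numbering for any atom
by a lookup keyed on the existing label IDs. The data, the accession
"P00000" and the helper name unp_for_atom are made up for the example;
nothing here is taken from the NextGen files themselves.

```python
# Hypothetical residue-level mapping: one entry per residue, keyed by
# (label_asym_id, label_seq_id). Values are (UniProt accession, UniProt
# residue number); all of them are invented for illustration.
residue_map = {
    ("A", 1): ("P00000", 24),
    ("A", 2): ("P00000", 25),
}

# A few atoms, each carrying only the usual label identifiers.
atoms = [
    {"label_asym_id": "A", "label_seq_id": 1, "label_atom_id": "N"},
    {"label_asym_id": "A", "label_seq_id": 1, "label_atom_id": "CA"},
    {"label_asym_id": "A", "label_seq_id": 2, "label_atom_id": "N"},
    {"label_asym_id": "B", "label_seq_id": None, "label_atom_id": "O"},  # e.g. water
]


def unp_for_atom(atom, mapping):
    """Return (accession, residue number) for an atom, or None if unmapped."""
    key = (atom["label_asym_id"], atom["label_seq_id"])
    return mapping.get(key)


for atom in atoms:
    print(atom["label_atom_id"], unp_for_atom(atom, residue_map))
```

The mapping is stored once per residue and looked up on demand, so no
extra per-atom columns are needed to get the same information.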

The second reason is bloat. I maintain code for working with CIF files
(it's even used in PDBe) and just yesterday a user commented that reading
5j7v-assembly1.cif (4.7GB) requires a lot of memory. If we store each word
from such a file in the simplest way in C++ (as a string), we use, say, 32
bytes per word. A 5GB file with short words easily takes up 40GB of memory
just for storing the words. I could work on optimizing it, but it'd help if
the files were not bloated even more. The NextGen Archive has 24% more
words in the atom table, all of them redundant.
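
The back-of-envelope estimate above can be written out as a small Python
sketch. The average of 4 bytes per word in the file (a short word plus its
separator) is an assumption made here to reproduce the 40GB figure, and the
function name is hypothetical; 32 bytes per stored word is the "say, 32"
from the text.

```python
def estimated_memory_gb(file_size_gb, avg_bytes_per_word, bytes_per_string=32):
    """Rough memory needed to keep every word of a file as a separate string.

    avg_bytes_per_word is the average on-disk size of a word including its
    separator (an assumed value); bytes_per_string is the in-memory cost of
    one stored word.
    """
    n_words = file_size_gb * 1e9 / avg_bytes_per_word
    return n_words * bytes_per_string / 1e9


base = estimated_memory_gb(5, 4)
print(base)  # 40.0

# If nearly all words are in the atom table, 24% more words there means
# roughly 24% more memory for the same structure.
print(round(base * 1.24, 1))
```

The point of the sketch is only the ratio: with short words, the in-memory
representation is several times larger than the file, so every extra
column is paid for several times over.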

Thanks for reading,
Marcin
