Dear all,

I looked at the atom table (atom_site) in the NextGen archive. It has five new columns: pdbx_label_index (another sequence ID) and four columns for SIFTS mapping at atom level. I used SIFTS in the past, and having it in the coordinate file can simplify things. But, as was stated in the announcement, the mapping is provided at many levels:

> Sequence mappings are provided in _pdbx_sifts_unp_segments and _pdbx_sifts_xref_db_segments categories for each segment, _pdbx_sifts_xref_db at residue level, and _atom_site at atom level.

I wonder, wouldn't it be sufficient to provide the mapping at the residue level? Repeating it for every atom actually makes working with the files harder, for two reasons.

One is that additional identifiers will be confusing. IMO it has been the main problem with mmCIF from the beginning -- too many identifiers. Every atom has two chain names, two sequence IDs, two residue names and two atom names. My guess is that the IUCr committee tried to make both depositors and the PDB happy: the depositor could name the chain or atom as they liked, and if the PDB didn't like the names, it could use the second set of names. (Please correct me if you were there.) Currently the PDB changes both sets of IDs during deposition, so the "author" IDs are not really the author's, but we still have these two sets. In some older entries author's chain B is PDB's chain A, and author's chain A is PDB's chain B. I think it's not controversial that having two alternative names for one thing is not particularly helpful here and, in this respect, the PDB format was designed better.

These are old problems and I don't have hope that the double naming will be dropped. But the new file looks like this:

[image: new-cif.png]

It has four different sequence IDs:
2 – "author" ID, the same as in the PDB
1 – "label" ID
3 – new one, apparently the same as 2 for polymers and the same as 1 for non-polymers 😳 Am I missing something?
4 – UniProt-compatible

Could we avoid it?

The second reason is bloat. I maintain code for working with CIF files (it's even used in PDBe) and just yesterday a user commented that reading 5j7v-assembly1.cif (4.7GB) requires a lot of memory. If we store each word from such a file in the simplest way in C++ (as a string), we use, say, 32 bytes per word.
A 5GB file with short words easily takes up 40GB of memory just for storing the words. I could work on optimizing it, but it'd help if the files were not bloated even more. The NextGen Archive has 24% more words in the atom table, all of them redundant.

Thanks for reading,
Marcin
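
P.S. For anyone who wants to check the memory estimate, here is a rough back-of-the-envelope sketch. The 32 bytes per word is the "say, 32 bytes" figure from above; the average of 4 bytes per word on disk (including the separating whitespace) is purely my assumption for "short words", and the function name is made up for illustration. It is not a measurement of any real parser.

```python
def estimate_token_memory(file_bytes, avg_token_bytes, bytes_per_token=32):
    """Rough memory (in bytes) needed to keep every word of a file
    as a separate fixed-overhead string object.

    avg_token_bytes is the assumed average size of a word on disk,
    including the whitespace that separates it from the next one.
    """
    n_tokens = file_bytes / avg_token_bytes
    return n_tokens * bytes_per_token


# Assumed numbers: a 5GB file whose words average 4 bytes on disk.
estimate = estimate_token_memory(5 * 10**9, avg_token_bytes=4)
print(f"{estimate / 1e9:.0f} GB")  # prints "40 GB"
```

With longer words the ratio is smaller, so the 8x blow-up is an upper-end figure for files dominated by short tokens such as the atom table.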
