Re: [ccp4bb] wwPDB news: Prototype of PDB NextGen Archive now available

2023-02-22 Thread Rasmus Fogh

Dear Marcin,

It turns out that the wheels of PDB deposition have turned very slowly
since I left the CCPN project. The agreement between the PDB,
BioMagResBank and CCPN was as I have described, but I gather that the
PDB cannot, after all, accept NMR submissions as originally agreed,
because _atom_site.auth_atom_id requires an IUPAC atom name and NMR
depositions have to use non-IUPAC names. Apparently a different tag,
_atom_site.pdbx_atom_ambiguity, has now been agreed, but I am not sure
whether it has been populated in the PDB yet. So I withdraw my
original comment.


If you are particularly interested in NMR protein structures, the CCPN 
(ccpn.ac.uk) would be the place to go.


Yours,
Rasmus


On 17/02/2023 14:13, Marcin Wojdyr wrote:

> Hi Rasmus,
>
>> The two parallel sets of chain, residue, and atom names are actually
>> necessary for NMR structures - it is not just to keep the depositors
>> happy.
>
> Is there a PDB entry that exemplifies this?
> I admit I don't know how the chemical-shift based naming works, but I
> see that both sets of atom names (and both sets of residue names) are
> always identical, so they can't carry additional information.
>
> Best wishes
> Marcin


--
Rasmus H Fogh Tel.: +44 (0)1223 353033
Global Phasing Ltd.,  Fax : +44 (0)1223 366899
Sheraton House, Castle Park,
Cambridge CB3 0AX, United Kingdom



To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB=1

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list 
hosted by www.jiscmail.ac.uk, terms & conditions are available at 
https://www.jiscmail.ac.uk/policyandsecurity/


Re: [ccp4bb] wwPDB news: Prototype of PDB NextGen Archive now available

2023-02-17 Thread Marcin Wojdyr
Hi Rasmus,

> The two parallel sets of chain, residue, and atom names are actually
> necessary for NMR structures - it is not just to keep the depositors
> happy.

Is there a PDB entry that exemplifies this?
I admit I don't know how the chemical-shift based naming works, but I
see that both sets of atom names (and both sets of residue names) are
always identical, so they can't carry additional information.

Best wishes
Marcin





Re: [ccp4bb] wwPDB news: Prototype of PDB NextGen Archive now available

2023-02-17 Thread Rasmus Fogh

Dear All,

A side comment:

The two parallel sets of chain, residue, and atom names are actually 
necessary for NMR structures - it is not just to keep the depositors 
happy. The reason is that PDB atom names reflect absolute 
stereochemistry. In NMR the absolute stereochemistry is not always 
known, but the atoms can still be distinguished on the basis of
chemical shift. In order to keep the link to the measurements used to 
define the structure you need to keep the original names.


No opinion on the IDs, though.

Yours,
Rasmus








Re: [ccp4bb] wwPDB news: Prototype of PDB NextGen Archive now available

2023-02-16 Thread Marcin Wojdyr
Dear all,

I looked at the atom table (atom_site) in the NextGen archive. It has 5 new
columns: pdbx_label_index (another sequence ID) and four columns for SIFTS
mapping at atom level.
I used SIFTS in the past and having it in the coordinate file can simplify
things. But as was stated in the announcement, the mapping is provided at
many levels:

> Sequence mappings are provided in _pdbx_sifts_unp_segments and
_pdbx_sifts_xref_db_segments categories for each segment,
_pdbx_sifts_xref_db at residue level, and _atom_site at atom level.

I wonder, wouldn't it be sufficient to provide the mapping at the residue
level? Repeating it for every atom actually makes working with the files
harder, for two reasons.

One is that additional identifiers will be confusing. IMO it's the main
problem with mmCIF from the beginning -- too many identifiers. Every atom
has two chain names, two sequence IDs, two residue names and two atom
names. My guess is the IUCr committee tried to make both depositors and PDB
happy. The depositor could name the chain or atom as they like, and if the
PDB doesn't like the names, they can use the second set of names. (please
correct me if you were there). Currently the PDB changes both sets of ids
during deposition, so the "author" IDs are not really author's, but we
still have these two sets. In some older entries author's chain B is PDB's
chain A, and author's chain A is PDB's chain B. I think it's not
controversial that having two alternative names for one thing is not
particularly helpful here and, in this respect, the PDB format was designed
better.
These are old problems and I don't have hope that the double naming will be
dropped. But the new file looks like this:

[image: new-cif.png]

It has four different sequence IDs.
2 – "author" ID, the same as in the PDB
1 – "label" ID
3 – new one, apparently the same as 2 for polymers and the same as 1 for
non-polymers. Am I missing something?
4 – UniProt-compatible
Could we avoid it?
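
To make the two-sets-of-names problem concrete, here is a minimal
sketch of the chain swap mentioned above. The column names
(label_asym_id, auth_asym_id, label_seq_id, auth_seq_id) are the
standard atom_site ones; the rows themselves and the select_chain
helper are invented for illustration and are not taken from any real
entry or library.

```python
# Toy atom_site rows: invented values, NOT from a real PDB entry.
# The author's chain B is the PDB's (label) chain A, and vice versa.
ATOMS = [
    {"label_asym_id": "A", "auth_asym_id": "B",
     "label_seq_id": "1", "auth_seq_id": "101", "label_atom_id": "CA"},
    {"label_asym_id": "B", "auth_asym_id": "A",
     "label_seq_id": "1", "auth_seq_id": "201", "label_atom_id": "CA"},
]


def select_chain(atoms, name, scheme):
    """Return atoms whose chain name matches in the given naming scheme.

    scheme is "label" or "auth" (hypothetical helper for this sketch).
    """
    key = f"{scheme}_asym_id"
    return [atom for atom in atoms if atom[key] == name]


# "Chain A" means a different set of atoms depending on the scheme.
by_label = select_chain(ATOMS, "A", "label")
by_auth = select_chain(ATOMS, "A", "auth")
print(by_label[0]["auth_seq_id"], by_auth[0]["auth_seq_id"])
```

Any tool that accepts "chain A" from a user has to decide which of the
two schemes is meant, and each extra sequence ID adds one more such
decision.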

The second reason is bloat. I maintain code for working with CIF files
(it's even used in PDBe) and just yesterday a user commented that reading
5j7v-assembly1.cif (4.7GB) requires a lot of memory. If we store each word
from such a file in the simplest way in C++ (as a string), we use, say, 32
bytes per word. A 5GB file with short words easily takes up 40GB of memory
just for storing the words. I could work on optimizing it, but it'd help if
the files were not bloated even more. The NextGen Archive has 24% more
words in the atom table. All redundant.
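
The estimate above can be reproduced with a few lines of arithmetic.
This is a rough sketch, not a measurement: the average of 4 bytes per
word (including the separating whitespace) is an assumption chosen to
stand for "short words", and applying the 24% increase to every word
assumes that nearly all words in the file belong to the atom table.

```python
def estimate_token_memory(file_bytes, avg_token_bytes, bytes_per_string=32):
    """Estimate memory needed to hold every word of a file as a string object.

    avg_token_bytes: assumed average word length including its separator.
    bytes_per_string: assumed per-word cost (the "say, 32 bytes" figure).
    """
    n_tokens = file_bytes / avg_token_bytes
    return n_tokens * bytes_per_string


# A 5 GB file of short words (assumed 4 bytes each, separator included).
estimate = estimate_token_memory(5e9, 4)

# Same file with 24% more words, assuming the atom table dominates.
with_nextgen = estimate * 1.24

print(f"{estimate / 1e9:.0f} GB, {with_nextgen / 1e9:.1f} GB")
```

With these assumptions the first figure comes out at 40 GB, matching
the number quoted above; longer average words would lower it.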

Thanks for reading,
Marcin





[ccp4bb] wwPDB news: Prototype of PDB NextGen Archive now available

2023-02-07 Thread Deborah Harrus

Dear all,

A prototype of a next generation archive repository for the PDB is now
available. The archive, called “NextGen”, hosts structural model files
in PDBx/mmCIF and PDBML formats at files-nextgen.wwpdb.org. This
enriched PDB archive provides annotation from external database
resources in the metadata in addition to the content provided in the
structure model files in the PDB main archive at files.wwpdb.org.


This prototype provides sequence annotation from external resources such 
as UniProt, SCOP2 and Pfam at atom, residue, and chain levels. This 
mapping information is derived from the Structure Integration with 
Function, Taxonomy and Sequence (SIFTS) project 
(https://www.ebi.ac.uk/pdbe/docs/sifts/), a service developed and 
maintained by the PDBe and UniProt teams at EMBL-EBI. Sequence mappings 
are provided in _pdbx_sifts_unp_segments and 
_pdbx_sifts_xref_db_segments categories for each segment, 
_pdbx_sifts_xref_db at residue level, and _atom_site at atom level.


The PDB NextGen Repository is currently updated monthly on the first 
Wednesday of the month at 00:00 UTC and is subject to change in the 
future. You can access these NextGen files at the following locations:


 * wwPDB: https://files-nextgen.wwpdb.org, rsync://rsync-nextgen.wwpdb.org
 * RCSB PDB (USA): https://files-nextgen.rcsb.org,
   rsync://rsync-nextgen.rcsb.org
 * PDBe (UK): https://ftp.ebi.ac.uk/pub/databases/pdb_nextgen/
 * PDBj (Japan): https://ftp-nextgen.pdbj.org


Data are structured based on entry ID with a two-letter hash code made
up of the third-from-last and second-from-last characters of the ID
(for example, "al" for pdb_8aly). This hash code will remain consistent
once PDB ID codes are extended beyond four characters with the pdb_
prefix.


Some examples are shown below:

 * Access entry pdb_8aly at
   https://files-nextgen.wwpdb.org/pdb_nextgen/data/entries/divided/al/pdb_8aly/
 * Both PDBx/mmCIF and PDBML are provided at this location. For entry
   pdb_8aly:
    o pdb_8aly_xyz-enrich.cif.gz
    o pdb_8aly_xyz-no-atom-enrich.xml.gz
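
The directory rule and the example above can be sketched in a few
lines. This is only an illustration of the rule as stated in this
announcement: the function names are hypothetical, the base URL and
file-name suffixes are copied from the example, and no check is made
that a given entry actually exists in the archive.

```python
# Base path copied from the pdb_8aly example in this announcement.
BASE = "https://files-nextgen.wwpdb.org/pdb_nextgen/data/entries/divided"


def nextgen_entry_url(entry_id, base=BASE):
    """Build the entry directory URL from an ID such as 'pdb_8aly'.

    The two-letter hash code is the third-from-last and
    second-from-last characters of the ID.
    """
    entry_id = entry_id.lower()
    hash_code = entry_id[-3:-1]
    return f"{base}/{hash_code}/{entry_id}/"


def nextgen_file_names(entry_id):
    """Return the two file names listed for an entry in the example."""
    entry_id = entry_id.lower()
    return [
        f"{entry_id}_xyz-enrich.cif.gz",
        f"{entry_id}_xyz-no-atom-enrich.xml.gz",
    ]


print(nextgen_entry_url("pdb_8aly"))
print(nextgen_file_names("pdb_8aly"))
```

Because the hash code is taken from the end of the ID, the same
function gives the same directory whether or not the ID carries the
pdb_ prefix.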

Please contact info@wwpdb.org with any questions.

Read this news on the wwPDB website: 
https://www.wwpdb.org/news/news?year=2023#63cedad9b5f08ee94ab73826


Kind regards,

Deborah Harrus

PDBe

--
---
Deborah Harrus, Ph.D.
Lead Annotator
PDBe - Protein Data Bank in Europe

European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD UK

http://www.PDBe.org
---


