We have drifted far from the original topic of this thread and if we continue I'll just make more of a fool of myself.

I'll just go back to the point I started with: encoding connectivity information into an ID is not reliable or sustainable in a relational database. I don't recall anyone in this long thread refuting that statement.

Dale Tronrud

On 12/5/2020 4:02 AM, Marcin Wojdyr wrote:
On Fri, 4 Dec 2020 at 22:36, Dale Tronrud <de...@daletronrud.com> wrote:

     It is very important not to read more meaning into a data tag than
is actually defined in the mmCIF spec.  _atom_site.label_seq_id is defined

http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_atom_site.label_seq_id.html

as a pointer into the _entity_poly_seq table.

I think you inferred the relation between tables from textual
description. This should be avoided.
The relations are defined formally in the category
_pdbx_item_linked_group_list. In this case the relation between
atom_site and entity_poly_seq has three items that are to be matched
together:

_atom_site.label_comp_id = _entity_poly_seq.mon_id
_atom_site.label_entity_id = _entity_poly_seq.entity_id
_atom_site.label_seq_id = _entity_poly_seq.num
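To make the three-item match concrete, here is a minimal sketch in Python. It assumes the two categories have already been read into lists of dictionaries; the rows and the helper name `find_unmatched` are made up for illustration and do not come from any mmCIF library.

```python
# Hypothetical, hand-written rows; a real file would be read with an mmCIF parser.
entity_poly_seq = [
    {"entity_id": "1", "num": 1, "mon_id": "MET"},
    {"entity_id": "1", "num": 2, "mon_id": "ALA"},
    {"entity_id": "1", "num": 3, "mon_id": "GLY"},
]
atom_site = [
    {"label_entity_id": "1", "label_seq_id": 3, "label_comp_id": "GLY"},
    {"label_entity_id": "1", "label_seq_id": 2, "label_comp_id": "SER"},
]


def find_unmatched(atom_rows, seq_rows):
    """Return atom_site rows that have no entity_poly_seq row matching
    on all three linked items (entity id, sequence number, monomer id)."""
    keys = {(r["entity_id"], r["num"], r["mon_id"]) for r in seq_rows}
    return [
        a for a in atom_rows
        if (a["label_entity_id"], a["label_seq_id"], a["label_comp_id"]) not in keys
    ]


unmatched = find_unmatched(atom_site, entity_poly_seq)
for row in unmatched:
    print("no matching entity_poly_seq row for", row)
```

In this made-up data the second atom row agrees on entity and sequence number but not on the monomer id, so it is reported as unmatched. The point is that the link is a composite key, not label_seq_id alone.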

  It has to be a signed
integer (although I'm not clear on what a negative value for a pointer
means).

Do you interpret "pointer" as a row index? That's not how it's used. I
don't think that you can point to a position in the mmCIF table. In
general a "pointer" could be negative or not even numeric if the value
it points to is negative or not numeric. Although in this case
label_seq_id must be >=1, because that's the allowed range
(_item_range).

In that table there is a data item _entity_poly_seq.num,

http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_entity_poly_seq.num.html

which is not a pointer, not an ID, but a name for that particular

(it's not a pointer because there is no such thing as a pointer in
the mmCIF technology. Or is there?)

_entity_poly_seq row.  It must be a number that is unique and
sequential, and presumably indicates a "sequence number".  Note that the
rows in _entity_poly_seq can be listed in the loop_ in any order.

For the record, here is the description:
"The value of _entity_poly_seq.num must uniquely and sequentially
identify a record in the ENTITY_POLY_SEQ list.
Note that this item must be a number and that the sequence
numbers must progress in increasing numerical order."

     This means that the _atom_site.label_seq_id could be "3", pointing
to the third entry in _entity_poly_seq which happens to have its .num
equal to "1".

No. If _atom_site.label_seq_id is 3, it points to the row whose _entity_poly_seq.num is 3.
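The difference between matching by value and indexing by row position can be shown with a small sketch. The rows below are invented and deliberately listed out of order, which the loop_ syntax permits; nothing here is taken from a real entry.

```python
# Hypothetical entity_poly_seq rows, deliberately not sorted by num.
rows = [
    {"entity_id": "1", "num": 3, "mon_id": "GLY"},
    {"entity_id": "1", "num": 1, "mon_id": "MET"},
    {"entity_id": "1", "num": 2, "mon_id": "ALA"},
]

label_seq_id = 3

# Matching by value: find the row whose num equals label_seq_id.
by_value = next(r for r in rows if r["num"] == label_seq_id)

# Treating the value as a 1-based row index gives a different row here.
by_position = rows[label_seq_id - 1]

print(by_value["mon_id"], by_position["mon_id"])
```

With these rows the value lookup finds GLY, while the positional lookup lands on ALA. Only the first reading is independent of the order in which the rows happen to be written.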

You may not think that someone would choose to do this,
but if the first .num is -15 you can't avoid a mismatch.  In either case
the mmCIF is perfectly acceptable and the meaning is absolutely clear.

It's formally guaranteed to be >= 1. Although it's not guaranteed that
the sequence starts with 1, because mmCIF has no way to do this. And
it's not explicitly stated in the description. So you could argue that
the sequence numbers can start with 15.
Now, the intention of _atom_site.label_seq_id has always been that
it's the position wrt the full sequence (PDB people: correct me if I'm
wrong). This is how it's interpreted by the PDB people and software.
But no one thought to explicitly write that in addition to "increasing
numerical order" the numbers must start with 1. What to do in such a
case?
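One pragmatic option is to check the convention explicitly instead of assuming it. The sketch below separates what is described above as guaranteed (values >= 1, unique) from what is only conventional (numbering starts at 1 with no gaps). The function name `check_numbering` and the sample values are assumptions made for illustration.

```python
def check_numbering(nums):
    """Report which properties hold for a list of _entity_poly_seq.num values."""
    ordered = sorted(nums)
    return {
        # Allowed range discussed above: every value must be >= 1.
        "in_range": all(n >= 1 for n in nums),
        # Each value must identify exactly one record.
        "unique": len(set(nums)) == len(nums),
        # Convention only: numbering is 1, 2, ..., N with no gaps.
        "starts_at_one_no_gaps": ordered == list(range(1, len(nums) + 1)),
    }


conventional = check_numbering([1, 2, 3])
offset = check_numbering([15, 16, 17])
print(conventional)
print(offset)
```

A sequence numbered 15, 16, 17 passes the range and uniqueness checks but fails the last one, which is exactly the case the written definition does not rule out.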

There are much better examples of inadequate descriptions.
For example, if you interpreted anisotropic ADPs according to the
mmCIF description you'd get wrong values (for non-orthogonal systems),
because the description was copied from the small-molecule spec, which
uses different axes than the PDB.
Or take auth_seq_id: originally it was used for sequence ID, then it
was changed to sequence number and the definition has not been
updated.
You could find many more examples. (<I/sigma> has been extensively
debated during PDBx/mmCIF WG meetings this year).

My take in all such cases is that it's better to interpret things the
way they are used and were intended to be interpreted rather than
hold to the wording used in the specification. In an ideal world the
specification would always be correct and would cover all corner
cases. But in the meantime it's better to focus on getting things done
with what is available.

You also can't assume that the row with
_entity_poly_seq.num equal to "3" is chemically linked to the one with
.num equal to "2", much less the chemical nature of such a link.
_entity_poly_seq is not a data table that defines chemistry, only
"sequence".

OK, so how do you propose to find links between polymer residues?
The table with connections doesn't list peptide bonds in a protein
chain - they are implicit.

Marcin

