Thanks for the update. I've been keeping track of what formats are
supported by the PDB for a while. Just in case anyone finds it useful,
here are my notes.

wwPDB distributes files in three formats: mmCIF (primary, full name:
PDBx/mmCIF), PDB (legacy), and PDBML (mmCIF in XML, probably to avoid
parsing CIF).
Since none of them are well-suited for molecular graphics web apps,
several years ago three new formats were introduced: MMTF from RCSB,
mmJSON from PDBj and BinaryCIF from PDBe.

The MMTF project was focused on compression of files, but it wasn't
just transcribing mmCIF, it was also rethinking what is needed and how
to simplify things. It included outreach to the community and as a
result the format is now supported by PyMOL, Jmol, etc. But it's
now deprecated.

mmJSON is a 1:1 transcription of mmCIF in JSON. As such, it's more
useful than PDBML, except that it's only available from PDBj and
doesn't even have a spec.

BinaryCIF combines the 1:1 approach with the compression schemes from
MMTF (i.e. it uses different compression schemes for different fields
in a coordinate file).
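
As a toy illustration of why per-field schemes pay off, here is a minimal
Python sketch of two of the encodings that BinaryCIF defines, delta and
run-length, chained on an integer column. The function names are made up
for this example, and this is not the actual BinaryCIF implementation,
which also covers byte packing, fixed-point floats, string arrays, etc.

```python
from itertools import groupby


def delta_encode(values):
    """Keep the first value, then store successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Collapse runs into a flat [value, count, value, count, ...] list."""
    out = []
    for value, run in groupby(values):
        out.extend([value, len(list(run))])
    return out


def run_length_decode(pairs):
    """Invert run_length_encode."""
    out = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out.extend([value] * count)
    return out


# A column of consecutive serial numbers (think atom_site.id):
atom_ids = list(range(1, 1001))
encoded_ids = run_length_encode(delta_encode(atom_ids))

# A column with long runs of equal values (think residue numbers):
residue_numbers = [5, 5, 5, 6, 6, 7]
encoded_residues = run_length_encode(residue_numbers)
```

With consecutive IDs, every delta is 1, so the 1000-element column shrinks
to a single (value, count) pair. A column of coordinates would gain nothing
from this chain, which is why the scheme is chosen per field.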

To make a quick comparison, I took the largest file in the PDB: 8glv.

Here are the sizes (MB, gzipped -> uncompressed):
  mmCIF (cif.gz): 84 -> 432
  PDBML (xml.gz): 114 -> 4076
  MMTF (mmtf.gz): 24 -> 37
  mmJSON (json.gz): 51 -> 484
  BinaryCIF (bcif.gz): 24 -> 45
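
The expansion ratios can be easier to compare than the raw sizes. A few
lines of Python compute them from the numbers listed above (nothing new
is measured here; the dictionary name is just for this sketch):

```python
# (gzipped MB, uncompressed MB) for 8glv, copied from the list above
sizes_mb = {
    "mmCIF": (84, 432),
    "PDBML": (114, 4076),
    "MMTF": (24, 37),
    "mmJSON": (51, 484),
    "BinaryCIF": (24, 45),
}

# How many times larger each file gets after gunzip
ratios = {name: raw / gz for name, (gz, raw) in sizes_mb.items()}

for name, (gz, raw) in sizes_mb.items():
    print(f"{name:10s} {gz:4d} -> {raw:5d} MB  ({ratios[name]:.1f}x)")
```

The binary formats barely shrink under gzip because they are already
compressed internally, while the text formats rely on gzip for most of
their size reduction.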

In Python, I got:
  json.gz parsed in 9.5s using the built-in gzip and json modules,
  bcif.gz parsed in 24s using the py-mmcif module from RCSB,
  xml.gz didn't parse with either the built-in ElementTree or lxml
  on my box (16GB of memory is not enough).
For comparison, cif.gz is parsed using gemmi in about 10s.
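
For reference, the json.gz case needs nothing beyond the standard library.
The sketch below is one way to do it and to time it; the helper names are
made up here, and the exact script used for the numbers above is not
shown in this message.

```python
import gzip
import json
import time


def load_json_gz(path):
    """Decompress and parse a gzipped JSON file in one pass."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def timed_load(path):
    """Return the parsed document and the wall-clock seconds it took."""
    start = time.perf_counter()
    doc = load_json_gz(path)
    return doc, time.perf_counter() - start
```

Calling timed_load("8glv.json.gz") on a locally downloaded file (the file
name is an assumption) returns a plain dict. As far as I know, mmJSON puts
a single "data_..." block at the top level, with categories as nested
dicts and columns as lists, so no format-specific library is needed to
walk it.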

mmJSON is larger than BinaryCIF, but it's parsed 2.5x faster. I guess
in C++ it would be 10x faster than in Python (using zlib-ng instead of
zlib reduces decompression time from 2s to 0.5s; libraries such as
simdjson and glaze can parse a few GBs of JSON per second). Possibly
the same is true of BinaryCIF, which currently doesn't have a C++
library (although the Python module is a C++ extension).

So, given that MMTF is now deprecated, we have three CIF-avoiding
formats: PDBML, mmJSON and BinaryCIF. If this number were to be
further reduced, IMO it'd be best to leave one simple, well-documented
JSON-based format. It'd check all the boxes, except perhaps when the
network bandwidth is truly the bottleneck and it's worth compressing
the coordinates as much as possible.

Marcin

On Fri, Jan 19, 2024 at 8:57 PM Jose Duarte
<0000ac8ce2e7d24d-dmarc-requ...@jiscmail.ac.uk> wrote:
>
> From July 2024 the PDB file archive will not be offered in the compressed 
> MMTF format anymore. Users are strongly encouraged to switch to the BinaryCIF 
> format, which has been available since 2020. Details on how to access 
> BinaryCIF (BCIF) data files for the entire PDB archive are available here.
>
> RCSB PDB support is ready to assist with any issues or questions at 
> i...@rcsb.org.
>
> Best wishes
>
> Jose
>
> ---
> Jose Duarte
> RCSB Protein Data Bank
> San Diego Supercomputer Center
> UC San Diego
> La Jolla CA, USA
>
>
> ________________________________
>
> To unsubscribe from the CCP4BB list, click the following link:
> https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list 
hosted by www.jiscmail.ac.uk, terms & conditions are available at 
https://www.jiscmail.ac.uk/policyandsecurity/
