Thanks for the update. I've been keeping track of what formats are supported by the PDB for a while. Just in case anyone finds it useful, here are my notes.
wwPDB distributes files in three formats: mmCIF (primary, full name: PDBx/mmCIF), PDB (legacy), and PDBML (mmCIF in XML, probably to avoid parsing CIF). Since none of them is well-suited for molecular graphics web apps, several years ago three new formats were introduced: MMTF from RCSB, mmJSON from PDBj and BinaryCIF from PDBe.

The MMTF project focused on file compression, but it wasn't just transcribing mmCIF; it was also rethinking what is needed and how to simplify things. It included outreach to the community and, as a result, the format is now supported by PyMOL, Jmol, etc. But it's deprecated.

mmJSON is a 1:1 transcription of mmCIF in JSON. As such, it's more useful than PDBML, except that it's only available from PDBj and doesn't even have a spec.

BinaryCIF combines the 1:1 approach with the compression schemes from MMTF (i.e. it uses different compression schemes for different fields in a coordinate file).

To make a quick comparison, I took the largest file in the PDB: 8glv. Here are the sizes in MB (gzipped -> uncompressed):

  mmCIF     (cif.gz):    84 ->  432
  PDBML     (xml.gz):   114 -> 4076
  MMTF      (mmtf.gz):   24 ->   37
  mmJSON    (json.gz):   51 ->  484
  BinaryCIF (bcif.gz):   24 ->   45

In Python, I got:

  json.gz parsed in 9.5s using the built-in gzip and json modules,
  bcif.gz parsed in 24s using the py-mmcif module from RCSB,
  xml.gz didn't parse with built-in ElementTree or lxml on my box – 16GB of memory is not enough.

For comparison, cif.gz is parsed using gemmi in about 10s.

mmJSON is larger than BinaryCIF, but it's parsed 2.5x faster. I guess in C++ it would be 10x faster than in Python (using zlib-ng instead of zlib reduces decompression time from 2s to 0.5s; libraries such as simdjson and glaze can parse a few GBs of JSON per second). Possibly, the same is true of BinaryCIF, which currently doesn't have a C++ library (although the Python module is a C++ extension).

So, given that MMTF is now deprecated, we have three CIF-avoiding formats: PDBML, mmJSON and BinaryCIF.
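For anyone who wants to repeat the json.gz measurement, here is a minimal sketch of that kind of test, using only the gzip and json modules from the standard library. The function names (load_mmjson, count_rows) are made up for this example, and count_rows assumes the column-oriented mmJSON layout (one top-level data block, each category mapping item names to arrays of values). It is not necessarily the exact script behind the timings above.

```python
import gzip
import json
import time


def load_mmjson(path):
    """Read a gzipped mmJSON file; return (parsed document, seconds elapsed)."""
    start = time.perf_counter()
    # gzip.open in text mode decompresses on the fly; json.load reads the stream.
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        doc = json.load(stream)
    return doc, time.perf_counter() - start


def count_rows(doc, category="atom_site"):
    """Count rows in one category of the first data block.

    Assumes a column-oriented layout: {"data_XXXX": {category: {item: [values]}}}.
    Returns 0 if the category is absent.
    """
    block = next(iter(doc.values()))
    columns = block.get(category, {})
    return max((len(values) for values in columns.values()), default=0)
```

With a local copy of the file (the path here is hypothetical), `doc, seconds = load_mmjson("8glv.json.gz")` followed by `count_rows(doc)` gives the elapsed time and the number of atom_site rows. The timing covers both decompression and JSON parsing, so it will depend on the machine and the Python version.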
If the number of formats were to be reduced further, IMO it'd be best to leave one simple, well-documented JSON-based format. It'd check all the boxes, except perhaps when network bandwidth is truly the bottleneck and it's worth compressing the coordinates as much as possible.

Marcin

On Fri, Jan 19, 2024 at 8:57 PM Jose Duarte <0000ac8ce2e7d24d-dmarc-requ...@jiscmail.ac.uk> wrote:
>
> From July 2024 the PDB file archive will not be offered in the compressed
> MMTF format anymore. Users are strongly encouraged to switch to the BinaryCIF
> format, which has been available since 2020. Details on how to access
> BinaryCIF (BCIF) data files for the entire PDB archive are available here.
>
> RCSB PDB support is ready to assist with any issues or questions at
> i...@rcsb.org.
>
> Best wishes
>
> Jose
>
> ---
> Jose Duarte
> RCSB Protein Data Bank
> San Diego Supercomputer Center
> UC San Diego
> La Jolla CA, USA
>
> ________________________________
>
> To unsubscribe from the CCP4BB list, click the following link:
> https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1