Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]
Hi Maarten,

On 2021-09-09 17:54, Maarten L. Hekkelman wrote:
> On 09-09-2021 at 15:14, Andrius Merkys wrote:
>>> But I would not mind having a system wide service to update data files like these. Perhaps with a log with version info, so you can look up what version was used at what date.
>>
>> Indeed, it would be nice to find a generic solution, but this might be tricky. There are conflicting needs of stability (no updates), freshness (updates every day) and multi-user support (no updates and updates everyday all at once on the same machine). The only solution I can think of now is keeping all the downloaded versions with version/date in their names like:
>>
>> /var/cache/pdb/components/components-20210814.cif.gz
>> /var/cache/pdb/components/components-20210820.cif.gz
>> /var/cache/pdb/components/components-20210826.cif.gz
>> ...
>> (maybe /var/cache/pdb/components/components.cif.gz symlink to the latest)
>>
>> Then a user would use environment variable, say, PDB_COMPONENTS to point to a file with version in its name should they need a specific stable database, and would use /var/cache/pdb/components/components.cif.gz should they need the most up-to-date one.
>>
>> Does this sound reasonable?
>
> I think a bit more is required, when looking at the FAIR principles[1] I can see a few other issues coming up. What would be nice is to have e.g. a JSON file along with the data containing a hash, download date and other meta data for the data files available. Then if you store the hash (and perhaps more meta data) for the data file along with your results, you can always recover what version of the datafile was used.
>
> In the PDB-REDO database we're trying to do this for e.g. the version of all the tools used to create a record.

I agree that an additional persistent download log would be beneficial. I would prefer a linear comma-separated or tab-separated value list to simplify reading and writing, but the format is more a matter of taste :)

> [1] https://en.wikipedia.org/wiki/FAIR_data

Best,
Andrius
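A tab-separated download log of the kind preferred above could look roughly like the following. This is only a sketch: the column names, the log location and the helper names `append_log` and `version_in_use` are assumptions made for illustration and do not come from any existing package. It also assumes that entries are appended in chronological order and that dates are written in ISO format, so that plain string comparison orders them correctly.

```python
import csv
from pathlib import Path

# Hypothetical column layout for the log; not taken from any existing tool.
LOG_FIELDS = ("downloaded", "file", "sha256")


def append_log(log_path, downloaded, file_name, sha256):
    """Append one download record, writing the header first if the log is new."""
    log_path = Path(log_path)
    is_new = not log_path.exists() or log_path.stat().st_size == 0
    with open(log_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writerow(LOG_FIELDS)
        writer.writerow((downloaded, file_name, sha256))


def version_in_use(log_path, date):
    """Return the last record downloaded on or before `date`, or None.

    Assumes ISO dates (YYYY-MM-DD) and a chronologically ordered log.
    """
    latest = None
    with open(log_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["downloaded"] <= date:
                latest = row
    return latest
```

With such a log, answering "what version was used at what date" becomes a single lookup, and the file remains readable with standard text tools.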
Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]
Hi Andrius,

On 09-09-2021 at 15:14, Andrius Merkys wrote:
>> But I would not mind having a system wide service to update data files like these. Perhaps with a log with version info, so you can look up what version was used at what date.
>
> Indeed, it would be nice to find a generic solution, but this might be tricky. There are conflicting needs of stability (no updates), freshness (updates every day) and multi-user support (no updates and updates everyday all at once on the same machine). The only solution I can think of now is keeping all the downloaded versions with version/date in their names like:
>
> /var/cache/pdb/components/components-20210814.cif.gz
> /var/cache/pdb/components/components-20210820.cif.gz
> /var/cache/pdb/components/components-20210826.cif.gz
> ...
> (maybe /var/cache/pdb/components/components.cif.gz symlink to the latest)
>
> Then a user would use environment variable, say, PDB_COMPONENTS to point to a file with version in its name should they need a specific stable database, and would use /var/cache/pdb/components/components.cif.gz should they need the most up-to-date one.
>
> Does this sound reasonable?

I think a bit more is required: looking at the FAIR principles [1], I can see a few other issues coming up. What would be nice is to have e.g. a JSON file along with the data, containing a hash, download date and other metadata for the data files available. Then, if you store the hash (and perhaps more metadata) for the data file along with your results, you can always recover what version of the data file was used.

In the PDB-REDO database we're trying to do this for e.g. the version of all the tools used to create a record.

-maarten

[1] https://en.wikipedia.org/wiki/FAIR_data

-- 
Maarten L. Hekkelman http://www.hekkelman.com/
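The JSON file suggested here, stored next to the data and carrying a hash and a download date, might be produced along these lines. This is a minimal sketch and not PDB-REDO's actual scheme: the field names, the `.json` suffix convention and the function names `sha256_of` and `write_metadata` are all assumptions chosen for the example.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_metadata(data_file, source_url, downloaded_at=None):
    """Write <data_file>.json holding hash, size, source and download time."""
    data_file = Path(data_file)
    when = downloaded_at or datetime.now(timezone.utc)
    record = {
        "file": data_file.name,
        "sha256": sha256_of(data_file),
        "size": data_file.stat().st_size,
        "source": source_url,
        "downloaded": when.isoformat(timespec="seconds"),
    }
    # Hypothetical convention: the metadata file sits beside the data file.
    sidecar = data_file.with_name(data_file.name + ".json")
    sidecar.write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
    return record
```

A tool that copies the `sha256` value into its own output gives the user enough information to identify later which copy of the data file a result was computed from.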
Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]
Hi Maarten,

On 2021-09-08 22:50, Maarten L. Hekkelman wrote:
> On 8-9-2021 at 09:07, Andrius Merkys wrote:
>> I am aware of solutions to similar problems, for example, libcifpp package, which keeps an up-to-date mmcif_pdbx_v50.dic.gz at /var/cache/libcifpp/mmcif_pdbx_v50.dic.gz. This could work for components.cif.gz as well, but my main concern is whether keeping system-wide components.cif.gz up-to-date is what every user would want.
>
> The latest incarnation of libcifpp already caches a copy of components.cif.gz in the exact same location. Not in Debian yet, will upload when I have the time.
>
> The way I do it in libcifpp is place a distribution provided copy in /usr/share/libcifpp/ I also install a script that weekly fetches a fresh copy that is then installed in /var/cache/libcifpp. For both the mmcif dictionary as well as the CCD components.cif file.
>
> The update script runs only when the accompanying settings file /etc/libcifpp.conf contains the line 'update = yes'.
>
> The installation of this script is also a dpkg configuration option.

That is great news! I think I will use the libcifpp-provided components.cif.gz as a fallback for the time being. Most likely I will stick with the distribution-provided copy, but it is probably best to ask the OpenStructure/ProMod3 community what they deem to be the best approach. Nevertheless, I will ask about an environment variable to control the choice.

> I thought that covered all cases.
>
> But I would not mind having a system wide service to update data files like these. Perhaps with a log with version info, so you can look up what version was used at what date.

Indeed, it would be nice to find a generic solution, but this might be tricky. There are conflicting needs of stability (no updates), freshness (updates every day) and multi-user support (no updates and updates every day, all at once on the same machine). The only solution I can think of now is keeping all the downloaded versions with version/date in their names like:

/var/cache/pdb/components/components-20210814.cif.gz
/var/cache/pdb/components/components-20210820.cif.gz
/var/cache/pdb/components/components-20210826.cif.gz
...
(maybe /var/cache/pdb/components/components.cif.gz symlink to the latest)

Then a user would use an environment variable, say, PDB_COMPONENTS, to point to a file with the version in its name should they need a specific stable database, and would use /var/cache/pdb/components/components.cif.gz should they need the most up-to-date one.

Does this sound reasonable?

Best,
Andrius
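The layout proposed in this message (dated snapshots, a symlink to the latest one, and a PDB_COMPONENTS override) can be sketched as follows. Only the paths and the variable name come from the message; the function names `components_path` and `publish_snapshot` are hypothetical, and the sketch assumes a POSIX filesystem where the downloaded file and the cache directory live on the same volume.

```python
import os
from pathlib import Path

# Default taken from the proposal: a symlink that points at the latest snapshot.
DEFAULT_COMPONENTS = Path("/var/cache/pdb/components/components.cif.gz")


def components_path(environ=None):
    """Return the file a tool should open: PDB_COMPONENTS if set, else the default."""
    env = os.environ if environ is None else environ
    override = env.get("PDB_COMPONENTS")
    return Path(override) if override else DEFAULT_COMPONENTS


def publish_snapshot(cache_dir, downloaded, date_stamp):
    """Move a fresh download to a dated name and repoint the 'latest' symlink."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"components-{date_stamp}.cif.gz"
    os.replace(downloaded, target)
    # Build the new symlink under a temporary name, then rename it over the
    # old one so readers never observe a missing components.cif.gz.
    tmp_link = cache_dir / "components.cif.gz.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    tmp_link.symlink_to(target.name)
    os.replace(tmp_link, cache_dir / "components.cif.gz")
    return target
```

A user who needs a stable database would then export, for example, PDB_COMPONENTS=/var/cache/pdb/components/components-20210814.cif.gz, while everyone else follows the symlink. Older snapshots are never overwritten, which is what allows both groups to share one machine.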
Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]
On 8-9-2021 at 09:07, Andrius Merkys wrote:
> I am aware of solutions to similar problems, for example, libcifpp package, which keeps an up-to-date mmcif_pdbx_v50.dic.gz at /var/cache/libcifpp/mmcif_pdbx_v50.dic.gz. This could work for components.cif.gz as well, but my main concern is whether keeping system-wide components.cif.gz up-to-date is what every user would want.

The latest incarnation of libcifpp already caches a copy of components.cif.gz in the exact same location. Not in Debian yet, will upload when I have the time.

The way I do it in libcifpp is to place a distribution-provided copy in /usr/share/libcifpp/. I also install a script that fetches a fresh copy weekly, which is then installed in /var/cache/libcifpp. This applies to both the mmCIF dictionary and the CCD components.cif file.

The update script runs only when the accompanying settings file /etc/libcifpp.conf contains the line 'update = yes'.

The installation of this script is also a dpkg configuration option.

I thought that covered all cases.

But I would not mind having a system-wide service to update data files like these. Perhaps with a log with version info, so you can look up what version was used at what date.

-maarten
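The arrangement described in this message can be illustrated with a short sketch. This is not libcifpp's actual update script: the helper names `update_enabled` and `pick_data_file` are hypothetical, the `#` comment handling is an assumption about the settings file, and preferring the cached copy over the distribution copy is an assumed lookup order. Only the two directories and the 'update = yes' line come from the message.

```python
from pathlib import Path


def update_enabled(conf_text):
    """Return True only if the settings text contains the line 'update = yes'."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumed: '#' starts a comment
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "update":
            return value.lower() == "yes"
    return False


def pick_data_file(name, cache_dir="/var/cache/libcifpp", dist_dir="/usr/share/libcifpp"):
    """Prefer a fetched copy in the cache, else the distribution-provided copy."""
    cached = Path(cache_dir) / name
    return cached if cached.exists() else Path(dist_dir) / name
```

A weekly job would read /etc/libcifpp.conf, call `update_enabled` on its contents, and download a fresh file into the cache directory only when that returns True, so machines that leave updates off keep using the packaged copy.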
Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]
Hi Michael,

On 2021-09-08 11:50, Michael Crusoe wrote:
> I would advocate for a local copy (if missing) and an environment variable to override so that users can get a newer/different version.

A fallback copy sounds good. Perhaps it would be best to package it in a separate source/binary package to maintain its independence. From codesearch.d.o [1] it seems that more source packages would be happy to use it. I will talk to upstream about an environment variable with a sensible default.

> I would also encourage upstream to find a way to embed a hash + download date in their logs and outputs, if possible.

Keeping track of such things is usually left to the user, but I agree that improving the provenance record makes sense.

> We should also ask PDB to version their files. Do they keep old versions around?

This components.cif.gz is a database of chemical compounds, and each compound entry has its modification date. Thus the latest date in components.cif.gz could be treated as some sort of version identification for the database. As for old versions, I need to ask. I cannot seem to find them on their FTP server.

[1] https://codesearch.debian.net/search?q=components.cif=1

Thanks,
Andrius
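The idea of treating the latest modification date as a version identifier could be sketched like this. It assumes that each entry records its date in the mmCIF item `_chem_comp.pdbx_modified_date`, written as a simple key and value pair on one line with a YYYY-MM-DD value; both points should be checked against the actual file. The function name `latest_modification_date` is hypothetical.

```python
import gzip
import re

# Assumed item name and single-line key/value layout; verify against the real file.
DATE_ITEM = re.compile(r"^_chem_comp\.pdbx_modified_date\s+(\d{4}-\d{2}-\d{2})\s*$")


def latest_modification_date(path):
    """Scan a gzipped components file and return the newest modification date."""
    latest = None
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = DATE_ITEM.match(line)
            # ISO dates sort correctly as plain strings.
            if match and (latest is None or match.group(1) > latest):
                latest = match.group(1)
    return latest
```

The file is streamed line by line rather than parsed as full mmCIF, which keeps the sketch dependency-free. Note that such a date identifies a snapshot less strictly than a checksum does, since two different downloads could share the same latest date.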
Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]
I would advocate for a local copy (if missing) and an environment variable to override so that users can get a newer/different version.

I would also encourage upstream to find a way to embed a hash + download date in their logs and outputs, if possible.

We should also ask PDB to version their files. Do they keep old versions around?

-- 
Michael R. Crusoe

On Wed, Sep 8, 2021, 09:07 Andrius Merkys wrote:
> Hi all,
>
> On 2021-07-19 10:24, Nilesh Patra wrote:
>> On 19 July 2021 12:50:03 pm IST, Andrius Merkys wrote:
>>> Currently I am looking into ProMod3 [3], which seems to be the engine behind the great SWISS-MODEL service [4]. I seem to have figured out the dependencies, will go on to packaging next.
>> Let us know if you need help with packaging the chain, in case you need helping hands :-)
>
> So here I am asking for help/suggestions :)
>
> Problem: OpenStructure, a dependency of ProMod3, requires PDB components library, components.cif.gz, for some of its protein modeling routines. This library is provided by the PDB at [1] and is itself freely distributable (PDB discourages from modifying it though), but is updated quite often and does not get a version number. Furthermore, people often prefer to obtain the most up-to-date copy of components.cif.gz for their research, thus providing it in a Debian package of its own would not be very convenient.
>
> I am aware of solutions to similar problems, for example, libcifpp package, which keeps an up-to-date mmcif_pdbx_v50.dic.gz at /var/cache/libcifpp/mmcif_pdbx_v50.dic.gz. This could work for components.cif.gz as well, but my main concern is whether keeping system-wide components.cif.gz up-to-date is what every user would want.
>
> As a researcher I do my best to perform reproducible science. Thus I want to know precise versions/timestamps/checksums of my input databases, and have them suddenly change overnight is something akin to a nightmare. What is more, there might be more than one user on a machine wanting different versions of components.cif.gz.
>
> Thus my candidate solution for providing components.cif.gz for OpenStructure would be to talk to the upstream to implement an environment variable allowing for greater flexibility. Or maybe there are other solutions?
>
> [1] ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif.gz
>
> Best,
> Andrius
Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]
Hi all,

On 2021-07-19 10:24, Nilesh Patra wrote:
> On 19 July 2021 12:50:03 pm IST, Andrius Merkys wrote:
>> Currently I am looking into ProMod3 [3], which seems to be the engine behind the great SWISS-MODEL service [4]. I seem to have figured out the dependencies, will go on to packaging next.
> Let us know if you need help with packaging the chain, in case you need helping hands :-)

So here I am asking for help/suggestions :)

Problem: OpenStructure, a dependency of ProMod3, requires the PDB components library, components.cif.gz, for some of its protein modeling routines. This library is provided by the PDB at [1] and is itself freely distributable (though the PDB discourages modifying it), but it is updated quite often and does not get a version number. Furthermore, people often prefer to obtain the most up-to-date copy of components.cif.gz for their research, so providing it in a Debian package of its own would not be very convenient.

I am aware of solutions to similar problems, for example the libcifpp package, which keeps an up-to-date mmcif_pdbx_v50.dic.gz at /var/cache/libcifpp/mmcif_pdbx_v50.dic.gz. This could work for components.cif.gz as well, but my main concern is whether keeping a system-wide components.cif.gz up-to-date is what every user would want.

As a researcher I do my best to perform reproducible science. Thus I want to know the precise versions/timestamps/checksums of my input databases, and having them suddenly change overnight is something akin to a nightmare. What is more, there might be more than one user on a machine wanting different versions of components.cif.gz.

Thus my candidate solution for providing components.cif.gz for OpenStructure would be to talk to upstream about implementing an environment variable allowing for greater flexibility. Or maybe there are other solutions?

[1] ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif.gz

Best,
Andrius
Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced
On 19.07.21 09:28, Michael Banck wrote:
> On Sun, Jul 18, 2021 at 08:47:23PM +0200, Steffen Möller wrote:
>> Following the references in
>>
>> https://www.nature.com/articles/d41586-021-01968-y
>>
>> I found this reference
>>
>> https://github.com/RosettaCommons/RoseTTAFold/tags
>>
>> but I admittedly cannot tell that I'd have fully grasped who is doing what and publishes what software where, yet.
>
> https://arstechnica.com/science/2021/07/google-details-its-protein-folding-software-academics-offer-an-alternative/
> tried to break it down.

Reddit and Michael from Rechenkraft.net pointed me to

https://www.nature.com/articles/s41586-021-03828-1

which announces AlphaFold predictions for all human proteins - ready to download from the EBI.

In my mind this makes it more, not less, important to get this all into Debian, so people can also start fiddling with the model and not only create new predictions (which they should also do).

Best,
Steffen
Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced
On 19.07.21 20:03, Andrius Merkys wrote:
> Hi Nilesh,
>
> On 2021-07-19 10:24, Nilesh Patra wrote:
>> On 19 July 2021 12:50:03 pm IST, Andrius Merkys wrote:
>>> Currently I am looking into ProMod3 [3], which seems to be the engine behind the great SWISS-MODEL service [4]. I seem to have figured out the dependencies, will go on to packaging next.
>> Let us know if you need help with packaging the chain, in case you need helping hands :-)
>
> Thanks a lot for your kind offer! I will let you know as soon as I reach a point of effort parallelization :)

I looked at the dependencies for AlphaFold and broke them down on

https://docs.google.com/spreadsheets/d/1tApLhVqxRZ2VOuMH_aPUgFENQJfbLlB_PFH_Ah_q7hM/edit#gid=1840067013

May I ask you to add the dependencies of SWISS-MODEL/ProMod3 to that sheet, too?

Best,
Steffen
Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced
Hi Nilesh,

On 2021-07-19 10:24, Nilesh Patra wrote:
> On 19 July 2021 12:50:03 pm IST, Andrius Merkys wrote:
>> Currently I am looking into ProMod3 [3], which seems to be the engine behind the great SWISS-MODEL service [4]. I seem to have figured out the dependencies, will go on to packaging next.
> Let us know if you need help with packaging the chain, in case you need helping hands :-)

Thanks a lot for your kind offer! I will let you know as soon as I reach a point of effort parallelization :)

Best,
Andrius
Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced
On Sun, Jul 18, 2021 at 08:47:23PM +0200, Steffen Möller wrote:
> Following the references in
>
> https://www.nature.com/articles/d41586-021-01968-y
>
> I found this reference
>
> https://github.com/RosettaCommons/RoseTTAFold/tags
>
> but I admittedly cannot tell that I'd have fully grasped who is doing what and publishes what software where, yet.

https://arstechnica.com/science/2021/07/google-details-its-protein-folding-software-academics-offer-an-alternative/
tried to break it down.

Michael
Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced
On 19 July 2021 12:50:03 pm IST, Andrius Merkys wrote:
> Currently I am looking into ProMod3 [3], which seems to be the engine behind the great SWISS-MODEL service [4]. I seem to have figured out the dependencies, will go on to packaging next.

Let us know if you need help with packaging the chain, in case you need helping hands :-)

> [3] https://www.openstructure.org/promod3
> [4] https://swissmodel.expasy.org/interactive

Nilesh

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced
Hi Steffen,

On 2021-07-18 21:47, Steffen Möller wrote:
> Following the references in
>
> https://www.nature.com/articles/d41586-021-01968-y
>
> I found this reference
>
> https://github.com/RosettaCommons/RoseTTAFold/tags
>
> but I admittedly cannot tell that I'd have fully grasped who is doing what and publishes what software where, yet.

This would definitely be great to have in Debian. I expect this protein structure predictor depends heavily on pre-trained neural networks. Making them DFSG-compliant might be very difficult (see the Unofficial Policy for Debian & Machine Learning [1]).

> The context:
>
> The biochemistry happens in 3D as proteins (and some functional RNA), while all we easily get are 1D DNA sequence data from which one can predict the sequences of amino acids that form the protein. To get from that polymer of amino acids to the 3D structure one typically looks for patterns of sequences that have been observed in structures that have already been determined. Or one try computing that structure de novo, i.e. without a template, and ... wait.
>
> Once we have the protein structures one can better understand the effect of mutations, and start other sorts of simulations, like disturbing protein interactions with compounds, i.e. find new drugs.

I am particularly interested in this. However, this field is quite underrepresented in Debian, AFAIK. Some time ago I packaged MacroMoleculeBuilder [2], which does homology modeling. All I have tried is a couple of examples, but it is definitely worth a look.

Currently I am looking into ProMod3 [3], which seems to be the engine behind the great SWISS-MODEL service [4]. I seem to have figured out the dependencies, will go on to packaging next.

[1] https://salsa.debian.org/deeplearning-team/ml-policy
[2] https://simtk.org/projects/rnatoolbox
[3] https://www.openstructure.org/promod3
[4] https://swissmodel.expasy.org/interactive

Best wishes,
Andrius
DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced
Following the references in

https://www.nature.com/articles/d41586-021-01968-y

I found this reference

https://github.com/RosettaCommons/RoseTTAFold/tags

but I admittedly cannot yet tell that I have fully grasped who is doing what and who publishes what software where.

The context:

The biochemistry happens in 3D as proteins (and some functional RNA), while all we easily get are 1D DNA sequence data from which one can predict the sequences of amino acids that form the protein. To get from that polymer of amino acids to the 3D structure one typically looks for patterns of sequences that have been observed in structures that have already been determined. Or one tries computing that structure de novo, i.e. without a template, and ... waits.

Once we have the protein structures one can better understand the effect of mutations, and start other sorts of simulations, like disturbing protein interactions with compounds, i.e. finding new drugs.

Best,
Steffen