Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]

2021-09-13 Thread Andrius Merkys
Hi Maarten,

On 2021-09-09 17:54, Maarten L. Hekkelman wrote:
> On 09-09-2021 at 15:14, Andrius Merkys wrote:
>>> But I would not mind having a system wide service to update data files
>>> like these. Perhaps with a log with version info, so you can look up
>>> what version was used at what date.
>> Indeed, it would be nice to find a generic solution, but this might be
>> tricky. There are conflicting needs of stability (no updates), freshness
>> (updates every day) and multi-user support (no updates and updates
>> everyday all at once on the same machine). The only solution I can think
>> of now is keeping all the downloaded versions with version/date in their
>> names like:
>>
>> /var/cache/pdb/components/components-20210814.cif.gz
>> /var/cache/pdb/components/components-20210820.cif.gz
>> /var/cache/pdb/components/components-20210826.cif.gz
>> ...
>> (maybe /var/cache/pdb/components/components.cif.gz symlink to the latest)
>>
>> Then a user would use environment variable, say, PDB_COMPONENTS to point
>> to a file with version in its name should they need a specific stable
>> database, and would use /var/cache/pdb/components/components.cif.gz
>> should they need the most up-to-date one.
>>
>> Does this sound reasonable?
> 
> I think a bit more is required, when looking at the FAIR principles[1] I
> can see a few other issues coming up. What would be nice is to have e.g.
> a JSON file along with the data containing a hash, download date and
> other meta data for the data files available. Then if you store the hash
> (and perhaps more meta data) for the data file along with your results,
> you can always recover what version of the datafile was used.
> 
> In the PDB-REDO database we're trying to do this for e.g. the version of
> all the tools used to create a record.

I agree that an additional persistent download log would be beneficial. I
would prefer a linear comma-separated or tab-separated value list to
simplify reading and writing, but the format is more a matter of taste :)
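
A minimal sketch of such a tab-separated download log, in Python. The column
layout (UTC timestamp, SHA-256, file name, source URL) and the function names
are assumptions made for illustration; nothing here is an agreed format.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def append_log_entry(log_path: Path, data_path: Path, source_url: str) -> list:
    """Append one line to the log: UTC timestamp, SHA-256, file name, source URL."""
    row = [
        datetime.now(timezone.utc).isoformat(timespec="seconds"),
        sha256_of(data_path),
        data_path.name,
        source_url,
    ]
    # Append mode keeps earlier entries, so the log stays a linear history.
    with log_path.open("a", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerow(row)
    return row
```

Each download adds exactly one line, so the log can be read back with cut, awk
or any spreadsheet without a dedicated parser.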

> [1] https://en.wikipedia.org/wiki/FAIR_data

Best,
Andrius



Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]

2021-09-09 Thread Maarten L. Hekkelman

Hi Andrius,

On 09-09-2021 at 15:14, Andrius Merkys wrote:

>> But I would not mind having a system wide service to update data files
>> like these. Perhaps with a log with version info, so you can look up
>> what version was used at what date.
>
> Indeed, it would be nice to find a generic solution, but this might be
> tricky. There are conflicting needs of stability (no updates), freshness
> (updates every day) and multi-user support (no updates and updates
> everyday all at once on the same machine). The only solution I can think
> of now is keeping all the downloaded versions with version/date in their
> names like:
>
> /var/cache/pdb/components/components-20210814.cif.gz
> /var/cache/pdb/components/components-20210820.cif.gz
> /var/cache/pdb/components/components-20210826.cif.gz
> ...
> (maybe /var/cache/pdb/components/components.cif.gz symlink to the latest)
>
> Then a user would use environment variable, say, PDB_COMPONENTS to point
> to a file with version in its name should they need a specific stable
> database, and would use /var/cache/pdb/components/components.cif.gz
> should they need the most up-to-date one.
>
> Does this sound reasonable?


I think a bit more is required. Looking at the FAIR principles [1], I
can see a few other issues coming up. What would be nice is to have e.g.
a JSON file along with the data containing a hash, download date and
other metadata for the data files available. Then if you store the hash
(and perhaps more metadata) for the data file along with your results,
you can always recover what version of the data file was used.
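
As a rough sketch of that idea in Python: a JSON sidecar file written next to
the data file. The field names (file, sha256, downloaded, source) and the
".json" suffix convention are assumptions for illustration, not an existing
format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(data_path: Path, source_url: str) -> Path:
    """Write <data file>.json with the hash, download date and source URL."""
    metadata = {
        "file": data_path.name,
        # read_bytes() loads the whole file; fine for a sketch, but a chunked
        # read would be kinder to memory for large archives.
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "downloaded": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "source": source_url,
    }
    sidecar = data_path.with_name(data_path.name + ".json")
    sidecar.write_text(json.dumps(metadata, indent=2) + "\n")
    return sidecar


def matches_sidecar(data_path: Path, sidecar: Path) -> bool:
    """Check that the data file still has the hash recorded in its sidecar."""
    recorded = json.loads(sidecar.read_text())["sha256"]
    return hashlib.sha256(data_path.read_bytes()).hexdigest() == recorded
```

Storing the "sha256" value with the results is then enough to tell later which
copy of the data file produced them.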


In the PDB-REDO database we're trying to do this for e.g. the version of 
all the tools used to create a record.


-maarten

[1] https://en.wikipedia.org/wiki/FAIR_data

--
Maarten L. Hekkelman
http://www.hekkelman.com/



Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]

2021-09-09 Thread Andrius Merkys
Hi Maarten,

On 2021-09-08 22:50, Maarten L. Hekkelman wrote:
> 
> On 8-9-2021 at 09:07, Andrius Merkys wrote:
>> I am aware of solutions to similar problems, for example, libcifpp
>> package, which keeps an up-to-date mmcif_pdbx_v50.dic.gz at
>> /var/cache/libcifpp/mmcif_pdbx_v50.dic.gz. This could work for
>> components.cif.gz as well, but my main concern is whether keeping
>> system-wide components.cif.gz up-to-date is what every user would want.
> 
> The latest incarnation of libcifpp already caches a copy of
> components.cif.gz in the exact same location. Not in Debian yet, will
> upload when I have the time.
> 
> The way I do it in libcifpp is place a distribution provided copy in
> /usr/share/libcifpp/ I also install a script that weekly fetches a fresh
> copy that is then installed in /var/cache/libcifpp. For both the mmcif
> dictionary as well as the CCD components.cif file.
> 
> The update script runs only when the accompanying settings file
> /etc/libcifpp.conf contains the line 'update = yes'.
> 
> The installation of this script is also a dpkg configuration option.

That is great news! I think I will use the libcifpp-provided
components.cif.gz as a fallback for the time being. Most likely I will
stick with the distribution-provided copy, but it is probably best to
ask the OpenStructure/ProMod3 community what they deem to be the best
approach. Nevertheless, I will ask about an environment variable to
control the choice.

> I thought that covered all cases.
> 
> But I would not mind having a system wide service to update data files
> like these. Perhaps with a log with version info, so you can look up
> what version was used at what date.

Indeed, it would be nice to find a generic solution, but this might be
tricky. There are conflicting needs of stability (no updates), freshness
(updates every day) and multi-user support (no updates and daily updates
both at once on the same machine). The only solution I can think
of now is keeping all the downloaded versions with version/date in their
names like:

/var/cache/pdb/components/components-20210814.cif.gz
/var/cache/pdb/components/components-20210820.cif.gz
/var/cache/pdb/components/components-20210826.cif.gz
...
(maybe /var/cache/pdb/components/components.cif.gz symlink to the latest)
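
A minimal Python sketch of how such a cache could be maintained. The function
name and the temporary link name are hypothetical; the directory layout follows
the listing above.

```python
import os
import shutil
from datetime import date
from pathlib import Path


def store_snapshot(downloaded: Path, cache_dir: Path, day: date) -> Path:
    """Copy a fresh download under a dated name and repoint the 'latest' symlink."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    snapshot = cache_dir / f"components-{day:%Y%m%d}.cif.gz"
    shutil.copyfile(downloaded, snapshot)

    # Create the new symlink under a temporary name, then rename it over the
    # old one, so readers never find components.cif.gz missing.
    latest = cache_dir / "components.cif.gz"
    temporary = cache_dir / "components.cif.gz.tmp"
    if temporary.is_symlink() or temporary.exists():
        temporary.unlink()
    temporary.symlink_to(snapshot.name)  # relative link within cache_dir
    os.replace(temporary, latest)
    return snapshot
```

Older snapshots are never touched, so anyone pinned to a dated file keeps
getting the same bytes.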

Then a user would set an environment variable, say, PDB_COMPONENTS, to point
to a file with a version in its name should they need a specific stable
database, and would use /var/cache/pdb/components/components.cif.gz
should they need the most up-to-date one.
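
On the consumer side this could look roughly like the following Python sketch.
PDB_COMPONENTS is the variable name proposed above; the function name is
hypothetical and no existing tool is known to implement this.

```python
import os
from pathlib import Path

# Default: the symlink that always points at the most recent download.
DEFAULT_COMPONENTS = Path("/var/cache/pdb/components/components.cif.gz")


def components_path(environ=os.environ) -> Path:
    """Return the file named by PDB_COMPONENTS, else the system-wide default."""
    value = environ.get("PDB_COMPONENTS")
    return Path(value) if value else DEFAULT_COMPONENTS
```

Because the variable is per process, two users on the same machine can pin
different versions without affecting each other.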

Does this sound reasonable?

Best,
Andrius



Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]

2021-09-08 Thread Maarten L. Hekkelman



On 8-9-2021 at 09:07, Andrius Merkys wrote:

> I am aware of solutions to similar problems, for example, libcifpp
> package, which keeps an up-to-date mmcif_pdbx_v50.dic.gz at
> /var/cache/libcifpp/mmcif_pdbx_v50.dic.gz. This could work for
> components.cif.gz as well, but my main concern is whether keeping
> system-wide components.cif.gz up-to-date is what every user would want.


The latest incarnation of libcifpp already caches a copy of 
components.cif.gz in the exact same location. Not in Debian yet, will 
upload when I have the time.


The way I do it in libcifpp is to place a distribution-provided copy in
/usr/share/libcifpp/. I also install a script that fetches a fresh copy
weekly, which is then installed in /var/cache/libcifpp. This applies to
both the mmCIF dictionary and the CCD components.cif file.


The update script runs only when the accompanying settings file 
/etc/libcifpp.conf contains the line 'update = yes'.


The installation of this script is also a dpkg configuration option.
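
For illustration only, the scheme could be sketched in Python as below. This is
not libcifpp's actual code: the function names, the comment handling in the
settings file and the lookup order (cache first, then the distribution copy)
are assumptions.

```python
from pathlib import Path


def update_enabled(conf_text: str) -> bool:
    """Return True if the first 'update' setting in the text is 'yes'."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumed '#' comment syntax
        key, sep, value = line.partition("=")
        if sep and key.strip() == "update":
            return value.strip() == "yes"
    return False


def find_data_file(name: str,
                   cache_dir: Path = Path("/var/cache/libcifpp"),
                   share_dir: Path = Path("/usr/share/libcifpp")):
    """Prefer the refreshed copy in the cache, else the distribution copy."""
    for directory in (cache_dir, share_dir):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None  # neither copy is installed
```

With updates switched off the cache stays empty and the lookup falls through to
the packaged file, which gives the stable behaviour by default.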

I thought that covered all cases.

But I would not mind having a system wide service to update data files 
like these. Perhaps with a log with version info, so you can look up 
what version was used at what date.


-maarten



Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]

2021-09-08 Thread Andrius Merkys
Hi Michael,

On 2021-09-08 11:50, Michael Crusoe wrote:
> I would advocate for a local copy (if missing) and an environment
> variable to override so that users can get a newer/different version.

A fallback copy sounds good. Perhaps it would be best to package it in a
separate source/binary package to maintain its independence. From
codesearch.d.o [1] it seems that more source packages would be happy to
use it.

I will talk to the upstream about an environment variable with a
sensible default.

> I would also encourage upstream to find a way to embed a hash + download
> date in their logs and outputs, if possible.

Keeping track of such things is usually left to the user, but I agree
that improving the provenance record makes sense.

> We should also ask PDB to version their files. Do they keep old versions
> around?

This components.cif.gz is a database of chemical compounds, and each
compound entry has its modification date. Thus the latest date in
components.cif.gz could be treated as some sort of version
identification for the database. As for old versions, I need to ask. I
do not seem to find them on their FTP server.
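
A rough Python sketch of deriving such a version identifier. It assumes that
each entry stores its modification date on a single line as
"_chem_comp.pdbx_modified_date YYYY-MM-DD"; entries laid out differently would
be skipped by this simple scan. The function name is hypothetical.

```python
import gzip
from datetime import date
from typing import Optional

MODIFIED_TAG = "_chem_comp.pdbx_modified_date"


def latest_modified_date(path) -> Optional[date]:
    """Return the newest modification date found in a components.cif.gz file."""
    latest = None
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2 or fields[0] != MODIFIED_TAG:
                continue
            try:
                modified = date.fromisoformat(fields[1])
            except ValueError:
                continue  # skip placeholders such as "?" or "."
            if latest is None or modified > latest:
                latest = modified
    return latest
```

The result identifies a release only as long as every update touches at least
one entry's date, so a file hash remains the safer identifier.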

[1] https://codesearch.debian.net/search?q=components.cif=1

Thanks,
Andrius



Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]

2021-09-08 Thread Michael Crusoe
I would advocate for a local copy (if missing) and an environment variable
to override so that users can get a newer/different version.

I would also encourage upstream to find a way to embed a hash + download
date in their logs and outputs, if possible.

We should also ask PDB to version their files. Do they keep old versions
around?

--
Michael R. Crusoe

On Wed, Sep 8, 2021, 09:07 Andrius Merkys wrote:

> Hi all,
>
> On 2021-07-19 10:24, Nilesh Patra wrote:
> > On 19 July 2021 12:50:03 pm IST, Andrius Merkys wrote:
> >> Currently I am looking into ProMod3 [3], which seems to be the engine
> >> behind the great SWISS-MODEL service [4]. I seem to have figured out
> >> the dependencies, will go on to packaging next.
> > Let us know if you need help with packaging the chain, in case you need
> > helping hands :-)
>
> So here I am asking for help/suggestions :)
>
> Problem: OpenStructure, a dependency of ProMod3, requires PDB components
> library, components.cif.gz, for some of its protein modeling routines.
> This library is provided by the PDB at [1] and is itself freely
> distributable (PDB discourages from modifying it though), but is updated
> quite often and does not get a version number. Furthermore, people often
> prefer to obtain the most up-to-date copy of components.cif.gz for their
> research, thus providing it in a Debian package of its own would not be
> very convenient.
>
> I am aware of solutions to similar problems, for example, libcifpp
> package, which keeps an up-to-date mmcif_pdbx_v50.dic.gz at
> /var/cache/libcifpp/mmcif_pdbx_v50.dic.gz. This could work for
> components.cif.gz as well, but my main concern is whether keeping
> system-wide components.cif.gz up-to-date is what every user would want.
>
> As a researcher I do my best to perform reproducible science. Thus I
> want to know precise versions/timestamps/checksums of my input
> databases, and have them suddenly change overnight is something akin to
> a nightmare. What is more, there might be more than one user on a
> machine wanting different versions of components.cif.gz.
>
> Thus my candidate solution for providing components.cif.gz for
> OpenStructure would be to talk to the upstream to implement an
> environment variable allowing for greater flexibility. Or maybe there
> are other solutions?
>
> [1] ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif.gz
>
> Best,
> Andrius
>
>


Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]

2021-09-08 Thread Andrius Merkys
Hi all,

On 2021-07-19 10:24, Nilesh Patra wrote:
> On 19 July 2021 12:50:03 pm IST, Andrius Merkys wrote:
>> Currently I am looking into ProMod3 [3], which seems to be the engine
>> behind the great SWISS-MODEL service [4]. I seem to have figured out
>> the dependencies, will go on to packaging next.
> Let us know if you need help with packaging the chain, in case you need
> helping hands :-)

So here I am asking for help/suggestions :)

Problem: OpenStructure, a dependency of ProMod3, requires the PDB components
library, components.cif.gz, for some of its protein modeling routines.
This library is provided by the PDB at [1] and is itself freely
distributable (though the PDB discourages modifying it), but it is updated
quite often and does not get a version number. Furthermore, people often
prefer to obtain the most up-to-date copy of components.cif.gz for their
research, thus providing it in a Debian package of its own would not be
very convenient.

I am aware of solutions to similar problems, for example, libcifpp
package, which keeps an up-to-date mmcif_pdbx_v50.dic.gz at
/var/cache/libcifpp/mmcif_pdbx_v50.dic.gz. This could work for
components.cif.gz as well, but my main concern is whether keeping
system-wide components.cif.gz up-to-date is what every user would want.

As a researcher I do my best to perform reproducible science. Thus I
want to know the precise versions/timestamps/checksums of my input
databases, and having them suddenly change overnight is something akin to
a nightmare. What is more, there might be more than one user on a
machine wanting different versions of components.cif.gz.

Thus my candidate solution for providing components.cif.gz for
OpenStructure would be to talk to the upstream to implement an
environment variable allowing for greater flexibility. Or maybe there
are other solutions?

[1] ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif.gz

Best,
Andrius



Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced

2021-07-23 Thread Steffen Möller



On 19.07.21 09:28, Michael Banck wrote:
> On Sun, Jul 18, 2021 at 08:47:23PM +0200, Steffen Möller wrote:
>> Following the references in
>>
>> https://www.nature.com/articles/d41586-021-01968-y
>>
>> I found this reference
>>
>> https://github.com/RosettaCommons/RoseTTAFold/tags
>>
>> but I admittedly cannot tell that I'd have fully grasped who is doing
>> what and publishes what software where, yet.
>
> https://arstechnica.com/science/2021/07/google-details-its-protein-folding-software-academics-offer-an-alternative/
>
> tried to break it down.


Reddit and Michael from Rechenkraft.net pointed me to

https://www.nature.com/articles/s41586-021-03828-1

which announces AlphaFold predictions for all human proteins - ready to
download from the EBI.

In my mind this makes it more, not less, important to get this all into
Debian, so people can also start fiddling with the model and not only
create new predictions (which they should also do).

Best,
Steffen





Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced

2021-07-20 Thread Steffen Möller



On 19.07.21 20:03, Andrius Merkys wrote:
> Hi Nilesh,
>
> On 2021-07-19 10:24, Nilesh Patra wrote:
>> On 19 July 2021 12:50:03 pm IST, Andrius Merkys wrote:
>>> Currently I am looking into ProMod3 [3], which seems to be the engine
>>> behind the great SWISS-MODEL service [4]. I seem to have figured out
>>> the dependencies, will go on to packaging next.
>> Let us know if you need help with packaging the chain, in case you need
>> helping hands :-)
>
> Thanks a lot for your kind offer! I will let you know as soon as I reach
> a point of effort parallelization :)


I looked at the dependencies for AlphaFold and broke them down on

https://docs.google.com/spreadsheets/d/1tApLhVqxRZ2VOuMH_aPUgFENQJfbLlB_PFH_Ah_q7hM/edit#gid=1840067013

May I ask you to add the dependencies of SWISS-MODEL/ProMod3 to
that sheet, too?

Best,
Steffen




Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced

2021-07-19 Thread Andrius Merkys
Hi Nilesh,

On 2021-07-19 10:24, Nilesh Patra wrote:
> On 19 July 2021 12:50:03 pm IST, Andrius Merkys wrote:
>> Currently I am looking into ProMod3 [3], which seems to be the engine
>> behind the great SWISS-MODEL service [4]. I seem to have figured out
>> the dependencies, will go on to packaging next.
> Let us know if you need help with packaging the chain, in case you need
> helping hands :-)

Thanks a lot for your kind offer! I will let you know as soon as I reach
a point of effort parallelization :)

Best,
Andrius



Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced

2021-07-19 Thread Michael Banck
On Sun, Jul 18, 2021 at 08:47:23PM +0200, Steffen Möller wrote:
> Following the references in
> 
> https://www.nature.com/articles/d41586-021-01968-y
> 
> I found this reference
> 
> https://github.com/RosettaCommons/RoseTTAFold/tags
> 
> but I admittedly cannot tell that I'd have fully grasped who is doing
> what and publishes what software where, yet.

https://arstechnica.com/science/2021/07/google-details-its-protein-folding-software-academics-offer-an-alternative/

tried to break it down.


Michael



Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced

2021-07-19 Thread Nilesh Patra



On 19 July 2021 12:50:03 pm IST, Andrius Merkys wrote:
>Currently I am looking into ProMod3 [3], which seems to be the engine
>behind the great SWISS-MODEL service [4]. I seem to have figured out
>the dependencies, will go on to packaging next.

Let us know if you need help with packaging the chain, in case you need helping 
hands :-)

>[3] https://www.openstructure.org/promod3
>[4] https://swissmodel.expasy.org/interactive

Nilesh




Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced

2021-07-19 Thread Andrius Merkys
Hi Steffen,

On 2021-07-18 21:47, Steffen Möller wrote:
> Following the references in
> 
> https://www.nature.com/articles/d41586-021-01968-y
> 
> I found this reference
> 
> https://github.com/RosettaCommons/RoseTTAFold/tags
> 
> but I admittedly cannot tell that I'd have fully grasped who is doing
> what and publishes what software where, yet.

This would definitely be great to have in Debian. I expect this protein
structure predictor heavily depends on pre-trained neural networks.
Making them DFSG-compliant might be very difficult (see Unofficial
Policy for Debian & Machine Learning [1]).

> The context:
> 
> The biochemistry happens in 3D as proteins (and some functional RNA),
> while all we easily get are 1D DNA sequence data from which one can
> predict the sequences of amino acids that form the protein. To get from
> that polymer of amino acids to the 3D structure one typically looks for
> patterns of sequences that have been observed in structures that have
> already been determined. Or one try computing that structure de novo,
> i.e. without a template, and ... wait.
> 
> Once we have the protein structures one can better understand the effect
> of mutations, and start other sorts of simulations, like disturbing
> protein interactions with compounds, i.e. find new drugs.

I am particularly interested in this. However, this field is quite
underrepresented in Debian, AFAIK. Some time ago I packaged
MacroMoleculeBuilder [2], which does homology modeling. All I have tried
is a couple of examples, but it is definitely worth a look.
Currently I am looking into ProMod3 [3], which seems to be the engine
behind the great SWISS-MODEL service [4]. I seem to have figured out the
dependencies and will go on to packaging next.

[1] https://salsa.debian.org/deeplearning-team/ml-policy
[2] https://simtk.org/projects/rnatoolbox
[3] https://www.openstructure.org/promod3
[4] https://swissmodel.expasy.org/interactive

Best wishes,
Andrius



DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced

2021-07-18 Thread Steffen Möller

Following the references in

https://www.nature.com/articles/d41586-021-01968-y

I found this reference

https://github.com/RosettaCommons/RoseTTAFold/tags

but I admittedly cannot claim to have fully grasped yet who is doing
what and who publishes what software where.

The context:

The biochemistry happens in 3D as proteins (and some functional RNA),
while all we easily get are 1D DNA sequence data from which one can
predict the sequences of amino acids that form the protein. To get from
that polymer of amino acids to the 3D structure one typically looks for
patterns of sequences that have been observed in structures that have
already been determined. Or one tries computing that structure de novo,
i.e. without a template, and ... wait.

Once we have the protein structures one can better understand the effect
of mutations, and start other sorts of simulations, like disturbing
protein interactions with compounds, i.e. find new drugs.

Best,
Steffen