Re: [Rdkit-discuss] parsing reactions for reactants, agents, products

Benjamin Datko Thu, 24 Oct 2019 09:34:52 -0700

Hi Greg,

Thanks for all the info! After Hongbin's email, I found a Daylight page on
reaction agents (
https://www.daylight.com/meetings/summerschool01/course/basics/smirks.html)
but I did not realize there was a full manual on the standardization.


> The method RemoveUnmappedReactantTemplates() on the chemical reaction
> object is there for a few reasons. The primary one is that "real world"
> reaction data often isn't 100% clean and can include solvents/reagents in
> the reactants section (be that in SMILES or RXN files).
> RemoveUnmappedReactantTemplates() solves that using a simple heuristic:
> "reactants" that contain more than a threshold percentage of unmapped atoms
> are either completely removed or marked as agents


Yes, thank you for explaining the reasoning behind the implementation. I
first saw this idea in your paper (
https://pubs.acs.org/doi/10.1021/ci5006614). After playing with your
Jupyter notebooks and going through the objects and methods stepwise, I
stumbled upon the RemoveUnmappedReactantTemplates method. Gotta say, thank
you for supplying the notebooks in your SI. Walking through the notebooks
really expedites my learning of RDKit. =)

Very Respectfully,

Ben

On Wed, Oct 23, 2019 at 2:37 AM Greg Landrum <[email protected]> wrote:

> Hi Benjamin,
>
> On Tue, Oct 22, 2019 at 4:32 PM Benjamin Datko <
> [email protected]> wrote:
>
>> Hi Hongbin,
>>
>> Thank you for breaking the code down. I am still new to python and all
>> its pythonic ways. I did not notice that reactants, agents, and products
>> were delimited by '>'. After you break it down, the code does lose its
>> magic. =P
>>
>
> Yeah, as long as you have clean reaction data that works great (see below).
>
>
>>
>> Do you happen to have any good references on hand that describe the
>> standards used in RDKit, or in cheminformatics generally, for molecular and
>> reaction representation?
>>
>
> The SMILES/SMARTS based formats are documented here:
> https://www.daylight.com/dayhtml/doc/theory/
> That's a great place to start.
>
>
>> The second question corresponds to the discussion I found in this thread (
>> https://sourceforge.net/p/rdkit/mailman/message/36316849/). I believe
>> these parameters correspond to the PgSQL RDKit implementation, but I am not
>> sure. Below I show a recursive search of the RDKit source downloaded
>> from GitHub (https://github.com/greglandrum/rdkit).
>>
>
> The method RemoveUnmappedReactantTemplates() on the chemical reaction
> object is there for a few reasons. The primary one is that "real world"
> reaction data often isn't 100% clean and can include solvents/reagents in
> the reactants section (be that in SMILES or RXN files).
> RemoveUnmappedReactantTemplates() solves that using a simple heuristic:
> "reactants" that contain more than a threshold percentage of unmapped atoms
> are either completely removed or marked as agents
>
> Hope this helps.
> -greg
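
The heuristic Greg describes above can be sketched in plain Python, without
RDKit, to make the rule concrete. This is only an illustration of the rule as
worded in the email, not RDKit's implementation: the regex atom counter is a
crude stand-in for a real SMILES parser, and the names unmapped_fraction and
partition_reactants are hypothetical. The example molecules are either fully
mapped or fully unmapped, so the outcome does not depend on exactly how RDKit
compares against the threshold.

```python
import re

# Crude atom tokenizer (an assumption, not a real SMILES parser): bracket
# atoms, then two-letter halogens, then one-letter organic-subset atoms.
ATOM_RE = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# A bracket atom carries an atom map if it ends in ":<digits>]".
MAPPED_RE = re.compile(r":\d+\]$")


def unmapped_fraction(smiles):
    """Return the fraction of atoms in a SMILES string that have no atom map."""
    atoms = ATOM_RE.findall(smiles)
    if not atoms:
        return 0.0
    unmapped = [atom for atom in atoms if not MAPPED_RE.search(atom)]
    return len(unmapped) / len(atoms)


def partition_reactants(rxn_smiles, threshold=0.2, move_to_agents=True):
    """Move (or drop) reactants whose unmapped fraction exceeds the threshold."""
    reactants, agents, products = [
        section.split(".") if section else []
        for section in rxn_smiles.split(">")
    ]
    kept = []
    for smiles in reactants:
        if unmapped_fraction(smiles) > threshold:
            if move_to_agents:
                agents.append(smiles)  # relabel as an agent
            # otherwise the molecule is dropped entirely
        else:
            kept.append(smiles)
    return kept, agents, products


# Esterification with unmapped sulfuric acid listed among the reactants.
rxn = (
    "[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6].OS(=O)(=O)O"
    ">>[CH3:1][C:2](=[O:3])[O:6][CH3:5]"
)
kept, agents, products = partition_reactants(rxn)
print(kept)    # ['[CH3:1][C:2](=[O:3])[OH:4]', '[CH3:5][OH:6]']
print(agents)  # ['OS(=O)(=O)O']
```

In RDKit itself the same two knobs appear, as far as I know, as the keyword
arguments thresholdUnmappedAtoms and moveToAgentTemplates of
RemoveUnmappedReactantTemplates(); help(rxn.RemoveUnmappedReactantTemplates)
shows the exact signature and semantics for the installed version.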
>
>
>> $ pwd
>> Downloads/rdkit-master
>>
>> $ grep -r move_unmmapped_reactants_to_agents .
>> ./Code/PgSQL/rdkit/guc.c:static bool rdkit_move_unmmapped_reactants_to_agents = true;
>> ./Code/PgSQL/rdkit/guc.c:   "rdkit.move_unmmapped_reactants_to_agents",
>> ./Code/PgSQL/rdkit/guc.c:   &rdkit_move_unmmapped_reactants_to_agents,
>> ./Code/PgSQL/rdkit/guc.c:  return rdkit_move_unmmapped_reactants_to_agents;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.move_unmmapped_reactants_to_agents=true;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.move_unmmapped_reactants_to_agents=false;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.move_unmmapped_reactants_to_agents=true;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.move_unmmapped_reactants_to_agents=false;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.move_unmmapped_reactants_to_agents=true;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.move_unmmapped_reactants_to_agents=true;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.move_unmmapped_reactants_to_agents=false;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.move_unmmapped_reactants_to_agents=true;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.move_unmmapped_reactants_to_agents=false;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.move_unmmapped_reactants_to_agents=true;
>>
>> $ grep -r threshold_unmapped_reactant_atoms .
>> ./Code/PgSQL/rdkit/guc.c:static double rdkit_threshold_unmapped_reactant_atoms = 0.2;
>> ./Code/PgSQL/rdkit/guc.c:   "rdkit.threshold_unmapped_reactant_atoms",
>> ./Code/PgSQL/rdkit/guc.c:   &rdkit_threshold_unmapped_reactant_atoms,
>> ./Code/PgSQL/rdkit/guc.c:  return rdkit_threshold_unmapped_reactant_atoms;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.threshold_unmapped_reactant_atoms=0.2;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.threshold_unmapped_reactant_atoms=0.9;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.threshold_unmapped_reactant_atoms=0.2;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.threshold_unmapped_reactant_atoms=0.2;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.threshold_unmapped_reactant_atoms=0.9;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.threshold_unmapped_reactant_atoms=0.2;
>>
>>
>>
>> On Tue, Oct 22, 2019 at 2:29 AM Hongbin Yang <[email protected]> wrote:
>>
>>> Hi Benjamin,
>>>
>>> The magic code uses a feature of python named "list comprehension".
>>> https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
>>>
>>> It does not read the rxn string directly, but splits the string first.
>>> Since the reaction string should be `reactants smiles>agents smiles>product
>>> smiles`, we can get these SMILES strings by "rxn_string.split('>')".
>>> Then for each part, we can use splitter "." to get single molecules. So
>>> finally, [mols.split('.') for mols in rxn_string.split('>')] becomes
>>> [[reactant1, reactant2, ..], [agent1, agent2, ..], [product1, product2,
>>> ...]]. But they are all SMILES strings.
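
A minimal, RDKit-free sketch of the splitting step described above, run on a
made-up esterification string. The function name split_reaction_smiles is
hypothetical, and the sketch assumes a plain reaction SMILES with exactly two
'>' separators and no extensions. One detail worth knowing: ''.split('.')
returns [''] rather than [], so an empty agents section needs the explicit
guard shown here.

```python
def split_reaction_smiles(rxn_string):
    """Split 'reactants>agents>products' into three lists of SMILES strings."""
    sections = rxn_string.split(">")
    if len(sections) != 3:
        raise ValueError("expected exactly two '>' separators")
    # Inner split on '.' separates the molecules within each section;
    # an empty section becomes an empty list instead of [''].
    return [section.split(".") if section else [] for section in sections]


reactants, agents, products = split_reaction_smiles(
    "CC(=O)O.CO>OS(=O)(=O)O>CC(=O)OC"
)
print(reactants)  # ['CC(=O)O', 'CO']
print(agents)     # ['OS(=O)(=O)O']
print(products)   # ['CC(=O)OC']

# No agents: the middle section is empty.
print(split_reaction_smiles("CCO>>CC=O"))  # [['CCO'], [], ['CC=O']]
```

Each string in these lists could then be passed to Chem.MolFromSmiles, which
is what mols_from_smiles_list is described as doing.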
>>>
>>> mols_from_smiles_list is defined here:
>>> https://github.com/connorcoley/ASKCOS/blob/master/makeit/utilities/io/draw.py#L16
>>> It just reads the SMILES strings in a list into a list of molecules. The
>>> only API it uses is "Chem.MolFromSmiles".
>>>
>>> The magic code can be translated into:
>>>
>>> reactants_smiles, agents_smiles, product_smiles = rxn_string.split('>')
>>> package_results = []
>>> for mols in (reactants_smiles, agents_smiles, product_smiles):
>>>   x = mols.split('.')            # x is a list of SMILES strings
>>>   y = mols_from_smiles_list(x)   # y is a list of molecule objects
>>>   package_results.append(y)
>>> reactants, agents, products = package_results
>>>
>>> The code now is not cool enough.
>>>
>>> I have no idea about the second question. May I ask where the
>>> parameters threshold_unmapped_reactant_atoms and
>>> move_unmmapped_reactants_to_agents
>>> are defined?
>>>
>>> Best,
>>>
>>> Hongbin Yang 杨弘宾, Ph.D.
>>> Research: Toxicophore and Chemoinformatics
>>> On 10/22/2019 13:08, Benjamin Datko <[email protected]> wrote:
>>>
>>> Hello all,
>>>
>>> While reading the source code for ASKCOS (
>>> https://github.com/connorcoley/ASKCOS/blob/master/makeit/utilities/io/draw.py)
>>> I noticed this code snippet (line 216 on the GitHub):
>>>
>>> reactants, agents, products = [mols_from_smiles_list(x) for x in
>>> [mols.split('.') for mols in rxn_string.split('>')]]
>>>
>>> When the above code is applied to a SMILES reaction string, the result
>>> unpacks the reactant, agent, and product mol objects into the respective
>>> variables, with pretty good accuracy.  The function 'mols_from_smiles_list'
>>> essentially just applies Chem.MolFromSmiles over a list of SMILES.
>>>
>>> I think this code snippet is really cool but I cannot find any
>>> documentation on how this is working. Searching this mailing list I came
>>> across the thread (
>>> https://sourceforge.net/p/rdkit/mailman/message/36316849/) where this
>>> operation of labeling reactants, agents, and products seems to be
>>> determined by the threshold_unmapped_reactant_atoms explained in the quoted
>>> text from the message (linked above)
>>>
>>>> Here's what's going on: By default the cartridge code does an extra step
>>>> after reading a reaction from SMILES/SMARTS: it looks at all the reactants
>>>> and moves any that don't have a sufficient fraction of mapped atoms to the
>>>> agents. We do this by default because the reactions that we found "in the
>>>> wild" often have agents, solvents, etc. mixed in with the reactants. The
>>>> key parameter used there is threshold_unmapped_reactant_atoms, which
>>>> defaults to 0.2.
>>>
>>>
>>> The only further reading I can find is from Greg's paper (
>>> https://pubs.acs.org/doi/10.1021/ci5006614). I have two main questions:
>>>
>>> 1. Where in the code is this atom mapping being applied? I cannot tell
>>> when this method is being applied or where the meta data is being saved.
>>> Applying the code snippet above to a SMILES reaction string results in a
>>> list of rdkit.Chem.rdchem.Mol objects. I cannot seem to find any static
>>> method or attributes specifying if it's a reactant, agent, or product when
>>> inspecting a mol object using help in a python terminal.
>>>
>>> 2. How can I change the value of the
>>> variables threshold_unmapped_reactant_atoms
>>> and move_unmmapped_reactants_to_agents? I am using rdkit version 2019.03.4
>>> in an Anaconda environment. I want to experiment changing the mapping
>>> threshold.
>>>
>>> Very Respectfully,
>>>
>>> Benjamin
>>>
>>> _______________________________________________
>> Rdkit-discuss mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
