Hi Greg,

Thanks for all the info! After Hongbin's email, I found a Daylight page on reaction agents (https://www.daylight.com/meetings/summerschool01/course/basics/smirks.html), but I did not realize there was a full manual on the standardization.

> The method RemoveUnmappedReactantTemplates() on the chemical reaction
> object is there for a few reasons. The primary one is that "real world"
> reaction data often isn't 100% clean and can include solvents/reagents in
> the reactants section (be that in SMILES or RXN files).
> RemoveUnmappedReactantTemplates() solves that using a simple heuristic:
> "reactants" that contain more than a threshold percentage of unmapped atoms
> are either completely removed or marked as agents

Yes, thank you for explaining the reasoning behind the implementation. I first saw this idea in your paper (https://pubs.acs.org/doi/10.1021/ci5006614). After playing with your Jupyter notebooks and going through the objects and methods stepwise, I stumbled upon the RemoveUnmappedReactantTemplates method. Gotta say, thank you for supplying the notebooks in your SI. Walking through the notebooks really expedites my learning of RDKit. =)

Very Respectfully,
Ben

On Wed, Oct 23, 2019 at 2:37 AM Greg Landrum <greg.land...@gmail.com> wrote:

> Hi Benjamin,
>
> On Tue, Oct 22, 2019 at 4:32 PM Benjamin Datko <benjamin.datko....@gmail.com> wrote:
>
>> Hi Hongbin,
>>
>> Thank you for breaking the code down. I am still new to python and all
>> its pythonic ways. I did not notice that reactants, agents, and products
>> were delimited by '>'. After you break it down, the code does lose its
>> magic. =P
>
> Yeah, as long as you have clean reaction data that works great (see below).
>
>> Do you happen to have any good references on hand that describe some of the
>> standards used in RDKit or in cheminformatics for molecular and reaction
>> representation?
>
> The SMILES/SMARTS based formats are documented here:
> https://www.daylight.com/dayhtml/doc/theory/
> That's a great place to start.
>
>> The second question corresponds to the discussion I found in this thread (
>> https://sourceforge.net/p/rdkit/mailman/message/36316849/).
>> I believe these parameters correspond to the PgSQL RDKit implementation,
>> but I am not sure. Below I show a recursive search of the RDKit source
>> downloaded from GitHub (https://github.com/greglandrum/rdkit).
>
> The method RemoveUnmappedReactantTemplates() on the chemical reaction
> object is there for a few reasons. The primary one is that "real world"
> reaction data often isn't 100% clean and can include solvents/reagents in
> the reactants section (be that in SMILES or RXN files).
> RemoveUnmappedReactantTemplates() solves that using a simple heuristic:
> "reactants" that contain more than a threshold percentage of unmapped atoms
> are either completely removed or marked as agents
>
> Hope this helps.
> -greg
>
>> $ pwd
>> Downloads/rdkit-master
>>
>> $ grep -r move_unmmapped_reactants_to_agents .
>> ./Code/PgSQL/rdkit/guc.c:static bool rdkit_move_unmmapped_reactants_to_agents = true;
>> ./Code/PgSQL/rdkit/guc.c: "rdkit.move_unmmapped_reactants_to_agents",
>> ./Code/PgSQL/rdkit/guc.c: &rdkit_move_unmmapped_reactants_to_agents,
>> ./Code/PgSQL/rdkit/guc.c: return rdkit_move_unmmapped_reactants_to_agents;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.move_unmmapped_reactants_to_agents=true;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.move_unmmapped_reactants_to_agents=false;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.move_unmmapped_reactants_to_agents=true;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.move_unmmapped_reactants_to_agents=false;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.move_unmmapped_reactants_to_agents=true;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.move_unmmapped_reactants_to_agents=true;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.move_unmmapped_reactants_to_agents=false;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.move_unmmapped_reactants_to_agents=true;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET
>> rdkit.move_unmmapped_reactants_to_agents=false;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.move_unmmapped_reactants_to_agents=true;
>>
>> $ grep -r threshold_unmapped_reactant_atoms .
>> ./Code/PgSQL/rdkit/guc.c:static double rdkit_threshold_unmapped_reactant_atoms = 0.2;
>> ./Code/PgSQL/rdkit/guc.c: "rdkit.threshold_unmapped_reactant_atoms",
>> ./Code/PgSQL/rdkit/guc.c: &rdkit_threshold_unmapped_reactant_atoms,
>> ./Code/PgSQL/rdkit/guc.c: return rdkit_threshold_unmapped_reactant_atoms;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.threshold_unmapped_reactant_atoms=0.2;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.threshold_unmapped_reactant_atoms=0.9;
>> ./Code/PgSQL/rdkit/expected/reaction.out:SET rdkit.threshold_unmapped_reactant_atoms=0.2;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.threshold_unmapped_reactant_atoms=0.2;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.threshold_unmapped_reactant_atoms=0.9;
>> ./Code/PgSQL/rdkit/sql/reaction.sql:SET rdkit.threshold_unmapped_reactant_atoms=0.2;
>>
>> On Tue, Oct 22, 2019 at 2:29 AM Hongbin Yang <yanyangh...@163.com> wrote:
>>
>>> Hi Benjamin,
>>>
>>> The magic code uses a feature of python named "list comprehension".
>>> https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
>>>
>>> It does not read the rxn string directly, but splits the string first.
>>> Since the reaction string should be
>>> `reactants smiles>agents smiles>product smiles`, we can get these SMILES
>>> strings by "rxn_string.split('>')".
>>> Then for each part, we can use splitter "." to get single molecules. So
>>> finally, [mols.split('.') for mols in rxn_string.split('>')] becomes
>>> [[reactant1, reactant2, ..], [agent1, agent2, ..], [product1, product2,
>>> ...]]. But they are all SMILES strings.
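A quick way to see what the nested list comprehension produces is to run the splitting step on a small reaction SMILES. The sketch below uses a made-up esterification string (an assumption for illustration, not taken from this thread or from ASKCOS) and only standard-library Python, so the results are still plain SMILES strings, not Mol objects.

```python
# Hypothetical reaction SMILES in the form reactants>agents>products.
rxn_string = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC.O"

# Split on '>' into the three sections, then on '.' into single molecules.
parts = [mols.split('.') for mols in rxn_string.split('>')]
reactants_smiles, agents_smiles, products_smiles = parts

print(reactants_smiles)  # ['CC(=O)O', 'OCC']
print(agents_smiles)     # ['OS(=O)(=O)O']
print(products_smiles)   # ['CC(=O)OCC', 'O']
```

Note that an empty agents section (as in `A.B>>C`) still yields three parts, so the three-way unpacking works, but the middle list is `['']`, i.e. it holds one empty string.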
>>>
>>> mols_from_smiles_list is defined here:
>>> https://github.com/connorcoley/ASKCOS/blob/master/makeit/utilities/io/draw.py#L16
>>> It just reads the SMILES strings in a list into a molecule list. The
>>> only API it uses is "Chem.MolFromSmiles".
>>>
>>> The magic code can be translated into:
>>>
>>>     # a reaction SMILES has the form reactants>agents>products
>>>     reactants_smiles, agents_smiles, product_smiles = rxn_string.split('>')
>>>     package_results = []
>>>     for mols in (reactants_smiles, agents_smiles, product_smiles):
>>>         x = mols.split('.')  # x is a list of SMILES strings
>>>         y = mols_from_smiles_list(x)  # y is a list of molecule objects
>>>         package_results.append(y)
>>>     reactants, agents, products = package_results
>>>
>>> The code now is not cool enough.
>>>
>>> I have no idea about the second question. May I ask where the
>>> parameters threshold_unmapped_reactant_atoms and
>>> move_unmmapped_reactants_to_agents are defined?
>>>
>>> Best,
>>>
>>> Hongbin Yang 杨弘宾, Ph.D.
>>> Research: Toxicophore and Chemoinformatics
>>> On 10/22/2019 13:08, Benjamin Datko <benjamin.datko....@gmail.com> wrote:
>>>
>>> Hello all,
>>>
>>> While reading the source code for ASKCOS (
>>> https://github.com/connorcoley/ASKCOS/blob/master/makeit/utilities/io/draw.py)
>>> I noticed this code snippet (line 216 on GitHub):
>>>
>>>     reactants, agents, products = [mols_from_smiles_list(x) for x in
>>>         [mols.split('.') for mols in rxn_string.split('>')]]
>>>
>>> When the above code is applied to a SMILES reaction string, the result
>>> unpacks the reactant, agent, and product mol objects into the respective
>>> variables, with pretty good accuracy. The function 'mols_from_smiles_list'
>>> essentially just applies Chem.MolFromSmiles over a list of SMILES.
>>>
>>> I think this code snippet is really cool but I cannot find any
>>> documentation on how this is working.
>>> Searching this mailing list I came across the thread (
>>> https://sourceforge.net/p/rdkit/mailman/message/36316849/) where this
>>> operation of labeling reactants, agents, and products seems to be
>>> determined by the threshold_unmapped_reactant_atoms parameter, as
>>> explained in the quoted text from the message (linked above):
>>>
>>>> Here's what's going on: By default the cartridge code does an extra step
>>>> after reading a reaction from SMILES/SMARTS: it looks at all the reactants
>>>> and moves any that don't have a sufficient fraction of mapped atoms to the
>>>> agents. We do this by default because the reactions that we found "in the
>>>> wild" often have agents, solvents, etc. mixed in with the reactants. The
>>>> key parameter used there is threshold_unmapped_reactant_atoms, which
>>>> defaults to 0.2.
>>>
>>> The only further reading I can find is from Greg's paper (
>>> https://pubs.acs.org/doi/10.1021/ci5006614). I have two main questions:
>>>
>>> 1. Where in the code is this atom mapping being applied? I cannot tell
>>> when this method is being applied or where the metadata is being saved.
>>> Applying the code snippet above to a SMILES reaction string results in
>>> lists of rdkit.Chem.rdchem.Mol objects. I cannot seem to find any static
>>> method or attribute specifying whether it's a reactant, agent, or product
>>> when inspecting a mol object using help in a python terminal.
>>>
>>> 2. How can I change the value of the variables
>>> threshold_unmapped_reactant_atoms and
>>> move_unmmapped_reactants_to_agents? I am using rdkit version 2019.03.4
>>> in an Anaconda environment. I want to experiment with changing the
>>> mapping threshold.
>>>
>>> Very Respectfully,
>>>
>>> Benjamin
>>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
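The heuristic discussed in this thread can be sketched in a few lines of plain Python. This is not RDKit's code: the `Template` tuple and the `move_unmapped_to_agents` function are hypothetical names, the atom counts are supplied by hand instead of being read from molecules, and the rule "keep a reactant only if its fraction of mapped atoms reaches the threshold" is an assumption based on the cartridge description quoted above (RDKit's exact comparison may differ). The 0.2 default mirrors the value shown in guc.c.

```python
from typing import List, NamedTuple, Tuple


class Template(NamedTuple):
    """Hypothetical stand-in for one reactant template."""
    smiles: str
    n_atoms: int   # heavy atoms in the template (counted by hand here)
    n_mapped: int  # atoms that carry an atom-map number


def move_unmapped_to_agents(
    reactants: List[Template], threshold: float = 0.2
) -> Tuple[List[Template], List[Template]]:
    """Split reactants into (kept reactants, moved-to-agents)."""
    kept, agents = [], []
    for t in reactants:
        mapped_fraction = t.n_mapped / t.n_atoms if t.n_atoms else 0.0
        # Assumed rule: too few mapped atoms suggests a solvent or reagent.
        (kept if mapped_fraction >= threshold else agents).append(t)
    return kept, agents


# Made-up example: fully mapped acid and alcohol, plus an unmapped solvent.
reactants = [
    Template("[CH3:1][C:2](=[O:3])[OH:4]", n_atoms=4, n_mapped=4),
    Template("[OH:5][CH2:6][CH3:7]", n_atoms=3, n_mapped=3),
    Template("ClCCl", n_atoms=3, n_mapped=0),
]
kept, agents = move_unmapped_to_agents(reactants)
print([t.smiles for t in kept])
print([t.smiles for t in agents])
```

In RDKit itself the corresponding step is the RemoveUnmappedReactantTemplates() method on the reaction object that Greg describes; in the PostgreSQL cartridge the same two settings appear as the `SET rdkit.threshold_unmapped_reactant_atoms=...` and `SET rdkit.move_unmmapped_reactants_to_agents=...` statements in the grep output above.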