Ahh! Thank you so much, to both of you. Yes, the different meaning of H in the various contexts was tripping me up.
Also, DescribeQuery() was definitely a function that I needed for debugging this solo. Thank you. I will keep that in mind in the future. I found that this smiles (S4) is exactly what I needed: '[C,c]1(Cl)[C,c][C,c]([N,n,C,c,#1])[C,c]([C,c])[C,c]([#1])[C,c]1'. Cheers, Adam On Tue, Mar 1, 2022 at 4:32 AM Ivan Tubert-Brohman < ivan.tubert-broh...@schrodinger.com> wrote: > A minor correction: [H] by itself *is* valid and means a hydrogen atom. > The Daylight docs say as much in section 4.1. But in other contexts it > means a hydrogen count, so to be safe, always using #1 to mean a hydrogen > atom can be a good practice. > > If you are ever in doubt about how RDKit is interpreting a SMARTS, > I recommend making use of the DescribeQuery function which provides a tree > representation of a query atom or bond. For example (comments added): > > >>> mol = Chem.MolFromSmarts('[H][N,H][N,#1]') > > > >>> print(mol.GetAtomWithIdx(0).DescribeQuery()) # [H] > > AtomAtomicNum 1 = val # [H] interpreted as a hydrogen atom > > >>> print(mol.GetAtomWithIdx(1).DescribeQuery()) # [N,H] > > AtomOr > AtomType 7 = val > AtomHCount 1 = val # H interpreted as a hydrogen count > # Overall query atom means "an aliphatic nitrogen OR (any atom with one > hydrogen)! > > >>> print(mol.GetAtomWithIdx(2).DescribeQuery()) # [N,#1] > > AtomOr > AtomType 7 = val > AtomAtomicNum 1 = val # "#1" is atomic number, therefore a hydrogen > atom. > # Overall query atom means "an aliphatic nitrogen OR a hydrogen" > > One non-obvious convention in the DescribeQuery output is that AtomType > implies aliphatic when the value is a normal atomic number, or aromatic if > the atomic number is offset by 1000. For example, [n] is "AtomType 1007". > > Hope you find this approach useful in the future. > > Ivan > > On Tue, Mar 1, 2022 at 6:33 AM David Cosgrove <davidacosgrov...@gmail.com> > wrote: > >> Hi Adam >> There are a number of issues here. The key one, I think, is a >> misunderstanding about the meaning of H in SMARTS. It means "a single >> attached hydrogen", and is a qualifier for another atom, it cannot be used >> by itself. So [*H] is valid, [H] isn't. See the table at >> https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html. If you >> want to refer to an explicit hydrogen, you have to use [#1]. However, that >> will only match an explicit hydrogen in the molecule, not an implicit one. >> Thus c[#1] doesn't match anything in c1ccccc1. If you have read in a >> molecule from a molfile, for example, that has explicit hydrogens then you >> will be ok. >> >> Further to that, your SMARTS strings, at least as they have appeared in >> gmail, which may have garbled them, are incorrect. In S1, the brackets >> round [N,n,H] make it a substituent, so it will not match the indole >> nitrogen. Also, it would probably be better as [N,n;H], which would be >> read as "(aliphatic nitrogen OR aromatic nitrogen) AND 1 attached >> hydrogen." The [N,n,H] will match a methylated indole nitrogen which I >> imagine is not what you want. Similar remarks apply to S2. A SMARTS that >> matches both 6CI and PCT >> is [C,c]1(Cl)[C,c][C,c;H][C,c]([C,c])[C,c;H][C,c]1, but that won't match >> the H atoms themselves if you want to use them in the overlay, and it also >> won't work in the aliphatic case of, for example, ClC1CCC(C)CC1 because >> there the carbon atoms have 2 attached hydrogens. If you really do want >> it to match aliphatic cases as well, then you will need something >> like >> [C,c]1(Cl)[$([CH2]),$([cH])][$([CH2]),$([cH])][C,c]([C,c])[$([CH2]),$([cH])][$([CH2]),$([cH])]1 >> which is quite a mouthful. The carbons at the 2,3,5 and 6 positions on the >> ring are specified as either [CH2] or [cH]. >> >> Jupyter notebook can be really useful for debugging SMARTS patterns like >> this. The one I used was variations of >> ``` >> from rdkit import Chem >> from IPython.display import SVG >> mol = Chem.MolFromSmiles('C1=CC(=CC2=C1C=CN2C)Cl') >> qmol = Chem.MolFromSmarts('[C,c]1(Cl)[C,c][C,c][C,c]([C,c])[C,c][C,c]1') >> print(mol.GetSubstructMatches(qmol)) >> mol >> ``` >> which prints the numbers of the matching atoms and also draws the >> molecule with the match highlighted. >> Regards, >> Dave >> >> >> On Tue, Mar 1, 2022 at 1:43 AM Adam Moyer <atom.mo...@gmail.com> wrote: >> >>> Hello, >>> >>> I have a baffling case where I am trying to match substructures on two >>> ligands for the goal of aligning them. >>> >>> I have two ligands; one is a 6-chloroindole (6CI) and the other is a >>> para-chloro toluene (PCT). >>> >>> I am attempting to use the following SMARTS (S1) to match >>> them: '[C,c]1(Cl)[C,c][C,c]*([N,n,H])*[C,c]([C,c,H])[C,c]([H])[C,c]1'. >>> For some reason S1 only finds a match in 6CI. >>> >>> When I use the following SMARTS (S2) I only match to PCT as expected: >>> '[C,c]1(Cl)[C,c][C,c]*([H])*[C,c]([C,c,H])[C,c]([H])[C,c]1'. >>> >>> How can S1 not match PCT? S1 is strictly a superset of S2 because I am >>> using the "or" operation. Do I have a misunderstanding of how explicit >>> hydrogens work in RDKit/SMARTS? >>> >>> Lastly when I use the last SMARTS (S3) I am able to match to both, but I >>> cannot use that smarts due to other requirements in my >>> project: '[C,c]1(Cl)[C,c][C,c][C,c]([C,c,H])[C,c]([H])[C,c]1' >>> >>> Thanks! >>> Adam >>> _______________________________________________ >>> Rdkit-discuss mailing list >>> Rdkit-discuss@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >>> >> >> >> -- >> David Cosgrove >> Freelance computational chemistry and chemoinformatics developer >> http://cozchemix.co.uk >> >> _______________________________________________ >> Rdkit-discuss mailing list >> Rdkit-discuss@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >> >
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss