Hi Asad, I should have mentioned your SMSD branch, but it only occurred to me after sending my second post:
http://github.com/asad/cdk-smsd I was looking through it last night, and wondered which of the various algorithms would be suitable for altering to give common subgraphs of all sizes. VFlib? Anyway, it looks interesting, and I hope to try it out sometime. Gilleain On Tue, Mar 9, 2010 at 8:01 AM, Syed Asad Rahman <a...@ebi.ac.uk> wrote: > Hello Everyone, > > Its a very interesting topic at least from my point of view :-) > > Let me clear two points. > > a) Subgraph doesn't mean Max Common Subgraph (MCS) > b) In CDK UIT (as well as with SMSD) you can obtain possible subgraph(s) of > certain resolution/size. You just need a small tweak in the code which > returns MCS. > > Let me know if this help and I will be glad to help. > > Cheers > > > Asad > > > > ************************************************ > Syed Asad Rahman (PhD, PG, B.Engg) > Research Scientist > > EMBL-EBI > WTGC, > Hinxton CB10 1SD > Cambridge, UK > > Phone: +44- (0)1223 492537 > Fax: +44- (0)1223 494486 > a...@ebi.ac.uk > www.ebi.ac.uk /~asad > > ************************************************* > > On 8 Mar 2010, at 21:33, cdk-user-requ...@lists.sourceforge.net wrote: > >> Send Cdk-user mailing list submissions to >> cdk-user@lists.sourceforge.net >> >> To subscribe or unsubscribe via the World Wide Web, visit >> https://lists.sourceforge.net/lists/listinfo/cdk-user >> or, via email, send a message with subject or body 'help' to >> cdk-user-requ...@lists.sourceforge.net >> >> You can reach the person managing the list at >> cdk-user-ow...@lists.sourceforge.net >> >> When replying, please edit your Subject line so it is more specific >> than "Re: Contents of Cdk-user digest..." >> >> >> Today's Topics: >> >> 1. Re: Subgraph Isomorphisms (gilleain torrance) >> 2. Re: Subgraph Isomorphisms (Leonid Chepelev) >> 3. Re: Subgraph Isomorphisms (Leonid Chepelev) >> 4. Re: Subgraph Isomorphisms (gilleain torrance) >> >> >> ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Mon, 8 Mar 2010 15:59:29 +0000 >> From: gilleain torrance <gilleain.torra...@gmail.com> >> Subject: Re: [Cdk-user] Subgraph Isomorphisms >> To: cdk-user@lists.sourceforge.net >> Message-ID: >> <d0d153661003080759j6213d58fwbb0c63bd0ae82...@mail.gmail.com> >> Content-Type: text/plain; charset=ISO-8859-1 >> >> Hello, >> >> The term 'subgraph isomorphism' is (I think) usually used to mean >> finding a subgraph of a graph G that is isomorphic to another graph H. >> What you describe sounds like the more difficult problem of finding >> all common subgraphs of G and H. >> >> So, I am not so sure that the javadocs are unclear in this case. There >> is the possibility that H may be found multiple times in G (a polymer, >> for example, will have multiple copies of the monomer in it), which is >> presumably why the return value of these methods are lists. >> >> As for how this could be done - well the simplest (and most expensive) >> way would be to enumerate every (connected?) subgraph of H, testing >> each one to see if it is also a subgraph of G. >> >> Gilleain Torrance >> >> On Mon, Mar 8, 2010 at 11:10 AM, Leonid Chepelev >> <leonid.chepe...@gmail.com> wrote: >>> Hello All, >>> >>> When trying to find all the subgraph isomorphisms, one would anticipate from >>> the javadocs?that the use of the >>> UniversalIsomorphismTester.getSubgraphAtomsMaps of CDK 1.2.5?would >>> return?ALL the subgraph isomorphisms in a molecule of interest. However, >>> that is not the case. I am posting the code below to provide an example of >>> this behaviour: for the two molecules in question, we can see that there is >>> only one MCS from the MCSS code?(as one would expect), but also only two >>> lists of RMaps from getSubgraphAtomsMaps, indicating only two subgraphs >>> identified. >>> >>> Do you believe it would be possible to get ALL the subgraphs, no matter how >>> small, into the getSubgraphAtomsMaps, as one would anticipate from the >>> description, or is there another way to do this within (or outside CDK)? >>> >>> Am?I doing things correctly? >>> >>> If the behaviour of getSubgraphAtomsMaps is intentionally such that it only >>> returns?n biggest subgraph matches, may I kindly?suggest that the javadoc >>> description be changed? >>> >>> Regards, >>> >>> Leonid Chepelev >>> >>> *************************************************************** >>> import java.util.Iterator; >>> import java.util.List; >>> import org.openscience.cdk.io.SMILESWriter; >>> import java.io.StringWriter; >>> import org.openscience.cdk.DefaultChemObjectBuilder; >>> import org.openscience.cdk.interfaces.IAtomContainer; >>> import org.openscience.cdk.isomorphism.mcss.RMap; >>> import org.openscience.cdk.Molecule; >>> import org.openscience.cdk.smiles.SmilesParser; >>> import org.openscience.cdk.isomorphism.UniversalIsomorphismTester; >>> public class Main { >>> ??? public static void main(String[] args) throws Exception { >>> ??????? SmilesParser smilesParser = new >>> SmilesParser(DefaultChemObjectBuilder.getInstance()); >>> ??????? String referenceMolecule = "C1CCCCC1CC(=O)OCCCC"; >>> ??????? String testmolecule = "C1CCCCC1CC(=O)O"; >>> ??????? IAtomContainer molecule = >>> smilesParser.parseSmiles(referenceMolecule); >>> ??????? IAtomContainer lookup = smilesParser.parseSmiles(testmolecule); >>> ??????? List allthefragments = >>> UniversalIsomorphismTester.getOverlaps(molecule, lookup); >>> ??????? for (Iterator<IAtomContainer> i = allthefragments.iterator(); >>> i.hasNext( ); ) { >>> ???????? IAtomContainer s = i.next( ); >>> ???????? try { >>> ???????????? StringWriter funna = new StringWriter(); >>> ???????????? SMILESWriter writemysmiles = new SMILESWriter(); >>> ???????????? writemysmiles.setWriter(funna); >>> ???????????? Molecule themoleculetowrite = new Molecule(s); >>> ???????????? writemysmiles.write(themoleculetowrite); >>> ???????????? writemysmiles.close(); >>> ???????????? System.out.println(funna.toString()); >>> ???????????? funna.close(); >>> ???????????? } catch (Exception e) { >>> ??????????????? System.out.println(e.toString()); >>> ???????????? } >>> ??????? } >>> ??????? List<List<RMap>> atommappingsad = >>> UniversalIsomorphismTester.getSubgraphAtomsMaps(molecule, lookup); >>> ???????? for (List<RMap> j:atommappingsad) { >>> ??????????? for (RMap estro:j){ >>> ??????????????? try { >>> ???????????????????? int firsta = estro.getId1(); >>> ???????????????????? int seconda = estro.getId2(); >>> ???????????????????? System.out.println("Atom " + firsta + " in G1?maps to >>> atom " + seconda + " in G2."); >>> ??????????????? } catch (Exception e) { >>> ???????????????????? System.out.println(e.toString()); >>> ??????????????? } >>> ??????????? } >>> ???????? System.out.println("End of mapped fragment..."); >>> ???????? } >>> ?? } >>> } >>> ------------------------------------------------------------------------------ >>> Download Intel® Parallel Studio Eval >>> Try the new software tools for yourself. Speed compiling, find bugs >>> proactively, and fine-tune applications for parallel performance. >>> See why Intel Parallel Studio got high marks during beta. >>> http://p.sf.net/sfu/intel-sw-dev >>> _______________________________________________ >>> Cdk-user mailing list >>> Cdk-user@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/cdk-user >>> >>> >> >> >> >> ------------------------------ >> >> Message: 2 >> Date: Mon, 8 Mar 2010 13:29:56 -0500 >> From: Leonid Chepelev <leonid.chepe...@gmail.com> >> Subject: Re: [Cdk-user] Subgraph Isomorphisms >> To: cdk-user@lists.sourceforge.net >> Message-ID: >> <fbfdc0f41003081029r76d8fde0ib3a7d3dc743d8...@mail.gmail.com> >> Content-Type: text/plain; charset="iso-8859-1" >> >> Thank you for answering, Gilleain. >> >> Hello, The term 'subgraph isomorphism' is (I think) usually used to mean >>> finding a subgraph of a graph G that is isomorphic to another graph H. What >>> you describe sounds like the more difficult problem of finding all common >>> subgraphs of G and H. So, I am not so sure that the javadocs are unclear in >>> this case. >> >> >> Hm, I couldn't really find an authoritative definition - and I am certain a >> number of people could argue about this, but from the literature I've seen, >> I always thought that two graphs may have a large number of isomorphisms, >> and it is the largest isomorphism that is selected by procedures like >> maximal common substructure searching, e.g. via the Ullman algorithm. When >> one starts with an atom/bond pair in G1 and G2 and attempts to iteratively >> extend the initial matched fragment, it would seem a trivial task to output >> at every step of the iterative extension the subgraph, or a part of it, that >> does not yet violate edge/node matching criteria set forth in the subgraph >> isomorphism searching algorithm. As far as I understood, these subgraphs are >> collected anyway for the purposes of comparison, unless superseded by an >> extension, and only the result of the longest extension-matching procedure >> is selected as the MCS. So, I erred and believed that the >> getSubgraphAtomsMaps would give me precisely that, a collection of all the >> fragments that match and their variations. >> >> There is the possibility that H may be found multiple times in G (a polymer, >>> for example, will have multiple copies of the monomer in it), which is >>> sumably why the return value of these methods are lists. >> >> >> Again, I thought that was for the MCS, but please, feel free to let me know >> if I was wrong. I just wanted to make certain that CDK does not currently do >> this and that I am not re-inventing the wheel here. And again, I suppose I >> just misinterpreted the javadoc, sorry. >> >> As for how this could be done - well the simplest (and most expensive) way >>> would be to enumerate every (connected?) subgraph of H, testing each one to >>> see if it is also a subgraph of G. Gilleain Torrance >> >> >> That's one method that I have tried, however, in many cases, the cost of >> enumerating all the possible connected subgraphs of H is astronomical. For >> instance, if we cut five bonds of a molecule that contains 40 bonds, we will >> have about 7.5e10 fragments to enumerate. If we engage in growing fragments >> with all possible branches, that also often leads to combinatorial >> explosions. I have created an 'approximate solution', which is more >> efficient than what I am outlining here, but it's still an approximation. >> So, for all practical purposes, that is not a feasible approach. Well, it >> would have been easier to do it as I mentioned, that all fragments beyond a >> certain size would be reported in the generic collection at every iteration >> of the subgraph isomorphism detection tool. This has already been done and >> published numerous times, without the release of the source code. Oh well, >> if no one currently does this, I guess I'll go ahead and reinvent the wheel >> yet again. >> >> Any other opinions, anyone? I know you are there! >> >> Regards, >> >> Leonid Chepelev >> -------------- next part -------------- >> An HTML attachment was scrubbed... >> >> ------------------------------ >> >> Message: 3 >> Date: Mon, 8 Mar 2010 13:32:46 -0500 >> From: Leonid Chepelev <leonid.chepe...@gmail.com> >> Subject: Re: [Cdk-user] Subgraph Isomorphisms >> To: cdk-user@lists.sourceforge.net >> Message-ID: >> <fbfdc0f41003081032q305ff913mae75c8ea7008b...@mail.gmail.com> >> Content-Type: text/plain; charset="iso-8859-1" >> >> Sorry, in the previous message, 5 bonds out of 40 should be 10 bonds out of >> 60. The point is still valid, however. >> >> Regards, >> >> Leonid Chepelev >> >> On Mon, Mar 8, 2010 at 1:29 PM, Leonid Chepelev >> <leonid.chepe...@gmail.com>wrote: >> >>> Thank you for answering, Gilleain. >>> >>> Hello, The term 'subgraph isomorphism' is (I think) usually used to mean >>>> finding a subgraph of a graph G that is isomorphic to another graph H. What >>>> you describe sounds like the more difficult problem of finding all common >>>> subgraphs of G and H. So, I am not so sure that the javadocs are unclear in >>>> this case. >>> >>> >>> Hm, I couldn't really find an authoritative definition - and I am certain a >>> number of people could argue about this, but from the literature I've seen, >>> I always thought that two graphs may have a large number of isomorphisms, >>> and it is the largest isomorphism that is selected by procedures like >>> maximal common substructure searching, e.g. via the Ullman algorithm. When >>> one starts with an atom/bond pair in G1 and G2 and attempts to iteratively >>> extend the initial matched fragment, it would seem a trivial task to output >>> at every step of the iterative extension the subgraph, or a part of it, that >>> does not yet violate edge/node matching criteria set forth in the subgraph >>> isomorphism searching algorithm. As far as I understood, these subgraphs are >>> collected anyway for the purposes of comparison, unless superseded by an >>> extension, and only the result of the longest extension-matching procedure >>> is selected as the MCS. So, I erred and believed that the >>> getSubgraphAtomsMaps would give me precisely that, a collection of all the >>> fragments that match and their variations. >>> >>> There is the possibility that H may be found multiple times in G (a >>>> polymer, for example, will have multiple copies of the monomer in it), >>>> which >>>> is sumably why the return value of these methods are lists. >>> >>> >>> Again, I thought that was for the MCS, but please, feel free to let me know >>> if I was wrong. I just wanted to make certain that CDK does not currently do >>> this and that I am not re-inventing the wheel here. And again, I suppose I >>> just misinterpreted the javadoc, sorry. >>> >>> As for how this could be done - well the simplest (and most expensive) way >>>> would be to enumerate every (connected?) subgraph of H, testing each one to >>>> see if it is also a subgraph of G. Gilleain Torrance >>> >>> >>> That's one method that I have tried, however, in many cases, the cost of >>> enumerating all the possible connected subgraphs of H is astronomical. For >>> instance, if we cut five bonds of a molecule that contains 40 bonds, we will >>> have about 7.5e10 fragments to enumerate. If we engage in growing fragments >>> with all possible branches, that also often leads to combinatorial >>> explosions. I have created an 'approximate solution', which is more >>> efficient than what I am outlining here, but it's still an approximation. >>> So, for all practical purposes, that is not a feasible approach. Well, it >>> would have been easier to do it as I mentioned, that all fragments beyond a >>> certain size would be reported in the generic collection at every iteration >>> of the subgraph isomorphism detection tool. This has already been done and >>> published numerous times, without the release of the source code. Oh well, >>> if no one currently does this, I guess I'll go ahead and reinvent the wheel >>> yet again. >>> >>> Any other opinions, anyone? I know you are there! >>> >>> Regards, >>> >>> Leonid Chepelev >>> >>> >> -------------- next part -------------- >> An HTML attachment was scrubbed... >> >> ------------------------------ >> >> Message: 4 >> Date: Mon, 8 Mar 2010 21:33:11 +0000 >> From: gilleain torrance <gilleain.torra...@gmail.com> >> Subject: Re: [Cdk-user] Subgraph Isomorphisms >> To: cdk-user@lists.sourceforge.net >> Message-ID: >> <d0d153661003081333x1c920044u23e6bd64e9169...@mail.gmail.com> >> Content-Type: text/plain; charset=ISO-8859-1 >> >> Hi, >> >> No problem. I should point out first that I am not a mathematician, >> but I have done some small work on implementing maximal common >> subgraph and subgraph isomorphism. Admittedly not in chemicals, but in >> protein folds. Terminology can differ, so I agree that the javadocs >> should be as clear as possible. Perhaps a sentence or two on what the >> code means when it says 'subgraph isomorphism' wouldn't go amiss. >> >> I don't know much about the workings of the UIT, but it does sound >> like the scheme you outline would work. Assuming, that is, that that >> every possible starting point is taken. In other words, for each atom >> of the smaller graph the substructure is extended as much as possible. >> I guess that must happen, otherwise it wouldn't work. >> >> Anyway, looking at the code, it seems that there is an object called >> an 'RGraph' or 'Resolution graph' that I would probably call an edge >> product graph. The maximal clique in an edge product graph is the >> maximum common substructure between the two original graphs. I haven't >> read enough of the code yet to know if this is what it is doing. The >> RGraph class is quite interesting though. >> >> You are quite right when you say that trying every possible subgraph >> would be very computationally expensive - it wasn't a very helpful >> suggestion on my part! What might also be helpful would be if I >> finished off the signature stuff, because then this task could be >> achieved by generating canonical signatures at heights up to the >> diameter of the smaller graph, and finding matches. Slightly more >> tricky would be to find the atom-atom mappings. >> >> Anyway, hope that the CDK can help with whatever this project is :) >> >> Gilleain >> >> On Mon, Mar 8, 2010 at 6:29 PM, Leonid Chepelev >> <leonid.chepe...@gmail.com> wrote: >>> Thank you for answering, Gilleain. >>>> >>>> Hello, The term 'subgraph isomorphism' is (I think) usually used to mean >>>> finding a subgraph of a graph G that is isomorphic to another graph H.?What >>>> you describe sounds like the more difficult problem of finding all common >>>> subgraphs of G and H. So, I am not so sure that the javadocs are unclear in >>>> this case. >>> >>> Hm, I couldn't really find an authoritative definition - and I am certain a >>> number of people could argue about this, but from the literature I've seen, >>> I always thought that two graphs may have a large number of isomorphisms, >>> and it is the largest isomorphism that is selected by procedures like >>> maximal common substructure searching, e.g. via the Ullman algorithm. When >>> one starts with an atom/bond pair in G1 and G2 and attempts to iteratively >>> extend the initial matched fragment, it would seem a trivial task to output >>> at every step of the iterative extension the subgraph, or a part of it, that >>> does not yet violate edge/node matching criteria set forth in the subgraph >>> isomorphism searching algorithm. As far as I understood, these subgraphs are >>> collected anyway for the purposes of comparison, unless superseded by an >>> extension, and only the result of the longest extension-matching procedure >>> is selected as the MCS. So, I erred and believed that the >>> getSubgraphAtomsMaps would give me precisely that, a collection of all the >>> fragments that match and their variations. >>>> >>>> There is the possibility that H may be found multiple times in G (a >>>> polymer, for example, will have multiple copies of the monomer in it), >>>> which >>>> is sumably why the return value of these methods are lists. >>> >>> Again, I thought that was for the MCS, but please, feel free to let me know >>> if I was wrong. I just wanted to make certain that CDK does not currently do >>> this and that I am not re-inventing the wheel here. And again, I suppose I >>> just misinterpreted the javadoc, sorry. >>>> >>>> ?As for how this could be done - well the simplest (and most expensive) >>>> way would be to enumerate every (connected?) subgraph of H, testing each >>>> one >>>> to see if it is also a subgraph of G. Gilleain Torrance >>> >>> That's one method that I have tried, however, in many cases, the cost of >>> enumerating all the possible connected subgraphs of H is astronomical. For >>> instance, if we cut five bonds of a molecule that contains 40 bonds, we will >>> have about 7.5e10 fragments to enumerate. If we engage in growing fragments >>> with all possible branches, that also often leads to combinatorial >>> explosions. I have created an 'approximate solution', which is more >>> efficient than what I am outlining here, but it's still an approximation. >>> So, for all practical purposes, that is not a feasible approach. Well, it >>> would have been easier to do it as I mentioned, that all fragments beyond a >>> certain size would be reported in the generic collection at every iteration >>> of the subgraph isomorphism detection tool. This has already been done and >>> published numerous times, without the release of the source code. Oh well, >>> if no one currently does this, I guess I'll go ahead and reinvent the wheel >>> yet again. >>> Any other opinions, anyone? I know you are there! >>> Regards, >>> Leonid Chepelev >>> >>> ------------------------------------------------------------------------------ >>> Download Intel® Parallel Studio Eval >>> Try the new software tools for yourself. Speed compiling, find bugs >>> proactively, and fine-tune applications for parallel performance. >>> See why Intel Parallel Studio got high marks during beta. >>> http://p.sf.net/sfu/intel-sw-dev >>> _______________________________________________ >>> Cdk-user mailing list >>> Cdk-user@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/cdk-user >>> >>> >> >> >> >> ------------------------------ >> >> ------------------------------------------------------------------------------ >> Download Intel® Parallel Studio Eval >> Try the new software tools for yourself. Speed compiling, find bugs >> proactively, and fine-tune applications for parallel performance. >> See why Intel Parallel Studio got high marks during beta. >> http://p.sf.net/sfu/intel-sw-dev >> >> ------------------------------ >> >> _______________________________________________ >> Cdk-user mailing list >> Cdk-user@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/cdk-user >> >> >> End of Cdk-user Digest, Vol 46, Issue 3 >> *************************************** > > > ------------------------------------------------------------------------------ > Download Intel® Parallel Studio Eval > Try the new software tools for yourself. Speed compiling, find bugs > proactively, and fine-tune applications for parallel performance. > See why Intel Parallel Studio got high marks during beta. > http://p.sf.net/sfu/intel-sw-dev > _______________________________________________ > Cdk-user mailing list > Cdk-user@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/cdk-user > ------------------------------------------------------------------------------ Download Intel® Parallel Studio Eval Try the new software tools for yourself. Speed compiling, find bugs proactively, and fine-tune applications for parallel performance. See why Intel Parallel Studio got high marks during beta. http://p.sf.net/sfu/intel-sw-dev _______________________________________________ Cdk-user mailing list Cdk-user@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/cdk-user