Hi,

No problem. I should point out first that I am not a mathematician,
but I have done some small work on implementing maximal common
subgraph and subgraph isomorphism. Admittedly not in chemicals, but in
protein folds. Terminology can differ, so I agree that the javadocs
should be as clear as possible. Perhaps a sentence or two on what the
code means when it says 'subgraph isomorphism' wouldn't go amiss.

I don't know much about the workings of the UIT, but it does sound
like the scheme you outline would work. Assuming, that is, that that
every possible starting point is taken. In other words, for each atom
of the smaller graph the substructure is extended as much as possible.
I guess that must happen, otherwise it wouldn't work.

Anyway, looking at the code, it seems that there is an object called
an 'RGraph' or 'Resolution graph' that I would probably call an edge
product graph. The maximal clique in an edge product graph is the
maximum common substructure between the two original graphs. I haven't
read enough of the code yet to know if this is what it is doing. The
RGraph class is quite interesting though.

You are quite right when you say that trying every possible subgraph
would be very computationally expensive - it wasn't a very helpful
suggestion on my part! What might also be helpful would be if I
finished off the signature stuff, because then this task could be
achieved by generating canonical signatures at heights up to the
diameter of the smaller graph, and finding matches. Slightly more
tricky would be to find the atom-atom mappings.

Anyway, hope that the CDK can help with whatever this project is :)

Gilleain

On Mon, Mar 8, 2010 at 6:29 PM, Leonid Chepelev
<[email protected]> wrote:
> Thank you for answering, Gilleain.
>>
>> Hello, The term 'subgraph isomorphism' is (I think) usually used to mean
>> finding a subgraph of a graph G that is isomorphic to another graph H. What
>> you describe sounds like the more difficult problem of finding all common
>> subgraphs of G and H. So, I am not so sure that the javadocs are unclear in
>> this case.
>
> Hm, I couldn't really find an authoritative definition - and I am certain a
> number of people could argue about this, but from the literature I've seen,
> I always thought that two graphs may have a large number of isomorphisms,
> and it is the largest isomorphism that is selected by procedures like
> maximal common substructure searching, e.g. via the Ullman algorithm. When
> one starts with an atom/bond pair in G1 and G2 and attempts to iteratively
> extend the initial matched fragment, it would seem a trivial task to output
> at every step of the iterative extension the subgraph, or a part of it, that
> does not yet violate edge/node matching criteria set forth in the subgraph
> isomorphism searching algorithm. As far as I understood, these subgraphs are
> collected anyway for the purposes of comparison, unless superseded by an
> extension, and only the result of the longest extension-matching procedure
> is selected as the MCS. So, I erred and believed that the
> getSubgraphAtomsMaps would give me precisely that, a collection of all the
> fragments that match and their variations.
>>
>> There is the possibility that H may be found multiple times in G (a
>> polymer, for example, will have multiple copies of the monomer in it), which
>> is sumably why the return value of these methods are lists.
>
> Again, I thought that was for the MCS, but please, feel free to let me know
> if I was wrong. I just wanted to make certain that CDK does not currently do
> this and that I am not re-inventing the wheel here. And again, I suppose I
> just misinterpreted the javadoc, sorry.
>>
>>  As for how this could be done - well the simplest (and most expensive)
>> way would be to enumerate every (connected?) subgraph of H, testing each one
>> to see if it is also a subgraph of G. Gilleain Torrance
>
> That's one method that I have tried, however, in many cases, the cost of
> enumerating all the possible connected subgraphs of H is astronomical. For
> instance, if we cut five bonds of a molecule that contains 40 bonds, we will
> have about 7.5e10 fragments to enumerate. If we engage in growing fragments
> with all possible branches, that also often leads to combinatorial
> explosions. I have created an 'approximate solution', which is more
> efficient than what I am outlining here, but it's still an approximation.
> So, for all practical purposes, that is not a feasible approach. Well, it
> would have been easier to do it as I mentioned, that all fragments beyond a
> certain size would be reported in the generic collection at every iteration
> of the subgraph isomorphism detection tool. This has already been done and
> published numerous times, without the release of the source code. Oh well,
> if no one currently does this, I guess I'll go ahead and reinvent the wheel
> yet again.
> Any other opinions, anyone? I know you are there!
> Regards,
> Leonid Chepelev
>
> ------------------------------------------------------------------------------
> Download Intel&#174; Parallel Studio Eval
> Try the new software tools for yourself. Speed compiling, find bugs
> proactively, and fine-tune applications for parallel performance.
> See why Intel Parallel Studio got high marks during beta.
> http://p.sf.net/sfu/intel-sw-dev
> _______________________________________________
> Cdk-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/cdk-user
>
>

------------------------------------------------------------------------------
Download Intel&#174; Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
Cdk-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cdk-user

Reply via email to