Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 Isomorphism Class

Thomas Strunz Wed, 15 Dec 2010 07:39:08 -0800

What do you guys actually do when a search returns a lot of hits?

Maybe you mentioned it but I don't remember.
Maximum hit limitation?
How fast can you retrieve, say, 10k hits?

Also, the hits actually have to be fetched/created twice: once for subgraph
matching and later for display.

The main limitation seems to be the whole AtomContainer implementation or
object hierarchy. I tried to browse through the source code to see what actually
happens when reading molfiles and, well, it is pretty complex.

I don't see any possibility to improve the performance of my method: creating the
objects takes long and "caching" them takes too much memory.
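
One way to keep a cache from eating all the memory is to bound it and evict the least recently used entries. The following is only a minimal sketch built on the JDK's LinkedHashMap in access order; the class name BoundedMoleculeCache and the generic value type V (standing in for a parsed AtomContainer) are assumptions for illustration, not code from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical bounded cache keyed by database id. V stands in for whatever
 * parsed molecule object is expensive to create (e.g. an AtomContainer).
 */
final class BoundedMoleculeCache<V> extends LinkedHashMap<Integer, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedMoleculeCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true evicts
        // the least recently used entry.
        return size() > maxEntries;
    }
}
```

The memory ceiling then becomes roughly maxEntries times the size of one parsed molecule, at the price of re-parsing on a cache miss.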

Regards,

Thomas

Date: Tue, 14 Dec 2010 14:10:49 +0200
Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 
Isomorphism Class
From: jeliazkova.n...@gmail.com
To: beginn...@hotmail.de
CC: j.kerssemak...@cmbi.ru.nl; cdk-user@lists.sourceforge.net

On 14 December 2010 14:09, Thomas Strunz <beginn...@hotmail.de> wrote:

Why Molfiles? Well, simple to answer: my source was an sdfile and I just
imported that one. It's just so common and the standard to share/move/migrate
chemical "databases".
The format seems pretty straightforward and rather simple. I don't believe CML
would be faster (XML overhead) but I could certainly be wrong.

As a matter of fact, MOL is only the default storage in AMBIT;
theoretically the database supports almost all of the CDK-supported IO
formats. The first AMBIT version (a few years ago) used CML instead
of MOL by default. It was later abandoned, not really for performance
reasons, but because of some inconsistencies in the CML reader/writer
code at that time. Things may have changed since. The default MOL format
was selected for purely pragmatic reasons. The vast majority of
files used are SDF or MOL, and there is rarely anybody who uses (and shares
data in) CML in the context of toxicity predictions, so it doesn't
make sense to do unnecessary format conversions. One can just read the
MOL and dump it into the blob field.

Being aware that MOL is not perfect (and having authored a few presentations
explaining why CML is the best format for the DB), I would add a few
discussion points here.

CML (like any XML schema) is usually handled by an XML library, which adds
its overhead. DOM is particularly awful; fortunately there are SAX and
StAX libraries as well, but this still adds more complexity than plain
text (I wonder why plain-text JSON has gained such popularity these days
...).

MOL is a simple text format, designed at a time when memory and CPU
were far more precious than nowadays, and it is more efficient for reading and
writing.
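
To illustrate how little parsing the fixed-width layout needs, here is a rough sketch that pulls the atom and bond counts and the element symbols out of a V2000 molblock with plain string slicing. It is not the CDK's MDLV2000Reader and it skips everything a real reader must handle (charges, property lines, V3000, error handling); the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the CDK. */
final class MolBlockPeek {

    /** Returns {atomCount, bondCount} read from the counts line (4th line). */
    static int[] counts(String molBlock) {
        String countsLine = molBlock.split("\r?\n", -1)[3];
        // V2000 counts line: columns 1-3 hold the atom count, columns 4-6 the bond count
        int atoms = Integer.parseInt(countsLine.substring(0, 3).trim());
        int bonds = Integer.parseInt(countsLine.substring(3, 6).trim());
        return new int[] {atoms, bonds};
    }

    /** Returns the element symbols in the order the atoms appear in the atom block. */
    static List<String> atomSymbols(String molBlock) {
        String[] lines = molBlock.split("\r?\n", -1);
        int atomCount = counts(molBlock)[0];
        List<String> symbols = new ArrayList<>();
        for (int i = 0; i < atomCount; i++) {
            String atomLine = lines[4 + i];
            // Atom line: x, y, z in three 10-character fields, one space,
            // then the element symbol in columns 32-34
            int end = Math.min(34, atomLine.length());
            symbols.add(atomLine.substring(31, end).trim());
        }
        return symbols;
    }
}
```

Because the atom block is read top to bottom, the list index of each symbol is also its position in the MOL file.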

Regards,

Nina 

From: j.kerssemak...@cmbi.ru.nl

Date: Tue, 14 Dec 2010 12:11:03 +0100
Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 
Isomorphism Class
To: beginn...@hotmail.de

CC: jeliazkova.n...@gmail.com; cdk-user@lists.sourceforge.net

One thing that strikes me as missing from this optimisation discussion:
How do the other on-disk formats perform?
Everybody always seems to go blindly for MOL-files but I've never understood 
why. Is there some inherent superiority in the MOL-format, or is it "just 
because everybody else uses it"?

Why not any other of the IO-formats the CDK supports? (CML was mentioned a lot 
on these lists with hints of being the ultimate new format, but I don't know 
enough to speculate.)

@PreparedStatements: Yes, that matters a LOT indeed. Good programming!

For others: prepared statements allow the database to parse the query-string 
only once, and just replace the variables you put in, rather than (re-)parsing 
each individual query-string to get the query AND variables. The parsing can be 
skipped if the query part has the same structure each time, as is the case in 
this scenario. This saves a lot of (costly) text-interpreting.
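
As a minimal sketch of that pattern in JDBC: the statement is prepared once and only its parameter changes inside the loop. The table name aromatic_atoms, its columns and the class name are hypothetical and were not given in this thread; the Connection is assumed to be supplied by the caller.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical loader; table and column names are assumptions. */
final class AromaticAtomLoader {
    static final String SQL =
            "SELECT atom_number FROM aromatic_atoms WHERE molecule_id = ?";

    /** Returns, for each molecule id, the list of its aromatic atom numbers. */
    static List<List<Integer>> load(Connection con, int[] moleculeIds) throws SQLException {
        List<List<Integer>> result = new ArrayList<>();
        // Prepared once: the query string is handed to the driver a single time
        try (PreparedStatement ps = con.prepareStatement(SQL)) {
            for (int id : moleculeIds) {
                // Only the parameter changes between executions
                ps.setInt(1, id);
                List<Integer> atoms = new ArrayList<>();
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        atoms.add(rs.getInt(1));
                    }
                }
                result.add(atoms);
            }
        }
        return result;
    }
}
```

The contrast is calling con.prepareStatement (or building a new query string) inside the loop, which makes the database parse the same query text again for every molecule.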

Cheers!
Jules

On 14 December 2010 10:46, Thomas Strunz <beginn...@hotmail.de> wrote:

Hi all,

I created 2 new tables to store aromaticity (1 for bonds, 1 for atoms) and
populated them.
I adjusted the "fetching" of molecules to use these tables instead of the
CDKHueckelAromaticityDetector, but there was no speed improvement.

(The PreparedStatements are passed in so that they can be reused by setting the
statement's parameters instead of creating a new statement for each molecule; it
matters a lot.)

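    // Imports needed by this method (CDK 1.3.x package names):
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.openscience.cdk.CDKConstants;
    import org.openscience.cdk.interfaces.IMolecule;

    /**
     * Flags the atoms and bonds of mol as aromatic using the two lookup tables.
     * The caller is expected to have set the parameters on both statements.
     * Returns true if at least one aromatic atom was found.
     */
    private boolean setAromaticity(IMolecule mol,
            PreparedStatement getAromaticAtoms,
            PreparedStatement getAromaticBonds) throws SQLException {

        boolean isAromatic = false;
        ResultSet aromaticAtoms = null;
        ResultSet aromaticBonds = null;
        try {
            // Executed inside the try block so that the first result set is
            // still closed if the second query fails.
            aromaticAtoms = getAromaticAtoms.executeQuery();
            aromaticBonds = getAromaticBonds.executeQuery();

            // The stored numbers are used directly as AtomContainer indices.
            while (aromaticAtoms.next()) {
                int atomIndex = aromaticAtoms.getInt(getAtomNumberColumnName());
                mol.getAtom(atomIndex).setFlag(CDKConstants.ISAROMATIC, true);
                isAromatic = true;
            }
            while (aromaticBonds.next()) {
                int bondIndex = aromaticBonds.getInt(getBondNumberColumnName());
                mol.getBond(bondIndex).setFlag(CDKConstants.ISAROMATIC, true);
            }
            return isAromatic;
        } finally {
            if (aromaticAtoms != null) {
                aromaticAtoms.close();
            }
            if (aromaticBonds != null) {
                aromaticBonds.close();
            }
        }
    }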

Date: Mon, 13 Dec 2010 21:56:24 +0200
Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 
Isomorphism Class

From: jeliazkova.n...@gmail.com
To: beginn...@hotmail.de
CC: j.kerssemak...@cmbi.ru.nl; cdk-user@lists.sourceforge.net

On 13 December 2010 21:35, Thomas Strunz <beginn...@hotmail.de> wrote:

Hi all,

Fortunately, the CDK code that reads MOL files adds 
atoms and bonds in the same order, as in the MOL file, otherwise, it 
would be trickier.

Yeah, I looked at the MDLV2000Reader source code, and if it does not change, that
should be fairly easy to achieve.

Of course my next thought was: why not store all atoms and bonds and the
relevant properties? Then you could just create the AtomContainer via
setBonds and setAtoms.

Because that would take up a lot of space? Hard to tell; I'm not
so familiar (yet?) with the CDK source code and with which properties of atoms
and bonds are actually relevant for fingerprinting and subgraph
matching.
Atom types and aromaticity flags in the first place.

Also, it's kind of hard to actually see what properties/flags
are available (set and get Flags, CDKConstants).

But anyway, what I'm trying to suggest or ask, or what popped into my mind,
is: why not use Hibernate (or something similar; an idea which of course
contradicts my previous comment about storing all aromatic atoms being
stupid)? OK, I'm not very familiar with either (CDK or Hibernate; for example,
how do you add an id for Hibernate to an existing class?), and the CDK object
hierarchy is, to my inexperienced eyes, rather complex and maybe not ideal for
Hibernate, so this might be a ridiculous idea. Of course, creating the mapping
file would be a rather tedious and annoying task, but you could clearly specify
which information you actually want to store. OK, there would be a lot more
rows and columns in the database, but each field would contain a lot less
data compared to having a varchar/clob field for molfiles. Maybe it would
not take that much more storage space than having molfiles, and it would
probably perform better, especially compared to clob columns.

end of brainstorming,

Storing atoms and bonds separately is a valid option, as is using
Hibernate. What is important is not really the amount of storage, but how
much faster or slower it is to read atoms and bonds from different columns and
rows, rather than from a single blob field for the MOL file. I could imagine
that reading many columns and rows and combining them is slower, but I haven't
seen a benchmark; it would be interesting if there is one.
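
A possible middle ground between many small rows and a single MOL blob is one compact binary value per molecule that holds the aromaticity flags by atom (or bond) index. The sketch below shows only the packing and unpacking with java.util.BitSet; the class name, the idea of keeping the bytes in a binary column next to the molfile, and the use of plain boolean arrays instead of CDK objects are assumptions for illustration. It relies on atoms and bonds keeping their MOL-file order, and whether it is actually faster than separate tables would still have to be measured.

```java
import java.util.BitSet;

/** Hypothetical helper: packs per-atom (or per-bond) aromaticity flags into bytes. */
final class AromaticityFlags {

    /** Bit i is set when atom (or bond) i, in MOL-file order, is aromatic. */
    static byte[] pack(boolean[] aromatic) {
        BitSet bits = new BitSet(aromatic.length);
        for (int i = 0; i < aromatic.length; i++) {
            if (aromatic[i]) {
                bits.set(i);
            }
        }
        // Trailing zero bytes are dropped, so a non-aromatic molecule packs to an empty array
        return bits.toByteArray();
    }

    /** Restores the flags; count is the number of atoms (or bonds) in the molecule. */
    static boolean[] unpack(byte[] stored, int count) {
        BitSet bits = BitSet.valueOf(stored);
        boolean[] aromatic = new boolean[count];
        for (int i = 0; i < count; i++) {
            aromatic[i] = bits.get(i);
        }
        return aromatic;
    }
}
```

For a molecule with up to eight atoms this is a single byte, so the per-molecule overhead stays small even for large tables.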

Nina

Thomas

> From: j.kerssemak...@cmbi.ru.nl
> Date: Mon, 13 Dec 2010 15:45:19 +0100
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 
> Isomorphism Class

> To: jeliazkova.n...@gmail.com
> CC: beginn...@hotmail.de; cdk-user@lists.sourceforge.net

> 
> Just a short note to mention that I'm closely following this topic. A
> major rewrite of our own database system is somewhere in the near
> future, so this is good reading! Thanks for sharing!

> 
> ~Jules Kerssemakers
> 
> On 13 December 2010 08:51, Nina Jeliazkova <jeliazkova.n...@gmail.com> wrote:
> > Hi Thomas,

> >
> > On 10 December 2010 20:04, Thomas Strunz <beginn...@hotmail.de> wrote:
> >>
> >> Sorry for calling you stupid. ;)

> >>
> >
> > ;)
> >
> >>
> >>  I just meant if you have like 100'000 molecules and assuming 25 % are
> >> aromatic, probably mostly benzene rings = 6 atoms + 6 bonds, that leads to
> >> 12 * 25'000 = 300'000 records. OK, that's manageable since it's only an ID
> >> and a bit. But it depends mostly on the dataset. My focus is on smaller
> >> molecules. Probably also the reason why graph matching does not seem to be
> >> that big of a problem.
> >
> >  just single field with all the additional info for atoms and bonds. Not
> > pretending this is the best way, just a simple one.
> >>
> >> How do you map a certain atom or bond from the database to the right one
> >> in the AtomContainer created from the molfile?
> >> Does the Atom class also have an id like the Molecule class? Then it would
> >> not be that difficult.
> >>
> >
> > Fortunately, the CDK code that reads MOL files adds atoms and bonds in the
> > same order, as in the MOL file, otherwise, it would be trickier.
> > Regards,
> > Nina
> >>
> >> have a nice weekend
> >>
> >> Regards,
> >>

> >> Thomas


_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user
