Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 Isomorphism Class

Jules Kerssemakers Thu, 16 Dec 2010 06:25:57 -0800

Hello Thomas,

Our system is actually a bit lazy about that. I just did a substructure
search for "C" in our system, which returns 12989 hits (from a catalog of
15700 compounds). This takes 28 seconds, nearly all of which is spent in the
substructure matching, as far as I can see from "top".
This is on a server machine: 2x quad-core Intel(R) Xeon(R) L5335 @ 2.00GHz
with 16 GB of memory. There is no explicit multi-thread optimisation, but
since a lot of things get passed via files between processes, there may be
some accidental multithreading.


Once again, here is how we do it (a slightly condensed version of my earlier
message). Our Python website script directs the process through the following
pipeline:

   - JME applet molfile variable is POST-ed to the website script
   - site script writes the molfile to an on-disk MOL file in a tmp dir
   - cmdline call to a Fortran program to canonicalize and calculate
   properties
   - cmdline call to a Java CDK-fingerprinter app on the query structure (no
   longer in current CDK), which writes to a disk file
   - Python website script reads the fingerprint from the disk file, reads the
   properties from a disk file, and constructs an SQL query including the
   fingerprint
   - all fingerprint-matching IDs are written to a disk file
   - C-shell script eats the matching-ID file and concatenates all molfiles
   from a disk directory into one large SDF file
   - cmdline call to the CDK substructureFinder app (no longer in current
   CDK), which eats the SDF file and writes a text file with the IDs matching
   the substructure
   - Python website script counts the number of lines in the
   substructure-ID text file (cmdline "wc -l"), reads the ID on the first line
   and shows that single molecule. Pagination ('next'/'prev') is done by
   skipping to the line number matching the current 'page'.
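
The fingerprint step in the pipeline above is a pre-screen: a candidate can
only contain the query as a substructure if every bit set in the query
fingerprint is also set in the candidate's fingerprint. Below is a minimal
Java sketch of that idea, assuming fingerprints are available as plain
java.util.BitSet objects. The class name, method names, and toy data are
invented for illustration; this is not the code of the system described
above, nor a CDK API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: all names and data here are invented.
public class FingerprintScreen {

    // A candidate can only contain the query as a substructure if every
    // bit set in the query fingerprint is also set in the candidate's.
    static boolean mayContain(BitSet candidate, BitSet query) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(candidate); // keep only query bits the candidate lacks
        return missing.isEmpty();
    }

    // Returns the IDs that survive the screen, in catalog order.
    static List<String> screen(Map<String, BitSet> catalog, BitSet query) {
        List<String> survivors = new ArrayList<>();
        for (Map.Entry<String, BitSet> entry : catalog.entrySet()) {
            if (mayContain(entry.getValue(), query)) {
                survivors.add(entry.getKey());
            }
        }
        return survivors;
    }

    // Small helper to build a toy fingerprint from bit positions.
    static BitSet bits(int... positions) {
        BitSet set = new BitSet();
        for (int p : positions) {
            set.set(p);
        }
        return set;
    }

    public static void main(String[] args) {
        Map<String, BitSet> catalog = new LinkedHashMap<>();
        catalog.put("mol-1", bits(1, 4, 9));
        catalog.put("mol-2", bits(1, 9));
        catalog.put("mol-3", bits(4, 9, 12));

        BitSet query = bits(4, 9);
        System.out.println(screen(catalog, query)); // prints [mol-1, mol-3]
    }
}
```

The screen can produce false positives but no false negatives, so the
survivors still have to go through the real substructure match. A query as
small as "C" sets very few bits, which is why almost the whole catalog
passes the screen and the matching step dominates the run time.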

Considering that our system uses 5 different programming languages
and creates 19 on-disk files for each query, I won't say this is the way to
go. The read-write cache of the hard disks probably saves us from most of
the I/O penalties this setup incurs.
(Also, in the interest of preserving my programming credibility, I wish to
add that I wasn't around when this was written ;-) )

Does this answer your question, Thomas?

Best regards,
Jules

On 15 December 2010 16:37, Thomas Strunz <beginn...@hotmail.de> wrote:

>  What do you guys actually do when a search returns a lot of hits?
>
> Maybe you mentioned it but I don't remember.
> Maximum hit limitation?
> How fast can you retrieve, say, 10k hits?
>
> Also, the hits actually have to be fetched/created twice, once for subgraph
> matching and later for display.
>
> The main limitation seems to be the whole AtomContainer implementation or
> object hierarchy. I tried to browse through the source code to see what
> actually happens when reading molfiles and, well, it is pretty complex.
>
> I don't see any possibility to improve the performance of my method: creating
> the objects takes long and "caching" them takes too much memory.
>
> Regards,
>
> Thomas
>
> ------------------------------
> Date: Tue, 14 Dec 2010 14:10:49 +0200
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> From: jeliazkova.n...@gmail.com
> To: beginn...@hotmail.de
> CC: j.kerssemak...@cmbi.ru.nl; cdk-user@lists.sourceforge.net
>
>
>
> On 14 December 2010 14:09, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>
> Why Molfiles? Well, simple to answer: my source was an SD file and I just
> imported that one. It's just so common and the standard way to
> share/move/migrate chemical "databases".
> The format seems pretty straightforward and rather simple. I don't believe
> CML would be faster (XML overhead) but I could certainly be wrong.
>
>
>
> As a matter of fact, MOL is only the default storage format in AMBIT;
> theoretically the database supports almost all CDK-supported IO formats.
> The first AMBIT version (a few years ago) used CML instead of MOL by
> default. It was later abandoned, not really for performance reasons, but
> because of some inconsistencies in the CML reader/writer code at that time.
> Things may have changed since. MOL was selected as the default for
> purely pragmatic reasons. The vast majority of files used are SDF or MOL,
> and rarely does anybody use (and share data in) CML in the context of
> toxicity predictions, so it doesn't make sense to do unnecessary format
> conversions. One can just read the MOL and dump it into the blob field.
>
> Being aware that MOL is not perfect (and having authored a few presentations
> explaining why CML is the best format for the DB), I would add a few
> discussion points here.
>
> CML (like any XML schema) is usually handled by an XML library, which adds
> its own overhead. DOM is particularly awful; fortunately there are SAX and
> StAX libraries as well, but this still adds more complexity than plain text
> (I wonder why plain-text JSON has gained such popularity these days ...).
>
> MOL is a simple text format, designed during times when memory and CPU were
> far more precious than nowadays, and more efficient for reading and writing.
>
>
>
> Regards,
> Nina
>
>
> From: j.kerssemak...@cmbi.ru.nl
> Date: Tue, 14 Dec 2010 12:11:03 +0100
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> To: beginn...@hotmail.de
> CC: jeliazkova.n...@gmail.com; cdk-user@lists.sourceforge.net
>
>
> One thing that strikes me as missing from this optimisation discussion:
> How do the other on-disk formats perform?
> Everybody always seems to go blindly for MOL-files but I've never
> understood why. Is there some inherent superiority in the MOL-format, or is
> it "just because everybody else uses it"?
> Why not any other of the IO-formats the CDK supports? (CML was mentioned a
> lot on these lists with hints of being the ultimate new format, but I don't
> know enough to speculate.)
>
> @PreparedStatements: Yes, that matters a LOT indeed. Good programming!
> For others: prepared statements allow the database to parse the
> query-string only once, and just replace the variables you put in, rather
> than (re-)parsing each individual query-string to get the query AND
> variables. The parsing can be skipped if the query part has the same
> structure each time, as is the case in this scenario. This saves a lot of
> (costly) text-interpreting.
>
> Cheers!
> Jules
>
> On 14 December 2010 10:46, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>  Hi all,
>
> I created 2 new tables to store aromaticity (1 for bonds, 1 for atoms) and
> populated them.
> I adjusted the "fetching" of molecules to use these tables instead of the
> CDKHueckelAromaticityDetector, but there was no speed improvement.
> (The PreparedStatements are passed in so that they can be reused by setting
> the statement's parameters instead of creating a new statement for each
> molecule; it matters a lot.)
>
>     private boolean setAromaticity(IMolecule mol,
>             PreparedStatement getAromaticAtoms,
>             PreparedStatement getAromaticBonds) throws SQLException {
>
>         boolean isAromatic = false;
>         ResultSet aromaticAtoms = getAromaticAtoms.executeQuery();
>         ResultSet aromaticBonds = getAromaticBonds.executeQuery();
>
>         try {
>
>             int atomIndex;
>
>             while (aromaticAtoms.next()) {
>                 atomIndex = aromaticAtoms.getInt(getAtomNumberColumnName());
>                 mol.getAtom(atomIndex).setFlag(CDKConstants.ISAROMATIC, true);
>                 isAromatic = true;
>             }
>
>             int bondIndex;
>
>             while (aromaticBonds.next()) {
>                 bondIndex = aromaticBonds.getInt(getBondNumberColumnName());
>                 mol.getBond(bondIndex).setFlag(CDKConstants.ISAROMATIC, true);
>             }
>
>             return isAromatic;
>
>         } finally {
>             if (aromaticAtoms != null) {
>                 aromaticAtoms.close();
>             }
>             if (aromaticBonds != null) {
>                 aromaticBonds.close();
>             }
>         }
>     }
>
> ------------------------------
> Date: Mon, 13 Dec 2010 21:56:24 +0200
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> From: jeliazkova.n...@gmail.com
> To: beginn...@hotmail.de
> CC: j.kerssemak...@cmbi.ru.nl; cdk-user@lists.sourceforge.net
>
>
>
>
> On 13 December 2010 21:35, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>
> Hi all,
>
>
>
> *Fortunately, the CDK code that reads MOL files adds atoms and bonds in
> the same order, as in the MOL file, otherwise, it would be trickier.*
>
> Yeah, I looked at the MDLV2000Reader source code, and if it does not change
> that should be fairly easy to achieve.
>
>
> Of course my next thought was: why not store all atoms and bonds and the
> relevant properties, so that you can just create the AtomContainer via
> setBonds and setAtoms?
>
>
>
>
> Because that would take up a lot of space? Hard to tell; I'm not so familiar
> (yet?) with the CDK source code and which properties of atoms and bonds are
> actually relevant for fingerprinting and subgraph matching.
>
>
> Atom types and aromaticity flags, in the first place.
>
>
>  Also it's kind of hard to actually see which properties/flags are available
> (set and get flags, CDKConstants).
> But anyway, what I'm trying to suggest or ask, or what popped into my mind,
> is: why not use Hibernate (or something similar; an idea which of course
> contradicts my previous comment that storing all aromatic atoms is
> stupid)? OK, I'm not very familiar with either (CDK or Hibernate; for
> example, how do you add an ID for Hibernate to an existing class?), and the
> CDK object hierarchy is, to my inexperienced eyes, rather complex and maybe
> not ideal for Hibernate, so this might be a ridiculous idea. Of course,
> creating the mapping files would be a rather tedious and annoying task, but
> you could clearly specify which information you actually want to store. OK,
> there would be a lot more rows and columns in the database, but each field
> would contain a lot less data compared to having a varchar/clob field for
> molfiles. Maybe it would not take that much more storage space than having
> molfiles, and it would probably perform better, especially compared to clob
> columns.
>
> end of brainstorming,
>
>
> Storing atoms and bonds separately is a valid option, as is using
> Hibernate. What is important is not really the amount of storage, but how
> much faster or slower it is to read atoms and bonds from different columns
> and rows, rather than from a single blob field for the mol file. I could
> imagine that reading many columns and rows and combining them is slower, but
> I haven't seen a benchmark; it would be interesting if there is one.
>
> Nina
>
>
>
> Thomas
>
>
>
> > From: j.kerssemak...@cmbi.ru.nl
> > Date: Mon, 13 Dec 2010 15:45:19 +0100
>
> > Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and
> > cdk-1.3.7 Isomorphism Class
> > To: jeliazkova.n...@gmail.com
> > CC: beginn...@hotmail.de; cdk-user@lists.sourceforge.net
>
> >
> > Just a short note to mention that I'm closely following this topic. A
> > major rewrite of our own database system is somewhere in the near
> > future, so this is good reading! Thanks for sharing!
> >
> > ~Jules Kerssemakers
> >
> > On 13 December 2010 08:51, Nina Jeliazkova <jeliazkova.n...@gmail.com>
> > wrote:
> > > Hi Thomas,
> > >
> > > On 10 December 2010 20:04, Thomas Strunz <beginn...@hotmail.de> wrote:
> > >>
> > >> Sorry for calling you stupid. ;)
> > >>
> > >
> > > ;)
> > >
> > >>
> > >>  I just meant if you have like 100'000 molecules and assuming 25 % are
> > >> aromatic, probably mostly benzene rings = 6 atoms + 6 bonds, that leads
> > >> to 12 * 25'000 = 300'000 records. OK, that's manageable since it's only
> > >> an ID and a bit. But it depends mostly on the dataset. My focus is on
> > >> smaller molecules. That is probably also the reason why graph matching
> > >> does not seem to be that big of a problem.
> > >
> > >  Just a single field with all the additional info for atoms and bonds.
> > > Not pretending this is the best way, just a simple one.
> > >>
> > >> How do you map a certain atom or bond from the database to the right
> > >> one in the AtomContainer created from the molfile?
> > >> Does the Atom class also have an ID like the Molecule class? Then it
> > >> would not be that difficult.
> > >>
> > >
> > > Fortunately, the CDK code that reads MOL files adds atoms and bonds in
> > > the same order as in the MOL file; otherwise, it would be trickier.
> > > Regards,
> > > Nina
> > >>
> > >> have a nice weekend
> > >>
> > >> Regards,
> > >>
> > >> Thomas


_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user
