One thing that strikes me as missing from this optimisation discussion:
How do the other on-disk formats perform?
Everybody always seems to go blindly for MOL-files but I've never understood
why. Is there some inherent superiority in the MOL-format, or is it "just
because everybody else uses it"?
Why not any other of the IO-formats the CDK supports? (CML was mentioned a
lot on these lists with hints of being the ultimate new format, but I don't
know enough to speculate.)
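One way to replace speculation with numbers would be to time the readers side by side on the same molecules. Below is a rough sketch in plain Java, assuming each format's reader is wrapped in a Supplier. FormatTiming and averageNanos are hypothetical names, not CDK API, and naive wall-clock timing like this only gives indicative figures.

```java
import java.util.function.Supplier;

final class FormatTiming {

    /**
     * Returns the average wall-clock time in nanoseconds of one call to
     * parse.get(), measured over `runs` calls after `warmup` untimed calls.
     */
    static long averageNanos(Supplier<?> parse, int warmup, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        // Untimed warm-up calls give the JIT a chance to compile the reader.
        for (int i = 0; i < warmup; i++) {
            parse.get();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            parse.get();
        }
        return (System.nanoTime() - start) / runs;
    }
}
```

Passing one lambda that parses a molecule from a MOL string and another that parses the same molecule from CML would give a first, rough comparison.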
@PreparedStatements: Yes, that matters a LOT indeed. Good programming!
For others: prepared statements allow the database to parse the query-string
only once, and just replace the variables you put in, rather than
(re-)parsing each individual query-string to get the query AND variables.
The parsing can be skipped if the query part has the same structure each
time, as is the case in this scenario. This saves a lot of (costly)
text-interpreting.
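As a rough sketch of that reuse pattern in plain JDBC: the statement is prepared once, and only its parameter changes per molecule. The table, column and class names here are made up for illustration and are not taken from Thomas's schema.

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class PreparedStatementReuse {

    // Hypothetical query: the table and column names are assumptions.
    static final String AROMATIC_ATOMS_SQL =
            "SELECT atom_number FROM aromatic_atoms WHERE molecule_id = ?";

    /**
     * Counts the aromatic-atom rows for each molecule id using ONE statement.
     * The caller prepares it once, e.g. with
     * connection.prepareStatement(AROMATIC_ATOMS_SQL), and passes it in.
     */
    static int[] countAromaticAtoms(PreparedStatement stmt, int[] moleculeIds)
            throws SQLException {
        int[] counts = new int[moleculeIds.length];
        for (int i = 0; i < moleculeIds.length; i++) {
            // Only the parameter changes; the query text is not re-sent.
            stmt.setInt(1, moleculeIds[i]);
            try (ResultSet rows = stmt.executeQuery()) {
                while (rows.next()) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }
}
```

The alternative, building a new query string per molecule and calling createStatement() each time, would hand the database a fresh text to parse for every single molecule.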
Cheers!
Jules
On 14 December 2010 10:46, Thomas Strunz <beginn...@hotmail.de> wrote:
> Hi all,
>
> I created 2 new tables to store aromaticity (1 for bonds, 1 for atoms) and
> populated them.
> I adjusted the "fetching" of molecules to use these tables instead of the
> CDKHueckelAromaticityDetector. But no speed improvement.
> (The PreparedStatements are passed in so that they can be reused by setting
> each statement's parameters instead of creating a new statement for each
> Molecule; it matters a lot.)
>
> private boolean setAromaticity(IMolecule mol,
>         PreparedStatement getAromaticAtoms,
>         PreparedStatement getAromaticBonds) throws SQLException {
>
>     boolean isAromatic = false;
>     ResultSet aromaticAtoms = null;
>     ResultSet aromaticBonds = null;
>
>     try {
>         // Open both result sets inside the try block so that the first one
>         // is still closed if the second query throws.
>         aromaticAtoms = getAromaticAtoms.executeQuery();
>         aromaticBonds = getAromaticBonds.executeQuery();
>
>         // Flag every atom returned by the aromatic-atoms query.
>         while (aromaticAtoms.next()) {
>             int atomIndex = aromaticAtoms.getInt(getAtomNumberColumnName());
>             mol.getAtom(atomIndex).setFlag(CDKConstants.ISAROMATIC, true);
>             isAromatic = true;
>         }
>
>         // Flag every bond returned by the aromatic-bonds query.
>         while (aromaticBonds.next()) {
>             int bondIndex = aromaticBonds.getInt(getBondNumberColumnName());
>             mol.getBond(bondIndex).setFlag(CDKConstants.ISAROMATIC, true);
>         }
>
>         return isAromatic;
>
>     } finally {
>         if (aromaticAtoms != null) {
>             aromaticAtoms.close();
>         }
>         if (aromaticBonds != null) {
>             aromaticBonds.close();
>         }
>     }
> }
>
> ------------------------------
> Date: Mon, 13 Dec 2010 21:56:24 +0200
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> From: jeliazkova.n...@gmail.com
> To: beginn...@hotmail.de
> CC: j.kerssemak...@cmbi.ru.nl; cdk-user@lists.sourceforge.net
>
>
>
>
> On 13 December 2010 21:35, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>
> Hi all,
>
>
>
> *Fortunately, the CDK code that reads MOL files adds atoms and bonds in
> the same order as in the MOL file; otherwise, it would be trickier.*
>
> Yeah I looked at the MDLV2000Reader Source code and if it does not change
> that should be fairly easy to achieve.
>
>
> Of course my next thought was: why not store all atoms and bonds and the
> relevant properties? Then you could just create the AtomContainer with
> setBonds and setAtoms.
>
>
>
>
> Because that would take up a lot of space? Hard to tell; I'm not so familiar
> (yet?) with the CDK source code and which properties of atoms and bonds are
> actually relevant for fingerprinting and subgraph matching.
>
>
> Atom types and aromaticity flags, in the first place.
>
>
> Also, it's kind of hard to actually see which properties/flags are available
> (set and get flags, CDKConstants).
> But anyway, what I'm trying to suggest or ask, or what popped into my mind,
> is: why not use Hibernate (or something similar; an idea which of course
> contradicts my previous comment about storing all aromatic atoms being
> stupid)? OK, I'm not very familiar with either (CDK or Hibernate; for
> example, how do you add an id for Hibernate to an existing class?), and the
> CDK object hierarchy is, in my inexperienced eyes, rather complex and maybe
> not ideal for Hibernate, so this might be a ridiculous idea. Of course,
> creating the mapping files would be a rather tedious and annoying task, but
> you could clearly specify which information you actually want to store. OK,
> there would be a lot more rows and columns in the database, but each field
> would contain a lot less data compared to having a varchar/clob field for
> molfiles. Maybe it would not take that much more storage space than having
> molfiles, and it would probably perform better, especially compared to clob
> columns.
>
> end of brainstorming,
>
>
> Storing atoms and bonds separately is a valid option, as is using
> Hibernate. What is important is not really the amount of storage, but how
> much faster or slower it is to read atoms and bonds from different columns
> and rows rather than from a single blob field holding the MOL file. I could
> imagine that reading many columns and rows and combining them is slower, but
> I haven't seen a benchmark; it would be interesting if there is one.
>
> Nina
>
>
>
> Thomas
>
>
>
> > From: j.kerssemak...@cmbi.ru.nl
> > Date: Mon, 13 Dec 2010 15:45:19 +0100
>
> > Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 Isomorphism Class
> > To: jeliazkova.n...@gmail.com
> > CC: beginn...@hotmail.de; cdk-user@lists.sourceforge.net
>
> >
> > Just a short note to mention that I'm closely following this topic. A
> > major rewrite of our own database system is somewhere in the near
> > future, so this is good reading! Thanks for sharing!
> >
> > ~Jules Kerssemakers
> >
> > On 13 December 2010 08:51, Nina Jeliazkova <jeliazkova.n...@gmail.com> wrote:
> > > Hi Thomas,
> > >
> > > On 10 December 2010 20:04, Thomas Strunz <beginn...@hotmail.de> wrote:
> > >>
> > >> Sorry for calling you stupid. ;)
> > >>
> > >
> > > ;)
> > >
> > >>
> > >> I just meant if you have like 100'000 molecules and assume 25 % are
> > >> aromatic, probably mostly benzene rings (6 atoms + 6 bonds), that leads
> > >> to 12 * 25'000 = 300'000 records. OK, that's manageable since it's only
> > >> an ID and a bit. But it depends mostly on the dataset. My focus is on
> > >> smaller molecules. Probably also the reason why graph matching does not
> > >> seem to be that big of a problem.
> > >
> > > Just a single field with all the additional info for atoms and bonds.
> > > Not pretending this is the best way, just a simple one.
> > >>
> > >> How do you map a certain atom or bond from the database to the right
> > >> one in the AtomContainer created from the molfile?
> > >> Does the Atom class also have an id like the Molecule class? Then it
> > >> would not be that difficult.
> > >>
> > >
> > > Fortunately, the CDK code that reads MOL files adds atoms and bonds in
> > > the same order as in the MOL file; otherwise, it would be trickier.
> > > Regards,
> > > Nina
> > >>
> > >> have a nice weekend
> > >>
> > >> Regards,
> > >>
> > >> Thomas
_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user