Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 Isomorphism Class

Nina Jeliazkova Mon, 06 Dec 2010 12:05:46 -0800

Hi,

On 4 December 2010 16:41, Thomas Strunz <beginn...@hotmail.de> wrote:


>  Hi Nina,
>
> thanks for your fast response. I will probably start storing certain
> properties in the DB because, as you mentioned, serialization has its
> drawbacks. I would of course keep the MolFile, but then the two must be
> kept in sync by the application.
> About thread safety: I don't do anything special, I just use threads to put
> items into or take them out of a BlockingQueue, so that if, for example,
> screening finds a hit I can instantly get it from the DB and start subgraph
> matching while screening continues.
>
> See code at end of message. The issue might be that I use "IN()" in my
> statements. So a usual statement will look like:
>
> SELECT molid, molecule FROM moltable WHERE molid IN(<List of molids>)
>
> Maybe the IN statement with a large list is bad? Or creating the statement
> (I do use a StringBuilder for it).
>

"IN" might indeed be bad; SQL EXPLAIN could help here.

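A minimal sketch of the EXPLAIN suggestion, assuming plain JDBC: it prefixes
the query from the quoted message with EXPLAIN and prints whatever plan rows
the database returns. The class and method names are hypothetical, the table
and column names are copied from the quoted query, and the plan layout differs
between databases, so this is only a starting point.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

/** Hypothetical helper, not part of CDK or of the code quoted in this thread. */
final class ExplainInQuery {

    /** Builds the EXPLAIN statement; the ids are ints, so inlining them is safe. */
    static String buildExplainSql(List<Integer> molIds) {
        if (molIds.isEmpty()) {
            throw new IllegalArgumentException("need at least one molid");
        }
        StringBuilder sql = new StringBuilder(
                "EXPLAIN SELECT molid, molecule FROM moltable WHERE molid IN (");
        for (int i = 0; i < molIds.size(); i++) {
            if (i > 0) {
                sql.append(", ");
            }
            sql.append(molIds.get(i).intValue());
        }
        return sql.append(')').toString();
    }

    /** Prints every column of every plan row; the layout is database-specific. */
    static void printPlan(Connection connection, List<Integer> molIds)
            throws SQLException {
        try (Statement stmt = connection.createStatement();
             ResultSet plan = stmt.executeQuery(buildExplainSql(molIds))) {
            ResultSetMetaData meta = plan.getMetaData();
            while (plan.next()) {
                StringBuilder row = new StringBuilder();
                for (int c = 1; c <= meta.getColumnCount(); c++) {
                    row.append(meta.getColumnLabel(c)).append('=')
                       .append(plan.getString(c)).append("  ");
                }
                System.out.println(row.toString().trim());
            }
        }
    }
}
```

Comparing the plan for a short and a long id list should show whether the
primary key index on molid is still used when the list grows.
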
From my experience, loading all molecule fields with a single SQL query is
not good for performance. Since all the blobs are actually loaded into memory
by the ResultSet, it will lead to out-of-memory errors with a sufficiently
large number of compounds.
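
A minimal sketch of the alternative, assuming plain JDBC and the table and
column names from the quoted query: candidate ids come from prescreening, and
each structure is fetched with its own single-row query so that only one
molfile is held at a time. The class, the method names, and the use of a
Predicate as a stand-in for the subgraph matcher are all hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;
import java.util.function.Predicate;

/** Hypothetical sketch; not the code used in ambit or in the quoted message. */
final class OneByOneSearch {

    /** Fetches a single molfile, so the ResultSet never holds more than one row. */
    static String loadMolfile(Connection connection, int molId) throws SQLException {
        String sql = "SELECT molecule FROM moltable WHERE molid = ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setInt(1, molId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }

    /**
     * Loads and tests candidates one at a time; only the ids of the hits are kept.
     * The loader would typically wrap loadMolfile; the matcher stands in for
     * molfile parsing plus subgraph matching.
     */
    static List<Integer> search(List<Integer> candidateIds,
                                IntFunction<String> loader,
                                Predicate<String> matcher) {
        List<Integer> hits = new ArrayList<>();
        for (int id : candidateIds) {
            String molfile = loader.apply(id); // one structure in memory at a time
            if (molfile != null && matcher.test(molfile)) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

Passing the loader and matcher as functions keeps the loop testable without a
database; the trade-off is one round trip per candidate instead of one large
result set.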

Also, LinkedBlockingQueue has in practice unlimited length (the capacity, if
unspecified, is equal to Integer.MAX_VALUE), which means that if the
isomorphism procedure is not fast enough to empty the queue, you will have a
lot of molecules in memory twice (once as blobs from the ResultSet and once as
IAtomContainers in the queue).
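
A minimal sketch of a bounded queue, using only java.util.concurrent: passing a
capacity to the LinkedBlockingQueue constructor makes put() block once the
queue is full, so the reader cannot run arbitrarily far ahead of the matcher.
The class name, the end marker, and the use of strings in place of molecules
are assumptions made for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch; strings stand in for the molecules read from the DB. */
final class BoundedQueueSketch {

    private static final String END_MARKER = "__END__";

    /** Produces itemCount items through a queue of the given capacity. */
    static int produceAndConsume(final int itemCount, int capacity) {
        // A capacity argument bounds the queue; without it the limit is
        // Integer.MAX_VALUE.
        final BlockingQueue<String> queue = new LinkedBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < itemCount; i++) {
                    queue.put("mol-" + i); // blocks while the queue is full
                }
                queue.put(END_MARKER);     // tells the consumer to stop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int consumed = 0;
        try {
            // The matching step would run here, once per item taken.
            while (!END_MARKER.equals(queue.take())) {
                consumed++;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        }
        return consumed;
    }
}
```

With a small capacity the database thread simply waits whenever the matcher
falls behind, which caps the number of parsed molecules held in the queue.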

I might be missing something; running a profiler might tell you exactly
where the problem is.

Hope this helps,
Nina

> But this seemed the simplest way to get a list of items.
>
> Regards,
>
> Thomas
>
> Code:
> filter is empty string or " WHERE molid IN(<list of molids>)"
>
>     private int getMoleculesDefault(
>             LinkedBlockingQueue<IMolecule> queue, String filter)
>             throws SQLException {
>
>         String sqlSelect = getSelectMoleculesStatement()
>                 + filter;
>         getLogger().debug(sqlSelect);
>         int counter = 0;
>         Connection connection = getConnection();
>         Statement stmt = null;
>         ResultSet resultSet = null;
>         try {
>             stmt = connection.createStatement();
>             // Set the fetch size on the statement before running the query.
>             stmt.setFetchSize(getFetchSize());
>             resultSet = stmt.executeQuery(sqlSelect);
>             counter = processResultSet(queue, resultSet);
>             return counter;
>         } catch (InterruptedException ex) {
>             getLogger().catching(ex);
>             getLogger().exit(counter);
>             // Restore the interrupt flag so callers can see the interruption.
>             Thread.currentThread().interrupt();
>             return counter;
>         } finally {
>             // Release everything, also when the query itself fails.
>             if (resultSet != null) {
>                 resultSet.close();
>             }
>             if (stmt != null) {
>                 stmt.close();
>             }
>             if (connection != null) {
>                 connection.close();
>             }
>         }
>     }
>
> And:
>
>     protected int processResultSet(LinkedBlockingQueue<IMolecule> queue,
>             ResultSet resultSet)
>             throws SQLException, InterruptedException {
>         int counter = 0;
>         while (resultSet.next()) {
>             Integer id = resultSet.getInt(getMolIdColumnName());
>             InputStream stream =
>                     resultSet.getAsciiStream(getStructureColumnName());
>             try {
>
>                 MDLV2000Reader molReader = new MDLV2000Reader(stream);
>                 Molecule mol = (Molecule) molReader.read(
>                         (ChemObject) new Molecule());
>
>                 AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(mol);
>                 CDKHueckelAromaticityDetector.detectAromaticity(mol);
>                 mol.setID(id.toString());
>                 getLogger().trace(mol.getID());
>                 queue.put(mol);
>                 counter++;
>             } catch (CDKException cdkEx) {
>                 getLogger().catching(cdkEx);
>             }
>         }
>         return counter;
>     }
>
>
> ------------------------------
> Date: Sat, 4 Dec 2010 16:04:49 +0200
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> From: jeliazkova.n...@gmail.com
> To: beginn...@hotmail.de
> CC: j.kerssemak...@cmbi.ru.nl; cdk-user@lists.sourceforge.net
>
>
> Hi Thomas,
>
> On 4 December 2010 15:47, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>  Hi all,
>
> first some questions:
> Can I set the aromaticity of a Molecule manually? There is a setFlag(int
> flag_type) method, but I did not find a list of flag_types.
> That would avoid having to perceive it each time a molecule is
> created, and if I look at the Isomorphism class, this flag has an effect on
> subgraph matching. True?
>
>
> Yes, the flag is CDKConstants.ISAROMATIC. These flags are set by the
> Isomorphism class, but can be set manually as well, for example by reading
> precalculated atom and bond aromatic flags from the database (I can confirm
> this increases performance).
>
>
>
> Observations:
> My search currently seems limited by data access and not the graph matching
> itself. I was able to solve the blocking/freezing issue (it had nothing to
> do with threading, just a logical error elsewhere), and using VisualVM I can
> see that the thread doing the database access is the most active one
> (running 100% of the time). The graph matching thread is running only 20% of
> the time.
> Using a BlockingQueue has benefits, but not for memory usage in this case. I
> can set its size to 1 or 100 without an effect on memory consumption. As
> Jules indicated, this is probably due to delayed garbage collection. In
> other words, when doing a search with lots of hits you will always have a
> lot of IAtomContainers in memory, regardless of your algorithm.
>
>
> This seems to be specific to your implementation, so I am not sure what to
> say without seeing the code. We have achieved quite reasonable performance
> by storing molecules as MOL files in MySQL. I assume you are using a
> database connection pool? 100% utilization of the database access thread may
> also mean a non-optimized SQL query.
>
> On another note, one thread to read from the DB and another thread to run
> CDK code is not a very good idea (at least currently), since there are a lot
> of CDK classes that are not thread-safe.
>
>
>
> (this probably also explains why I see no benefit in using the Isomorphism
> class from cdk-1.3.7 over UniversalIsomorphismTester)
>
> The conclusion is to either have more threads reading from the database or
> to serialize all molecules to the database.
>
>
> Java class serialization has the drawback that the serialized form usually
> changes with slight changes to the class implementation, making it
> impossible to read the molecules in the database if the underlying library
> changes.
>
> Best regards,
> Nina
>
>
> Best Regards,
>
>
> Thomas
>
> ------------------------------
> From: j.kerssemak...@cmbi.ru.nl
> Date: Thu, 2 Dec 2010 15:44:55 +0100
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> To: beginn...@hotmail.de
> CC: cdk-user@lists.sourceforge.net
>
> Hello Thomas (and other CDK-users of course)
>
> - The aromaticity-flag could very well be in the fingerprint, I've never
> really checked that to be honest. Does anyone else know?
> The original reasoning for putting it in was that the aromaticity-detection
> is 'expensive' if you have to do it each time for a query, but it still is a
> pretty good distinction (in a general metabolite database in any case, if
> your database is 99% aromatic anyway, don't even bother ;-) ). We can
> eliminate about half our dataset by that flag so it works pretty well
>
> - The CDKHueckelAromaticityDetector does indeed modify the atomcontainer,
> setting the ISAROMATIC flags for aromatic molecules. I don't know if it does
> anything for the bond orders though..
>
> - A serialized object is a native Java object, which can be directly
> unpacked into memory (fast). A molfile needs to be parsed by the molreader,
> which will always be slower. Reading from the database or from disk will
> probably not matter that much in this case. Varchars are stored in different
> data blocks than the rest of the row in (I think) all database systems, so
> the database is unlikely to have these cached and will therefore still need
> to read them from disk (though from a database file rather than a normal
> directory; the performance will be about equal).
> Of course, this doesn't apply if you have your complete database in memory,
> which hsql does, I believe. Reading from memory is always way faster than
> reading from disk.
>
> - If you have already optimised your method to only load the atomcontainer
> inside the for-loop, you will only ever have one IAtomContainer per thread
> in memory(*) per running loop. If you limit the number of queries running at
> the same time, this should solve your problem. I'm horrible at threaded
> programming, and from what I understand, so is every other human being, so
> my guess is that the freezes/blocks you see stem from the threaded part, not
> from the memory-overrun part.
>
> (*): not strictly true, the default Java-servers only start removing things
> from memory once memory becomes scarce, so there will probably be a few
> atomcontainers from previous loop iterations lying around waiting to be
> tossed away by the garbage collector.
>
> - The list of often-occurring fragments will probably not do you much good.
> Such a list is probably going to be bigger than your whole database after a
> few days/weeks of user interactions.
> The cleaner, most often used method is (as you also suggest) to tell the
> user "This query structure generated too many preliminary hits for this
> server to handle, please be more specific."
>
> You're welcome :-)
> Best regards,
> Jules
>
> On 1 December 2010 17:54, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>  Hi Jules,
>
> thank you for these detailed explanations. I have some additional questions:
>
> - Aromaticity flag: so you store a boolean value yes/no? What's the
> advantage of this? Or, put differently, should that not already be covered
> by the fingerprint?
>
> - CDKHueckelAromaticityDetector: does this modify the passed-in
> IAtomContainer or does it just return true/false? (E.g. in the
> smsd.Isomorphism.init method, set the flag for cleanAndConfigureMolecule to
> false if it was already done after reading the molecule?)
>
> - About loading from molfiles: is it much slower than loading a serialized
> molecule? Assuming the molfile is in the database.
>
> I'm already working on a method for putting the molecules in a
> BlockingQueue, i.e. having one thread read from the database and 1 (or more)
> others doing the subgraph matching. That way I could limit the number of
> IAtomContainers in memory. Memory usage is reduced dramatically, but I still
> have issues (the test freezes/blocks and does not return).
>
> I agree that I need some additional steps, but the benzene example
> mentioned remains. If the query fragment is very common, any additional
> steps won't help because there actually are that many structures that match.
> My idea is to add a table with such fragments and a table with molecules
> that contain each fragment. The fragment can be stored as canonical SMILES
> or InChIKey (or another canonical form). The first step could then be to
> check if the query is such a fragment and select all the matching IDs
> (a SELECT with 1 join), so subgraph matching can be avoided. This could be
> enhanced by automatically adding query fragments that return too many hits
> ("the system learns").
> It could also be done the other way round: do the fingerprint screen first
> and only do this "fragment" screen if there are too many hits after
> fingerprinting.
> A simpler approach would be to always return only a defined number of
> hits, e.g. max 500, and ask the user to define the query structure more
> clearly.
>
> Thanks for your help,
>
> Thomas
>
>
> ------------------------------
> From: j.kerssemak...@cmbi.ru.nl
> Date: Wed, 1 Dec 2010 12:10:28 +0100
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> To: beginn...@hotmail.de
> CC: cdk-user@lists.sourceforge.net
>
>
> Hello All,
>
> We've been using the CDK substructure search for a while now too in our
> biometa database (
> http://cheminf.cmbi.ru.nl/cgi-bin/biometa/biometa.py?molecules%20jme).
> Here is what we do to keep things manageable performance-wise:
>
> * Pre-calculate several statistics for all entries, namely:
>  - fingerprints
>  - an aromaticity flag
>  - number of different elements
>  - number of rings (either aromatic or non-aromatic)
>  - total number of atoms
>  - total number of atoms per element (in a separate table of (molecule_id,
> element_number, element_count))
> * when querying, we calculate the same properties for the search-molecule
> and then write a pretty long SQL-query that limits the results as much as
> possible as cheaply as possible:
>  - first condition is the aromatic/non-aromatic flag (single flag
> comparison --> cheapest you'll ever find)
>  - next condition is element-count, (a simple, cheap numerical
> '>='-comparison)
>  - then ringcount >=
>  - then the fingerprint comparison. We let the database do the logical AND
> and ==, because postgres has native bit-array operations, which our
> python-binding (pgsql) can't handle because it doesn't understand bit-arrays
> (it converts them to strings).
>  - finally a per-element atom-count comparison. This is last because in the
> current set-up it has to join the element-count table to the molecule table,
> which is probably slower.
> * The mol-files for the resultant mol_id's are then concatenated into a
> large sdf-file which is fed through a CDK SDFSubstructureFinder from a very
> old 2006 SVN version (dropped since, but I dare not replace it because
> everything will probably implode if I try).
>
> It's all a bit hack-ish, but it works fairly well since we don't have much
> traffic. I never actually tested how the postgres query planner handles
> this, but it is fairly smart.
>
> Things that I would change if I rewrote it all today would be:
> 1) store serialized IAtomContainers in the database, to prevent having to
> re-read the molfiles from disk every time
> 2) Use a faster substructure matcher (SMSD sounds good)
> 3) get rid of the element-count table, and rather add columns for our
> most-prevalent atom-types to the molecule rows (namely: C,N,O,H,P). This
> avoids the overhead of the join.
>
>
> Thomas, as Nina already mentioned, you shouldn't load all the molecules
> before the loop, rather load them one-by-one IN the loop. This means you
> only ever need one molecule in-memory per query, which saves MASSIVELY on
> memory requirements.
>
>
> // Pseudocode example of what you have:
> keyList = doTheSearch(); // get molecule IDs for potential candidates
> // This loads ALL your molecules into memory, OUCH!!
> map<key, molecule> theMap = manager.getMolecules(keyList);
> for (key in theMap.keyset) {
>   mol = theMap.get(key);
>   if (searchtarget.isSubgraphOf(mol)) {
>     results.add(mol)
>   }
> }
>
> // Pseudocode for a more efficient way to do it (don't pre-load all
> // molecules):
> keyList = doTheSearch(); // get molecule IDs for potential candidates
> for (key in keyList) {
>   // IMPROVEMENT: only load one molecule at a time!
>   mol = manager.getMolecule(key);
>   if (searchtarget.isSubgraphOf(mol)) {
>     results.add(mol)
>   }
> }
>
> Hope this helps,
> Best regards,
> Jules Kerssemakers
>
> On 30 November 2010 21:19, Nina Jeliazkova <jeliazkova.n...@gmail.com> wrote:
>
> Hi Thomas,
>
> On 30 November 2010 21:58, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>  Hi Nina,
>
> I certainly have more than 1 IAtomContainer in memory at a time, so I agree
> that might be an issue, but if screening, let's say, returns 1000 hits, 1000
> subgraph matches must be done and hence all 1000 molecules must be created
> first. So you would suggest reading each one separately from the database
> after a subgraph match returns?
>
>
> What we are doing is getting database structure identifiers from
> prescreening and reading structures one by one for subgraph matching. A few
> thousand IAtomContainers are fine for a desktop application, but on the
> server side one could have multiple queries at the same time, which
> multiplies the thousands to an unreasonable number.
>
>
> A second issue: if the query molecule is a common fragment in the
> database, let's assume benzene, and like 80% of the fingerprints match, how
> do you handle that and keep performance? Subgraph matches on so many
> structures will not perform well. How can you prevent that with very common
> substructures?
>
>
> We have several levels of prescreening; fingerprints alone are not
> sufficient for reasonable performance. We also use precalculated aromaticity
> flags, to avoid calculating them on the fly, and caching of the final
> results. You could get an overview from this poster from QSAR2010
> http://www.ideaconsult.net/downloads/rhodes/posters/SMARTS.pdf .
>
> Regards,
> Nina
>
>
>
> Regards,
>
> Thomas
>
>
>
> Just my two cents.
>
> Besides prescreening, keeping a minimum of IAtomContainer objects in memory
> is the key to performance. As less than one object doesn't make sense :) one
> IAtomContainer at a time is the best. Fingerprints can be pre-calculated
> and need not be loaded into memory at all; let SQL do the prescreening.
>
> We've been doing similar things (CDK, relational database, no cartridges)
> in ambit (ambit.sourceforge.net) for quite a few years already. There is a
> downloadable standalone application and a servlet container application war
> file (to run your own service), as well as running OpenTox REST services
> for substructure searching, e.g.
>
>
> https://ambit.uni-plovdiv.bg:8443/ambit2/query/smarts?search=c1ccccc1[Cl,Br,F]
>
>
> http://apps.ideaconsult.net:8080/ambit2/query/smarts?search=c1ccccc1[Cl,Br,F,I]
>
> Regards,
> Nina
>
>
> Regards,
>
> Thomas
>
>
> ------------------------------------------------------------------------------
> Increase Visibility of Your 3D Game App & Earn a Chance To Win $500!
> Tap into the largest installed PC base & get more eyes on your game by
> optimizing for Intel(R) Graphics Technology. Get started today with the
> Intel(R) Software Partner Program. Five $500 cash prizes are up for grabs.
> http://p.sf.net/sfu/intelisp-dev2dev
> _______________________________________________
> Cdk-user mailing list
> Cdk-user@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/cdk-user
>
>
>
>
