Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 Isomorphism Class

Nina Jeliazkova Tue, 14 Dec 2010 11:51:39 -0800

On 7 December 2010 22:23, Thomas Strunz <beginn...@hotmail.de> wrote:


>  Hi all,
>
> had some time again this evening and experimented a bit with ehcache (
> www.ehcache.org). Got it working pretty quickly and did some "tests". In
> general, the second run, after the cache is populated, is about 10 times
> faster using UIT (UniversalIsomorphismTester). Did not test with the new
> Isomorphism class yet. Of course having all Molecules in the cache is pretty
> bad for memory consumption, but it confirms that data access and not
> searching is the limiting part.
>
> And I realized that it need not be the DB that is slow but the actual
> creation of IAtomContainers from Molfiles, which happens in the same thread.
>

That's what I was thinking as well :)
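
The memory concern with caching every molecule can be limited by capping the cache size. The following is only a minimal sketch using the JDK's LinkedHashMap as a stand-in for a real cache library such as ehcache; the class name BoundedMoleculeCache and the String values (placeholders for parsed molecules) are assumptions for illustration, not part of the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Size-bounded LRU cache; a JDK-only stand-in for a cache library (hypothetical name). */
final class BoundedMoleculeCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedMoleculeCache(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }
}

final class BoundedMoleculeCacheDemo {
    public static void main(String[] args) {
        BoundedMoleculeCache<Integer, String> cache = new BoundedMoleculeCache<>(2);
        cache.put(1, "mol-1");
        cache.put(2, "mol-2");
        cache.get(1);          // touch entry 1 so that entry 2 becomes the eldest
        cache.put(3, "mol-3"); // exceeds the limit, entry 2 is evicted
        System.out.println(cache.keySet()); // prints [1, 3]
    }
}
```

With such a limit, only the most recently used molecules stay in memory; whether the hit rate remains high enough to keep the speed-up would have to be measured.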


> Maybe I should try multiple threads that create the molecules from the
> molfiles.
>
> Or I could convert them to SMILES and try it out then. Maybe SmilesParser is
> faster.
>
>
Not sure about the SMILES parser being faster.
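
The idea of creating molecules on several threads could be sketched roughly as below. This is an illustration under assumptions, not tested against CDK: the parser argument is a hypothetical placeholder for the real molfile parsing step, and ParallelMolfileParsing and parseAll are made-up names. Since many CDK classes are not thread-safe, each task would need its own reader instance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelMolfileParsing {

    /** Parses every molfile on a fixed-size thread pool; results keep the input order. */
    static <T> List<T> parseAll(List<String> molfiles, Function<String, T> parser, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<T>> tasks = new ArrayList<>();
            for (String molfile : molfiles) {
                tasks.add(() -> parser.apply(molfile));
            }
            List<T> parsed = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<T> future : pool.invokeAll(tasks)) {
                parsed.add(future.get());
            }
            return parsed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // String::length is only a placeholder for the expensive parse step.
        List<Integer> sizes = parseAll(Arrays.asList("C", "CC", "CCO"), String::length, 2);
        System.out.println(sizes); // prints [1, 2, 3]
    }
}
```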


> Of course just limiting results would be easiest, but then where is the fun
> in that?  ;)
>
>
Well, optimizing the CDK code for IAtomContainer creation and aromaticity
detection could be a lot of fun :)

Regards,
Nina


> Regards,
>
> Thomas
>
> ------------------------------
> Date: Mon, 6 Dec 2010 19:21:42 +0200
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> From: jeliazkova.n...@gmail.com
> To: beginn...@hotmail.de
> CC: j.kerssemak...@cmbi.ru.nl; cdk-user@lists.sourceforge.net
>
> Hi Thomas,
>
> On 6 December 2010 18:55, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>  Hi Nina,
>
> I adjusted the method that gets the molecule. Each molecule is now selected
> separately with a separate SQL statement. Since the ResultSet is now much
> smaller, memory consumption went down from about 800 MB to 60 MB. So that
> helped a lot. Thanks!
>
>
> Great!
>
>
> The downside is that performance is now about 15% slower. So this confirms
> once more that the I/O is the limiting operation and not the subgraph
> matching.
>
> I had a look at the commercial version and what it saves to a Derby DB for
> the desktop version. Nothing special at all. I also read on their homepage
> that they claim to keep everything (fingerprints and molecules) in a cache,
> i.e. in memory.
> Probably that's the way to go: keep more stuff in memory.
>
>
> Yes, only that objects in memory should be as small as possible,
> unfortunately atom containers are too heavy currently.
>
> I don't have experience with Java databases, but MySQL has plenty of options
> that can be tuned and considerably change the performance; maybe it is
> similar with Derby / HSQLDB.
>
> Best regards,
> Nina
>
>
> Thanks all for your help
>
> Regards,
>
> Thomas
>
> ------------------------------
> Date: Sun, 5 Dec 2010 10:16:42 +0200
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> From: jeliazkova.n...@gmail.com
> To: beginn...@hotmail.de
> CC: j.kerssemak...@cmbi.ru.nl; cdk-user@lists.sourceforge.net
>
>
>
> On 4 December 2010 21:10, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>  Hi Nina,
>
> you are right. If you use these Java-based RDBMS like HSQLDB, the complete
> result set is loaded into memory. I think the same would be true for MySQL
> but not for Oracle or SQL Server.
>
>
> I think for MySQL it depends on the type of the cursor (forward-only) and
> various other settings.
>
>
> I read it in the HSQLDB manual: HSQLDB does not have a "server side cursor". I
> tried this by using the LIMIT clause; it works but performs very poorly. I
> will try it out, but doing thousands of SELECT statements that each return
> 1 result does not seem very nice at first thought, though it will probably
> perform better.
>
>
> It does (with MySQL at least).
>
>
>
> I limit the LinkedBlockingQueue so it does not have an unlimited capacity.
> But setting the limit to 1, 100 or anything in between leads to about the
> same result; unlimited leads to even higher memory consumption.
> For a query that returns 30k hits, it consumes about 800 MB with a limit on
> the BlockingQueue and 2 GB without (-Xmx2048m, so it can't use more than 2 GB),
> using HSQLDB as a server, which consumes another 1.5 GB...
>
>
> What about dropping the queue and doing the isomorphism test in the same loop,
> reusing just a single atom container object?
>
> Without any experience with HSQLDB, it's hard to guess further,  it is not
> clear if the bottleneck is the database access, or something else.
>
> Best regards,
> Nina
>
>
>
> Good evening,
>
> Regards,
>
> Thomas
>
>
>
>
> ------------------------------
> Date: Sat, 4 Dec 2010 17:18:20 +0200
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> From: jeliazkova.n...@gmail.com
> To: beginn...@hotmail.de
> CC: j.kerssemak...@cmbi.ru.nl; cdk-user@lists.sourceforge.net
>
> Hi,
>
> On 4 December 2010 16:41, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>  Hi Nina,
>
> thanks for your fast response. I will probably start storing certain
> properties in the DB because, as you mentioned, serialization has its
> drawbacks. I would of course keep the MolFile, but then they must be kept
> in sync by the application.
> About thread-safety: I don't do anything special, I just use threads to put
> or take items out of a BlockingQueue, so that if, for example, screening
> finds a hit I can instantly get it from the DB and start subgraph matching
> while screening continues.
>
> See code at end of message. The issue might be that I use "IN()" in my
> statements. So a usual statement will look like:
>
> SELECT molid, molecule FROM moltable WHERE molid IN(<List of molids>)
>
> Maybe the IN clause with a large list is bad? Or creating the statement (I
> do use a StringBuilder for it).
>
>
> "IN" might be bad indeed, here SQL EXPLAIN  could help.
>
> From my experience, loading all molecule fields with a single SQL query is
> not good for performance. Since all the blobs are actually loaded in memory
> by the ResultSet, it will lead to out-of-memory errors with a sufficiently
> large number of compounds.
>
> Also, LinkedBlockingQueue has in practice infinite length (the capacity, if
> unspecified, is equal to Integer.MAX_VALUE), which means that if the
> isomorphism procedure is not fast enough to empty the queue, you will have a
> lot of molecules twice in memory (once as blobs from the ResultSet and once as
> IAtomContainers in the queue).
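
The queue-capacity point can be illustrated with a small producer/consumer sketch. It assumes String items as placeholders for molecules and a hypothetical end-of-stream marker; BoundedQueueDemo and pipe are made-up names, not code from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class BoundedQueueDemo {
    // Hypothetical end-of-stream marker telling the consumer to stop.
    private static final String POISON = "__END__";

    /** Pushes items through a queue of the given capacity; returns what the consumer saw. */
    static List<String> pipe(List<String> items, int capacity) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(capacity);
        List<String> consumed = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                for (String s = queue.take(); !POISON.equals(s); s = queue.take()) {
                    consumed.add(s); // placeholder for the subgraph match
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {
            for (String item : items) {
                queue.put(item); // blocks while the queue is full
            }
            queue.put(POISON);
            consumer.join(); // join makes the consumer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return consumed;
    }

    public static void main(String[] args) {
        // Without an explicit capacity the queue is effectively unbounded.
        System.out.println(new LinkedBlockingQueue<String>().remainingCapacity()
                == Integer.MAX_VALUE); // prints true
        System.out.println(pipe(Arrays.asList("mol-1", "mol-2", "mol-3"), 1));
    }
}
```

With a capacity of 1 the producer can never run more than one item ahead of the consumer, so the queue itself holds at most one parsed molecule; anything the ResultSet has already buffered is unaffected by this limit.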
>
> I might be missing something, running a profiler might tell you exactly
> where the problem is.
>
> Hope this helps,
> Nina
>
> But this seemed the simplest way to get a list of items.
>
> Regards,
>
> Thomas
>
> Code:
> filter is empty string or " WHERE molid IN(<list of molids>)"
>
>     private int getMoleculesDefault(
>             LinkedBlockingQueue<IMolecule> queue, String filter)
>             throws SQLException {
>
>         String sqlSelect = getSelectMoleculesStatement()
>                 + filter;
>         getLogger().debug(sqlSelect);
>         int counter = 0;
>         Connection connection = getConnection();
>         try {
>             // Create the statement inside the try block so the connection
>             // is closed even if statement creation or the query fails.
>             Statement stmt = connection.createStatement();
>             // Set the fetch-size hint on the statement before executing.
>             stmt.setFetchSize(getFetchSize());
>             ResultSet resultSet = stmt.executeQuery(sqlSelect);
>             counter = processResultSet(queue, resultSet);
>             return counter;
>         } catch (InterruptedException ex) {
>             getLogger().catching(ex);
>             getLogger().exit(counter);
>             // Preserve the interrupt status for callers.
>             Thread.currentThread().interrupt();
>             return counter;
>         } finally {
>             if (connection != null) {
>                 connection.close();
>             }
>         }
>     }
>
> And:
>
>     protected int processResultSet(LinkedBlockingQueue<IMolecule> queue,
>             ResultSet resultSet)
>             throws SQLException, InterruptedException {
>         int counter = 0;
>         while (resultSet.next()) {
>             Integer id = resultSet.getInt(getMolIdColumnName());
>             InputStream stream =
>                     resultSet.getAsciiStream(getStructureColumnName());
>             try {
>
>                 MDLV2000Reader molReader = new MDLV2000Reader(stream);
>                 Molecule mol = (Molecule) molReader.read(
>                         (ChemObject) new Molecule());
>
>                 AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(mol);
>                 CDKHueckelAromaticityDetector.detectAromaticity(mol);
>                 mol.setID(id.toString());
>                 getLogger().trace(mol.getID());
>                 queue.put(mol);
>                 counter++;
>             } catch (CDKException cdkEx) {
>                 getLogger().catching(cdkEx);
>             }
>         }
>         return counter;
>     }
>
>
> ------------------------------
> Date: Sat, 4 Dec 2010 16:04:49 +0200
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> From: jeliazkova.n...@gmail.com
> To: beginn...@hotmail.de
> CC: j.kerssemak...@cmbi.ru.nl; cdk-user@lists.sourceforge.net
>
>
> Hi Thomas,
>
> On 4 December 2010 15:47, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>  Hi all,
>
> first some questions:
> Can I set the aromaticity of a Molecule manually? There is a setFlag(int
> flag_type) method, but I did not find a list of flag_types.
> That would avoid having to perceive it each time a molecule is
> created and, if I look at the Isomorphism class, this flag has an effect on
> subgraph matching. True?
>
>
> Yes, the flag is CDKConstants.ISAROMATIC. These are set by the Isomorphism
> class, but can be set manually as well, for example by reading precalculated
> atom and bond aromatic flags from the database (I can confirm this increases
> the performance).
>
>
>
> Observations:
> My search currently seems limited by data access and not the graph matching
> itself. I was able to solve the blocking/freezing issue (it had nothing to do
> with threading, just a logical error elsewhere) and using VisualVM I can see
> that the thread doing the database access is the most active one (running
> 100% of the time). The graph matching thread is running only 20% of the
> time.
> Using a BlockingQueue has benefits, but not on memory usage in this case. I
> can set its size to 1 or 100 without an effect on memory consumption. As
> Jules indicated, this is probably due to delayed garbage collection. Or, put
> differently, when doing a search with lots of hits you will always have a lot
> of IAtomContainers in memory, regardless of your algorithm.
>
>
> This seems to be specific to your implementation, so I am not sure what to
> say without seeing the code. We have achieved quite reasonable performance
> by storing molecules as MOL files in MySQL. I assume you are using a database
> connection pool? 100% utilization of the database access thread may also
> mean a non-optimized SQL query.
>
> On another note, one thread to read from the DB and another thread to run CDK
> code is not a very good idea (at least currently), since there are a lot of
> CDK classes which are not thread-safe.
>
>
>
> (this probably also explains why I see no benefit in using the Isomorphism
> class from cdk-1.3.7 over UniversalIsomorphismTester)
>
> The conclusion is to either have more threads reading from the database or
> to serialize all molecules to the database.
>
>
> Java class serialization has the drawback that the serialized form usually
> changes with slight changes of the class implementation, making it
> impossible to read the molecules in the database if the underlying library
> changes.
>
> Best regards,
> Nina
>
>
> Best Regards,
>
>
> Thomas
>
> ------------------------------
> From: j.kerssemak...@cmbi.ru.nl
> Date: Thu, 2 Dec 2010 15:44:55 +0100
>
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> To: beginn...@hotmail.de
> CC: cdk-user@lists.sourceforge.net
>
> Hello Thomas (and other CDK-users of course)
>
> - The aromaticity-flag could very well be in the fingerprint, I've never
> really checked that to be honest. Does anyone else know?
> The original reasoning for putting it in was that the aromaticity-detection
> is 'expensive' if you have to do it each time for a query, but it still is a
> pretty good distinction (in a general metabolite database in any case, if
> your database is 99% aromatic anyway, don't even bother ;-) ). We can
> eliminate about half our dataset by that flag so it works pretty well
>
> - The CDKHueckelAromaticityDetector does indeed modify the atomcontainer,
> setting the ISAROMATIC flags for aromatic molecules. I don't know if it does
> anything for the bond orders though..
>
> - A serialized object is a native java object, which can be directly
> unpacked into memory (fast). A molfile needs to be parsed by the molreader,
> which will always be slower. Reading from database or from disk will
> probably not matter that much in this case. varchars are stored in different
> data-blocks than the rest of the row for (I think) all database systems, so
> the database is unlikely to have these cached and therefore will still need
> to read them from disk (though from a database file rather than a normal
> directory. The performance will be about equal)
> Of course, this doesn't apply if you have your complete database in memory,
> which hsql does, I believe. Reading from memory is always way faster than
> reading from disk.
>
> - If you have optimised your method already to only load the atomcontainer
> inside the for-loop, you will always only have one IAtomcontainer per thread
> in memory(*) per running loop. If you limit the amount of queries running at
> the same time, this should solve your problem. I'm horrible at threaded
> programming, and from what I understand, so is every other human being, so
> my guess is that the freezes/blocks you see stem from the threaded part, not
> from the memory-overrun part.
>
> (*): not strictly true, the default Java-servers only start removing things
> from memory once memory becomes scarce, so there will probably be a few
> atomcontainers from previous loop iterations lying around waiting to be
> tossed away by the garbage collector.
>
> - The list of often-occurring fragments will probably not do you much good.
> Such a list is probably going to be bigger than your whole database after a
> few days/weeks of user interactions.
> The cleaner, most often used method is (as you also suggest) to tell the
> user "This query structure generated too many preliminary hits for this
> server to handle, please be more specific."
>
> You're welcome :-)
> Best regards,
> Jules
>
> On 1 December 2010 17:54, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>  Hi Jules,
>
> thank you for these detailed explanations. I have some additional questions:
>
> - Aromaticity flag: so you store a boolean value yes/No? What's the
> advantage of this? or otherwise said should that not already be covered by
> the fingerprint?
>
> - CDKHueckelAromaticityDetector: does this modify the passed-in
> IAtomContainer or just return true/false? (E.g. in the smsd.Isomorphism.init
> method, can I set the flag for cleanAndConfigureMolecule to false if it was
> already done after reading the molecule?)
>
> - About loading from molfiles: is it much slower than loading a serialized
> molecule? Assuming the molfile is in the database.
>
> I'm already working on a method for putting the molecules in a
> BlockingQueue, i.e. have one thread read from the database and 1 (or more)
> others doing the subgraph matching. That way I could limit the number of
> IAtomContainers in memory. Memory usage is reduced dramatically, but I still
> have issues (the test freezes/blocks and does not return).
>
> I agree that I need some additional steps, but the benzene example
> mentioned remains. If the query fragment is very common, any additional
> steps won't help because there actually are so many structures that match.
> My idea is to add a table with such fragments and a table with Molecules
> that contain that fragment. The fragment can be stored as canonical SMILES
> or InChIKey (or another canonical form). The first step could then be to
> check if the query is such a fragment and select all the matching IDs
> (SELECT with 1 join). So subgraph matching can be avoided. This could be
> enhanced by automatically adding query fragments that return too many hits
> ("the system learns").
> It could also be done the other way round, i.e. do the fingerprint screen
> first and only do this "fragment" screen if there are too many hits after
> fingerprinting.
> A simpler approach would be to always return only a defined number of
> hits, like max 500, and ask the user to define the query structure more
> clearly.
>
> Thanks for your help,
>
> Thomas
>
>
> ------------------------------
> From: j.kerssemak...@cmbi.ru.nl
> Date: Wed, 1 Dec 2010 12:10:28 +0100
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
> To: beginn...@hotmail.de
> CC: cdk-user@lists.sourceforge.net
>
>
> Hello All,
>
> We've been using the CDK substructure search for a while now too in our
> biometa database (
> http://cheminf.cmbi.ru.nl/cgi-bin/biometa/biometa.py?molecules%20jme).
> Here is what we do to keep things manageable performance-wise:
>
> * Pre-calculate several statistics for all entries, namely:
>  - fingerprints
>  - an aromaticity flag
>  - number of different elements
>  - number of rings (either aromatic or non-aromatic)
>  - total number of atoms
>  - total number of atoms per element (in a separate table of (molecule_id,
> element_number, element_count))
> * when querying, we calculate the same properties for the search-molecule
> and then write a pretty long SQL-query that limits the results as much as
> possible as cheaply as possible:
>  - first condition is the aromatic/non-aromatic flag (single flag
> comparison --> cheapest you'll ever find)
>  - next condition is element-count, (a simple, cheap numerical
> '>='-comparison)
>  - then ringcount >=
>  - then the fingerprint comparison. We let the database do the logical AND
> and ==, because postgres has native bit-array operations, which our
> python-binding (pgsql) can't handle because it doesn't understand bit-arrays
> (it converts them to strings).
>  - finally a per-element atom-count comparison. This is last because in the
> current set-up it has to join the element-count table to the molecule table,
> which is probably slower.
> * The mol-files for the resultant mol_id's are then concatenated into a
> large sdf-file which is fed through a CDK SDFSubstructureFinder from a very
> old 2006 SVN version (dropped since, but I dare not replace it because
> everything will probably implode if I try).
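
The fingerprint condition in the list above is a bitwise subset test: a candidate can only contain the query if (candidate AND query) == query. In the described setup this runs inside postgres; the sketch below mirrors the same test in Java with java.util.BitSet purely as an illustration. FingerprintScreen and mayContain are hypothetical names, and the bit positions are made-up data.

```java
import java.util.BitSet;

final class FingerprintScreen {

    /** True if every bit set in the query fingerprint is also set in the candidate. */
    static boolean mayContain(BitSet candidate, BitSet query) {
        BitSet common = (BitSet) query.clone(); // do not modify the caller's fingerprint
        common.and(candidate);                  // candidate AND query
        return common.equals(query);            // ... == query
    }

    public static void main(String[] args) {
        BitSet query = new BitSet();
        query.set(1);
        query.set(5);

        BitSet candidate = new BitSet();
        candidate.set(1);
        candidate.set(3);
        candidate.set(5);

        System.out.println(mayContain(candidate, query)); // prints true
        candidate.clear(5);
        System.out.println(mayContain(candidate, query)); // prints false
    }
}
```

A passing screen only means the candidate may contain the query, so the real subgraph match is still needed for the survivors.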
>
> It's all a bit hack-ish, but it works fairly well since we don't have much
> traffic. I never actually tested how the postgres query planner handles
> this, but it is fairly smart.
>
> Things that I would change if I rewrote it all today would be:
> 1) store serialized IAtomContainers in the database, to prevent having to
> re-read the molfiles from disk every time
> 2) Use a faster substructure matcher (SMSD sounds good)
> 3) get rid of the element-count table, and rather add columns for our
> most-prevalent atom-types to the molecule rows (namely: C,N,O,H,P). This
> avoids the overhead of the join.
>
>
> Thomas, as Nina already mentioned, you shouldn't load all the molecules
> before the loop, rather load them one-by-one IN the loop. This means you
> only ever need one molecule in-memory per query, which saves MASSIVELY on
> memory requirements.
>
>
> // Pseudocode example of what you have:
> keyList = doTheSearch(); // get molecule IDs for potential candidates
> // This loads ALL your molecules into memory, OUCH!!
> map<key, molecule> theMap = manager.getMolecules(keyList);
> for (key in theMap.keyset) {
>   mol = theMap.get(key);
>   if (searchtarget.isSubgraphOf(mol)) {
>     results.add(mol)
>   }
> }
>
> // Pseudocode for a more efficient way to do it (don't pre-load all
> // molecules):
> keyList = doTheSearch(); // get molecule IDs for potential candidates
> for (key in keyList) {
>   // IMPROVEMENT: only load one molecule at a time!
>   mol = manager.getMolecule(key);
>   if (searchtarget.isSubgraphOf(mol)) {
>     results.add(mol)
>   }
> }
>
> Hope this helps,
> Best regards,
> Jules Kerssemakers
>
> On 30 November 2010 21:19, Nina Jeliazkova <jeliazkova.n...@gmail.com> wrote:
>
> Hi Thomas,
>
> On 30 November 2010 21:58, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>  Hi Nina,
>
> I sure have more than 1 IAtomContainer in memory at a time, so I agree that
> might be an issue, but if screening, let's say, returns 1000 hits, 1000
> subgraph matches must be done and hence all the 1000 Molecules must be
> created first. So you would suggest reading each one separately from the
> database after a subgraph match returns?
>
>
> What we are doing is getting database structure identifiers from
> prescreening and reading structures one by one for subgraph matching. A few
> thousand IAtomContainers are fine for a desktop application, but server-side
> one could have multiple queries at the same time, multiplying the thousands
> to an unreasonable number.
>
>
> A second issue is, if the query Molecule is a common fragment in the
> database, let's assume benzene, and like 80% of the fingerprints match, how
> do you handle that and keep performance? Subgraph matches on so many
> structures will not perform well. How can you prevent that with very common
> substructures?
>
>
> We have several levels of prescreening; fingerprints alone are not
> sufficient for reasonable performance. We also use precalculated aromaticity
> flags to avoid calculating that on the fly, and caching of the final results.
> You could get an overview from this poster from QSAR2010
> http://www.ideaconsult.net/downloads/rhodes/posters/SMARTS.pdf .
>
> Regards,
> Nina
>
>
