My girlfriend is a chemist too, but that's as close as I get. :-)

Regards
Paul

Paul Grenyer
email: [EMAIL PROTECTED]
web: http://www.paulgrenyer.co.uk
articles: http://www.paulgrenyer.dyndns.org/articles/

Jensen should have gone to Williams.
Ecclestone is killing the sport.

----- Original Message ----- 
From: "Juan Carlos" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, October 26, 2004 11:14 PM
Subject: Re: [msvc] Management of large vectors


Hi,

It is great to see I'm not the only chemist on the list :-) Many thanks for
your comments, Daniel; as always, they are very helpful.

The high number of molecules (mostly proteins) is the result of molecular
dynamics calculations. The latest dynamics we have run produced about 28,000
conformers. Our application (which is being written not by me but by a
colleague of mine) computes NMR properties such as NOE effects,
cross-correlation, 3J scalar couplings and residual dipolar coupling
constants.



The molecules are in PDB format (and sometimes in MOL format) and, if I
remember correctly (I'll have to check this with my colleague tomorrow),
they are saved in memory as a class object that contains connectivity
information. Something like this:



struct bond
{
    int index;
    bondtype order;
};

struct Atom
{
    coordinate c;
    std::vector<bond> conexion;
};

struct CMolecule
{
    std::vector<Atom> m_atom;
};



In the program we have classes rather than structs. So the program reads the
PDB files (normally a single PDB file with all the conformers embedded in
the same file) and creates the list of molecules. Some calculations have to
be applied to all the conformers, whilst others can be applied to single
conformers (e.g. the currently displayed one). We also need to allow the
user to see all the conformers displayed one over the other.

I have no experience with SMILES, but I have some papers (Journal of Chemical
Information & Computer Science) with a description of the format. I hadn't
heard of CORINA before; thanks for this information.

I will study your suggestions more carefully to see which one best fits our
requirements.

Thanks again,

Carlos

  ----- Original Message ----- 
  From: Daniel Robinson
  To: [EMAIL PROTECTED]
  Sent: Tuesday, October 26, 2004 10:43 PM
  Subject: RE: [msvc] Management of large vectors


  I apologise to other members of the list if this post goes a little bit
off topic...

  Several things immediately come to my mind...

  1) What are you doing with this many molecules? This question is asked
somewhat out of professional curiosity, as I'm a computational chemist
(amongst other things), but also because I'm trying to understand the
problem you are tackling.

  2) Are these molecular structures related to one another in any way? For
example, beyond a simple database application, I can't think of many reasons
why you would need to have 20,000+ completely distinct molecules loaded in
memory at one time. Conversely, I could easily see how you might want to
store multiple conformations of the same molecule. The different
conformations would only require you to store the torsion angle (difference)
information for each conformer, which is quite compact. The larger data of
the atom types, charges, and connectivity information only needs to be
stored once. Using this approach you can gain substantial memory savings.

  3) How are these molecules being represented, and is the representation
you are using optimal? Usually, when dealing with large numbers of
molecules, a very compact representation of molecular structure (such as a
SMILES string) is used. Tens of thousands of SMILES strings are trivial to
handle and will easily fit into memory. A SMILES string can be converted to
connectivity information using standard routines (such as those found in the
OpenBabel library), or even into a reasonable 3D structure quite quickly by
CORINA.

  4) How is your vector of molecules being accessed? Beyond databases I
can't think of many applications that would require more than one or two
molecules to be considered at a time, so having 20,000+ loaded
simultaneously seems excessive! In your own proposed solution (reading in
the molecules as a block, using them, freeing them, and reading in the next
block) you seem to be suggesting that the processing is quite linear. If
that is the case, could you not make the block size as small as one molecule
and, instead of getting the CMolecule from the vector, access it straight
out of the file where it is stored? You should always have enough memory for
this!

  5) What file format are your molecules stored in on disk? Although I've
done it myself, leaving your 20,000 molecules in an SD file is not an
efficient way of storing large numbers of structures! Storing them in an
easily serialized binary format is much better. This is related to my next
and final point...

  6) What features of std::vector<> are you using? And what does a CMolecule
look like in memory? Consider the situation where your molecules are stored
on disk in a format that is identical to the format used in memory (similar
to how bitmap data is stored in a BMP file). You can then 'load' your entire
database of 20,000+ molecules into memory by simply mapping the file into
your application's address space. In principle this gives you the potential
to have a database of up to 2GB apparently 'in memory' without needing 2GB
of physical RAM to back the allocation. I believe that this is the
methodology used by the Chemical Computing Group's MOE package, which I've
used to store and manipulate databases of over 3,000,000 (small) molecules
with absolutely no problems whatsoever. Of course, if you really need
std::vector<> functionality this approach becomes a lot more involved...

  Anyway, those were just some thoughts...

  Kind regards

  Daniel


  -----Original Message-----
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Juan Carlos
  Sent: 26 October 2004 20:18
  To: [EMAIL PROTECTED]
  Subject: [msvc] Management of large vectors


    Hi,



    I'm writing an application that performs complex computations on
biomolecules. The application needs to read a number of different structures
(= molecules) from disk and then perform the necessary calculations. These
molecules are held in an STL vector, for example:

    std::vector<CMolecule> m_vMolecules;



    This works fine as long as the vector doesn't exceed the available RAM.
Unfortunately, this is not the general case, because the number of molecules
to be read is typically higher than 20,000, and so, when the RAM is not
enough, the application's efficiency decreases dramatically. Perhaps I could
try to read only as many molecules as fit in memory and, when the rest of
the molecules are needed, release the first chunk and read a new one. Is
this a good approach? If so, any help on the best way to implement this
scheme?



    Thanking in advance,

    Carlos



------------------------------------------------------------------------------


  _______________________________________________
  msvc mailing list
  [EMAIL PROTECTED]
  See http://beginthread.com/mailman/listinfo/msvc_beginthread.com for
subscription changes, and list archive.




