
Hi,

It is great to see I'm not the only chemist on the list :-) Many thanks for your comments, Daniel; as always, they are very helpful.

The high number of molecules (mostly proteins) is the result of molecular dynamics calculations. The latest dynamics run we performed produced about 28,000 conformers. Our application (which is being written not by me but by a colleague of mine) computes NMR properties such as NOE effects, cross-correlation rates, 3J scalar couplings and residual dipolar coupling constants.

 

The molecules are in PDB format (and sometimes in MOL format) and, if I remember correctly (I'll have to check with my colleague tomorrow), they are held in memory as class objects that contain connectivity information. Something like this:

 

struct bond
{
    int index;        // index of the bonded atom
    bondtype order;   // single, double, aromatic, ...
};

struct Atom
{
    coordinate c;
    std::vector<bond> connections;
};

struct CMolecule
{
    std::vector<Atom> m_atoms;
};

 

In the program we have classes rather than structs. The program reads the PDB files (normally a single PDB file with all the conformers embedded in it) and creates the list of molecules. Some calculations have to be applied to all the conformers, whilst others can be applied to a single conformer (e.g. the currently displayed one). We also need to allow the user to see all the conformers superimposed on one another.

I have no experience with SMILES, but I have some papers (Journal of Chemical Information and Computer Sciences) describing the format. I hadn't heard of CORINA before; thanks for this information.

I will study your suggestions more carefully to see which one best fits our requirements.  

Thanks again,

Carlos

----- Original Message -----
Sent: Tuesday, October 26, 2004 10:43 PM
Subject: RE: [msvc] Management of large vectors

I apologise to other members of the list if this post goes a little bit off topic...
 
Several things immediately come to my mind...
 
1) What are you doing with this many molecules? This question is based somewhat out of professional curiosity as I'm a computational chemist (amongst other things), but also as I'm trying to understand the problem you are tackling.
 
2) Are these molecular structures related to one another in any way? For example, beyond a simple database application, I can't think of many reasons why you would need to have 20,000+ completely distinct molecules loaded in memory at one time. Conversely I could easily see how you might want to store multiple conformations of the same molecule. The different conformations would only require you to store the torsion angle (difference) information for each conformer, which is quite compact information. The larger data of the atom types, charges, and connectivity information only needs to be stored once. Using this approach you can gain substantial memory savings.
 
3) How are these molecules being represented, and is the representation that you are using optimal? Usually when dealing with large numbers of molecules a very compact representation of molecular structure (such as a SMILES string) is used. Handling tens of thousands of SMILES strings is trivial and will easily fit into memory. The SMILES string representation can be converted to connectivity information using standard routines (such as those found in the OpenBabel library) or even into reasonable 3D-structures quite quickly by CORINA.  
 
4) How is your vector of molecules being accessed? Beyond databases I can't think of many applications that would require more than one or two molecules to be considered at a time, so having 20,000+ loaded simultaneously seems excessive! In your own proposed solution (of reading in the molecules as a block, using them, freeing them, and reading in the next block) you seem to be suggesting that the processing is quite linear. If that is the case, could you not make the block size as small as one molecule and, instead of getting the 'CMolecule' from the vector, access it straight out of the file where it is stored? You should always have enough memory for this!
 
5) What file format are your molecules stored in on disk? Although I've done it myself, leaving your 20,000 molecules in an SD file is far from the optimal way of storing large numbers of structures! Storing them in an easily serialized binary format is much more efficient. This is related to my next and final point...
 
6) What features of std::vector<> are you using? And what does a CMolecule look like in memory? Consider the situation where your molecules are stored on disk in a format that is identical to the format that is going to be used in memory (similar to how bitmap data is stored in a BMP file). You can then 'load' your entire database of 20,000+ molecules into memory by simply mapping the file into your application's address space. In principle this will give you the potential to have a database of up to 2GB apparently 'in memory' without having to have the 2GB of physical RAM to back the allocation. I believe that this is the methodology used by the Chemical Computing Group's MOE package, which I've used to store and manipulate databases of over 3,000,000 (small) molecules with absolutely no problems whatsoever. Of course if you really need std::vector<> functionality this approach becomes a lot more involved...
 
Anyway those were just some thoughts.. 
 
Kind regards
 
Daniel
 
 
[Daniel Robinson]  -----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Juan Carlos
Sent: 26 October 2004 20:18
To: [EMAIL PROTECTED]
Subject: [msvc] Management of large vectors

Hi,

 

I'm writing an application that performs complex computations on biomolecules. The application needs to read a number of different structures (= molecules) from disk and then perform the necessary calculations. These molecules are held in an STL vector, for example:

std::vector<CMolecule> m_vMolecules;

 

This works fine as long as the vector doesn't exceed the available RAM. Unfortunately, this is not the general case: the number of molecules to be read is typically higher than 20,000, and when the RAM is not enough, the application's performance degrades dramatically. Perhaps I could read only as many molecules as fit in memory and, when the rest are needed, release the first chunk and read a new one. Is this a good approach? If so, any advice on the best way to implement this scheme?

 

Thanks in advance,

Carlos


_______________________________________________
msvc mailing list
[EMAIL PROTECTED]
See http://beginthread.com/mailman/listinfo/msvc_beginthread.com for subscription changes, and list archive.