Re: [Rdkit-discuss] warnings when exporting pandas tables with molecules to hdf

Jose Manuel Gally Mon, 18 Feb 2019 02:29:00 -0800

Dear all,

In case this is helpful for others, here is the solution I came up with by combining two snippets of code [1, 2]:


# init
import base64
import pandas as pd
from rdkit import Chem
n_records = 100000
file = '/tmp/test.hdf'
key = 'test'
df = pd.DataFrame({'mol': [Chem.MolFromSmiles('C1CCCCC1')] * n_records})

# store each molecule as a base64-encoded string of its RDKit binary form
# (ToBinary() output is what Chem.Mol() expects when reading back)
df['mol'] = df['mol'].map(lambda x: base64.b64encode(x.ToBinary()).decode())

df.to_hdf(file, key=key)

# read the stored strings and convert them back to molecules
df = pd.read_hdf(file, key=key)
df['mol'] = df['mol'].map(lambda x: Chem.Mol(base64.b64decode(x)))

This is much faster than exporting to MolBlock because there is no need for reparsing molecules, and I got rid of the PyTables warning.

With this I could even just use good old csv files instead of hdf.
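
The CSV idea can be sketched with the standard library alone. This is a minimal illustration, not the code used in the thread: the byte strings below are stand-ins for what mol.ToBinary() would return, and the column name 'mol' is simply reused from the example above. The point is that base64 text contains no commas, quotes or newlines, so it survives a CSV round trip unchanged.

```python
import base64
import csv
import io

# Stand-ins for the bytes returned by mol.ToBinary(); any byte string works here.
payloads = [b"\x00\x01binary,with\ncommas and newlines", b"\xef\xbe\xad\xde second"]

# Base64 text only uses A-Z, a-z, 0-9, '+', '/' and '=', so it is safe in a CSV cell.
encoded = [base64.b64encode(p).decode("ascii") for p in payloads]

# Write a one-column CSV into an in-memory buffer (a real file would work the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["mol"])
writer.writerows([e] for e in encoded)

# Read the CSV back and decode each cell to the original bytes.
buffer.seek(0)
reader = csv.DictReader(buffer)
decoded = [base64.b64decode(row["mol"]) for row in reader]
```

With real molecules, each decoded byte string would then be passed to Chem.Mol() as in the HDF example.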

Cheers,
Jose Manuel

Refs:

[1] https://github.com/rdkit/UGM_2016/blob/master/Notebooks/Pahl_NotebookTools_Tutorial.ipynb

[2] http://rdkit.blogspot.com/2016/09/avoiding-unnecessary-work-and.html



On 15.02.19 22:21, Jose Manuel Gally wrote:

Dear Peter,

thank you for your reply.

That might work for me, I'll look into it.
As a side note, if I convert the Mol into RWMol, I don't get the warning anymore (but then I cannot read the molecules anymore...)
Cheers,
Jose Manuel

On 15.02.19 17:14, Peter St. John wrote:
You might be better off not storing the RDKit molecule objects themselves in the hdf file, but rather some other representation of the molecule. If you need 3D atom coordinates, you could call MolToMolBlock() on each of the rdkit mols, and then MolFromMolBlock later to regenerate them. If you don't need 3D atom coordinates to get saved, SMILES strings would work well.
PyTables is expecting each entry to be something like an 'int', 'string', 'float64', etc. So the RDKit mol object is a fairly odd data structure for that library, and it's just warning you that it will have to use Python's `pickle` module to serialize it.
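
The two text representations suggested above could be used roughly as follows. This is a small sketch on a single molecule (cyclohexane, as used elsewhere in the thread), not a full DataFrame export; the variable names are illustrative only.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("C1CCCCC1")  # cyclohexane, as in the thread

# SMILES: compact plain text, but atom coordinates are not stored.
smiles = Chem.MolToSmiles(mol)
mol_from_smiles = Chem.MolFromSmiles(smiles)

# MolBlock: plain text that keeps whatever coordinates the molecule's conformer carries.
molblock = Chem.MolToMolBlock(mol)
mol_from_block = Chem.MolFromMolBlock(molblock)
```

Both `smiles` and `molblock` are ordinary Python strings, so a DataFrame column holding them contains only text rather than RDKit objects.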
On Fri, Feb 15, 2019 at 6:35 AM Jose Manuel Gally <jose.manuel.ga...@gmail.com> wrote:
    Hi all,

    I am working on some molecules in a pandas DataFrame and have to export
    them to a hdf file.

    This works just fine but I get a warning about Performance due to mixed
    types. (1)

    Why are RDKit Mol objects causing this warning in the first place? Am I
    doing something wrong?

    Please find attached a small notebook with an example.

    For now I set the type of hdf to 'table', but I'm unsure this is the
    best work-around.

    Also, invoking pytest with --disable-warnings flag removes the message
    but the warning itself remains.

    Thanks in advance for any hindsight!

    Cheers,
    Jose Manuel

    (1) PerformanceWarning:
    your performance may suffer as PyTables will pickle object types that it cannot
    map directly to c-types [inferred_type->mixed,key->values] [items->None]

       return pytables.to_hdf(path_or_buf, key, self, **kwargs)

    _______________________________________________
    Rdkit-discuss mailing list
    Rdkit-discuss@lists.sourceforge.net
    <mailto:Rdkit-discuss@lists.sourceforge.net>
    https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

