This is a nice solution to the problem. Thanks for sharing it!

I think there is, however, a minor mistake. This line:

df['mol'] = df['mol'].map(lambda x: base64.b64encode(pickle.dumps(x)).decode())

should be:

df['mol'] = df['mol'].map(lambda x: base64.b64encode(x.ToBinary()).decode())

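The decode step, Chem.Mol(base64.b64decode(x)), expects the binary format
produced by ToBinary(), not a pickle stream. You could also fix this by
changing how you decode the column, but this approach is faster.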
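
In case it helps to see why the encode and decode sides have to match, here
is a minimal sketch. It deliberately avoids RDKit so it runs anywhere: a plain
SMILES string stands in for the molecule, str.encode stands in for
x.ToBinary(), and bytes.decode stands in for Chem.Mol(...). The helper names
to_text and from_text are made up for illustration.

```python
import base64
import pickle


def to_text(obj, to_binary):
    """Serialize obj with to_binary, then wrap the bytes as base64 text."""
    return base64.b64encode(to_binary(obj)).decode()


def from_text(text, from_binary):
    """Undo to_text: strip the base64 layer, then rebuild with from_binary."""
    return from_binary(base64.b64decode(text))


# Stand-ins (assumptions, not RDKit): a SMILES string plays the molecule,
# str.encode plays x.ToBinary(), and bytes.decode plays Chem.Mol(...).
mol = "C1CCCCC1"

# Matched pair 1: the object's own binary format on both sides.
native_text = to_text(mol, str.encode)
assert from_text(native_text, bytes.decode) == mol

# Matched pair 2: pickle on both sides.
pickled_text = to_text(mol, pickle.dumps)
assert from_text(pickled_text, pickle.loads) == mol

# Mismatch: pickled on the way in, native decoder on the way out.
# The pickle stream is not valid input for the native decoder, so it fails.
try:
    from_text(pickled_text, bytes.decode)
    mismatch_ok = True
except UnicodeDecodeError:
    mismatch_ok = False
```

With RDKit, the first pair corresponds to x.ToBinary() on the way in and
Chem.Mol(...) on the way out. The second presumably corresponds to keeping
pickle.dumps and decoding with pickle.loads instead.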

-greg


On Mon, Feb 18, 2019 at 11:28 AM Jose Manuel Gally <
jose.manuel.ga...@gmail.com> wrote:

> Dear all,
>
> in case this is helpful for others, here is the solution I came up with by
> combining 2 snippets of code [1, 2]:
>
> # init
> import base64
> import pickle
> import pandas as pd
> from rdkit import Chem
> n_records = 100000
> file='/tmp/test.hdf'
> key='test'
> df = pd.DataFrame({'mol': [Chem.MolFromSmiles('C1CCCCC1')] * n_records})
>
> # store the molecule as base64 encoding strings
> df['mol'] = df['mol'].map(lambda x:
> base64.b64encode(pickle.dumps(x)).decode())
> df.to_hdf(file, key=key)
>
> # read the stored molecules and convert them back to molecules
> df = pd.read_hdf(file, key=key)
> df['mol'] = df['mol'].map(lambda x: Chem.Mol(base64.b64decode(x)))
>
> This is much faster than exporting to MolBlock because there is no need
> for reparsing molecules and I got rid of the Pytables warning.
> With this I could even just use good old csv files instead of hdf.
>
> Cheers,
> Jose Manuel
>
> Refs:
> [1]
> https://github.com/rdkit/UGM_2016/blob/master/Notebooks/Pahl_NotebookTools_Tutorial.ipynb
> [2] http://rdkit.blogspot.com/2016/09/avoiding-unnecessary-work-and.html
>
>
>
> On 15.02.19 22:21, Jose Manuel Gally wrote:
>
> Dear Peter,
>
> thank you for your reply.
>
> That might work for me, I'll look into it.
>
> As a side note, if I convert the Mol into RWMol, I don't get the warning
> anymore (but then I cannot read the molecules anymore...)
>
> Cheers,
> Jose Manuel
> On 15.02.19 17:14, Peter St. John wrote:
>
> You might be better off not storing the RDKit molecule objects themselves
> in the hdf file, but rather some other representation of the molecule. If
> you need 3D atom coordinates, you could call MolToMolBlock() on each of the
> rdkit mols, and then MolFromMolBlock later to regenerate them. If you don't
> need 3D atom coordinates to get saved, SMILES strings would work well.
>
> PyTables is expecting each entry to be something like an 'int', 'string',
> 'float64', etc. So the RDKit mol object is a fairly odd data structure for
> that library; and it's just warning you that it will have to use Python's
> `pickle` module to serialize it.
>
> On Fri, Feb 15, 2019 at 6:35 AM Jose Manuel Gally <
> jose.manuel.ga...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am working on some molecules in a pandas DataFrame and have to export
>> them to a hdf file.
>>
>> This works just fine but I get a warning about Performance due to mixed
>> types. (1)
>>
>> Why are RDKit Mol objects causing this warning in the first place? Am I
>> doing something wrong?
>>
>> Please find attached a small notebook with an example.
>>
>> For now I set the type of hdf to 'table', but I'm unsure this is the
>> best work-around.
>>
>> Also, invoking pytest with the --disable-warnings flag hides the message,
>> but the warning itself is still raised.
>>
>> Thanks in advance for any insight!
>>
>> Cheers,
>> Jose Manuel
>>
>> (1) PerformanceWarning:
>> your performance may suffer as PyTables will pickle object types that it
>> cannot
>> map directly to c-types [inferred_type->mixed,key->values] [items->None]
>>
>>    return pytables.to_hdf(path_or_buf, key, self, **kwargs)
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>