This is a nice solution to the problem. Thanks for sharing it! I think there is, however, a minor mistake. This line:
df['mol'] = df['mol'].map(lambda x: base64.b64encode(pickle.dumps(x)).decode())
should be:
df['mol'] = df['mol'].map(lambda x: base64.b64encode(x.ToBinary()).decode())

You could also fix this by changing how you decode the column, but this approach is faster.

-greg

On Mon, Feb 18, 2019 at 11:28 AM Jose Manuel Gally <jose.manuel.ga...@gmail.com> wrote:

> Dear all,
>
> in case this is helpful for others, here is the solution I came up with by
> combining 2 snippets of code [1, 2]:
>
> # init
> import base64
> from rdkit import Chem
> n_records = 100000
> file='/tmp/test.hdf'
> key='test'
> df = pd.DataFrame({'mol': [Chem.MolFromSmiles('C1CCCCC1')] * n_records})
>
> # store the molecule as base64 encoding strings
> df['mol'] = df['mol'].map(lambda x: base64.b64encode(pickle.dumps(x)).decode())
> df.to_hdf(file, key=key)
>
> # read the stored molecules and convert them back to molecules
> df = df = pd.read_hdf(file, key=key)
> df['mol'] = df['mol'].map(lambda x: Chem.Mol(base64.b64decode(x)))
>
> This is much faster than exporting to MolBlock because there is no need
> for reparsing molecules and I got rid of the Pytables warning.
> With this I could even just use good old csv files instead of hdf.
>
> Cheers,
> Jose Manuel
>
> Refs:
> [1] https://github.com/rdkit/UGM_2016/blob/master/Notebooks/Pahl_NotebookTools_Tutorial.ipynb
> [2] http://rdkit.blogspot.com/2016/09/avoiding-unnecessary-work-and.html
>
> On 15.02.19 22:21, Jose Manuel Gally wrote:
>
> Dear Peter,
>
> thank you for your reply.
>
> That might work for me, I'll look into it.
>
> As a side note, if I convert the Mol into RWMol, I don't get the warning
> anymore (but then I cannot read the molecules anymore...)
>
> Cheers,
> Jose Manuel
>
> On 15.02.19 17:14, Peter St. John wrote:
>
> you might be better off not storing the molecule RDkit objects themselves
> in the hdf file; but rather some other representation of the molecule. If
> you need 3D atom coordinates, you could call MolToMolBlock() on each of the
> rdkit mols, and then MolFromMolBlock later to regenerate them. If you don't
> need 3D atom coordinates to get saved, SMILES strings would work well.
>
> PyTables is expecting each entry to be something like an 'int', 'string',
> 'float64', etc. So the RDKit mol object is a fairly odd data structure for
> that library; and it's just warning you that it will have to use Python's
> `pickle` module to serialize it.
>
> On Fri, Feb 15, 2019 at 6:35 AM Jose Manuel Gally <jose.manuel.ga...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am working on some molecules in a pandas DataFrame and have to export
>> them to a hdf file.
>>
>> This works just fine but I get a warning about Performance due to mixed
>> types. (1)
>>
>> Why are RDKIT Mol objects causing this warning in the first place? Am I
>> doing something wrong?
>>
>> Please find attached a small notebook with an example.
>>
>> For now I set the type of hdf to 'table', but I'm unsure this is the
>> best work-around.
>>
>> Also, invoking pytest with --disable-warnings flag removes the message
>> but the warning itself remains.
>>
>> Thanks in advance for any hindsight!
>>
>> Cheers,
>> Jose Manuel
>>
>> (1) PerformanceWarning:
>> your performance may suffer as PyTables will pickle object types that it
>> cannot
>> map directly to c-types [inferred_type->mixed,key->values] [items->None]
>>
>> return pytables.to_hdf(path_or_buf, key, self, **kwargs)
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
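
P.S. For anyone landing on this thread later, here is a minimal sketch of the full round trip using ToBinary() for encoding and Chem.Mol() for decoding, with all the imports it needs. The helper names mol_to_b64 and b64_to_mol are made up for illustration, and the example goes through an in-memory CSV buffer rather than an HDF file (an assumption made to keep it self-contained; df.to_hdf and pd.read_hdf should slot into the same place).

```python
import base64
import io

import pandas as pd
from rdkit import Chem


def mol_to_b64(mol):
    """Hypothetical helper: RDKit Mol -> base64 text via the Mol's binary form."""
    return base64.b64encode(mol.ToBinary()).decode()


def b64_to_mol(text):
    """Hypothetical helper: base64 text written by mol_to_b64 -> RDKit Mol."""
    return Chem.Mol(base64.b64decode(text))


# A small frame of molecules (the thread used 100000 records; 5 is enough here).
df = pd.DataFrame({"mol": [Chem.MolFromSmiles("C1CCCCC1")] * 5})

# Encode: the column now holds plain strings instead of Python objects.
df["mol"] = df["mol"].map(mol_to_b64)

# Round trip through CSV in memory (stand-in for the HDF file in the thread).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Decode: rebuild the molecules from the stored strings, no reparsing needed.
restored["mol"] = restored["mol"].map(b64_to_mol)
```

If the pickle.dumps() encoding is kept instead, the decode side would presumably need pickle.loads(base64.b64decode(x)) rather than Chem.Mol(...), so that the two halves match.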