Hi Greg,

thanks for your input, this is quite a bit faster!

Cheers,
Jose Manuel

On 21.02.19 09:48, Greg Landrum wrote:
This is a nice solution to the problem. Thanks for sharing it!

I think there is, however, a minor mistake. This line:

df['mol'] = df['mol'].map(lambda x: base64.b64encode(pickle.dumps(x)).decode())

should be:

df['mol'] = df['mol'].map(lambda x: base64.b64encode(x.ToBinary()).decode())

You could also fix this by changing how you decode the column, but this approach is faster.
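
A minimal sketch of the two consistent pairings, for illustration only. The two-molecule DataFrame and the variable names are assumptions made for this example; no timing is measured here.

```python
import base64
import pickle

import pandas as pd
from rdkit import Chem

# Small illustrative frame (the molecules are arbitrary examples)
df = pd.DataFrame(
    {"mol": [Chem.MolFromSmiles("C1CCCCC1"), Chem.MolFromSmiles("c1ccccc1O")]}
)

# Option A: encode with ToBinary(), decode with the Chem.Mol(bytes) constructor
encoded_binary = df["mol"].map(lambda m: base64.b64encode(m.ToBinary()).decode())
decoded_binary = [Chem.Mol(base64.b64decode(s)) for s in encoded_binary]

# Option B: keep pickle.dumps() on the way in and use pickle.loads() on the way out
encoded_pickle = df["mol"].map(lambda m: base64.b64encode(pickle.dumps(m)).decode())
decoded_pickle = [pickle.loads(base64.b64decode(s)) for s in encoded_pickle]

# Both routes should give back equivalent molecules
smiles_original = [Chem.MolToSmiles(m) for m in df["mol"]]
smiles_a = [Chem.MolToSmiles(m) for m in decoded_binary]
smiles_b = [Chem.MolToSmiles(m) for m in decoded_pickle]
```

The point is only that the encoder and decoder have to match: ToBinary() output goes with Chem.Mol(), and pickle.dumps() output goes with pickle.loads().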

-greg


On Mon, Feb 18, 2019 at 11:28 AM Jose Manuel Gally <jose.manuel.ga...@gmail.com> wrote:

    Dear all,

    in case this is helpful for others, here is the solution I came up
    with by combining 2 snippets of code [1, 2]:

    # init
    import base64
    import pickle
    import pandas as pd
    from rdkit import Chem
    n_records = 100000
    file = '/tmp/test.hdf'
    key = 'test'
    df = pd.DataFrame({'mol': [Chem.MolFromSmiles('C1CCCCC1')] * n_records})

    # store the molecule as base64 encoding strings
    df['mol'] = df['mol'].map(lambda x: base64.b64encode(pickle.dumps(x)).decode())
    df.to_hdf(file, key=key)

    # read the stored molecules and convert them back to molecules
    df = pd.read_hdf(file, key=key)
    df['mol'] = df['mol'].map(lambda x: Chem.Mol(base64.b64decode(x)))

    This is much faster than exporting to MolBlock because there is no
    need for reparsing molecules and I got rid of the Pytables warning.
    With this I could even just use good old csv files instead of hdf.
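
A rough sketch of the CSV variant, under a few assumptions: the molecules are encoded with ToBinary() so that Chem.Mol() can read them back, an in-memory io.StringIO buffer stands in for a real CSV file, and the "name" column and the two molecules are made up for the example.

```python
import base64
import io

import pandas as pd
from rdkit import Chem

# Illustrative data (names and molecules are arbitrary)
df = pd.DataFrame(
    {
        "name": ["cyclohexane", "benzene"],
        "mol": [Chem.MolFromSmiles("C1CCCCC1"), Chem.MolFromSmiles("c1ccccc1")],
    }
)

# Store each molecule as a base64 string of its RDKit binary form
df["mol"] = df["mol"].map(lambda m: base64.b64encode(m.ToBinary()).decode())

# Write to CSV; the buffer stands in for a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Read the strings back and rebuild the molecules
restored = pd.read_csv(buffer, dtype={"mol": str})
restored_mols = [Chem.Mol(base64.b64decode(s)) for s in restored["mol"]]
restored_smiles = [Chem.MolToSmiles(m) for m in restored_mols]
```

The base64 alphabet contains no commas or quotes, so the encoded column needs no special CSV escaping.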

    Cheers,
    Jose Manuel

    Refs:
    [1] https://github.com/rdkit/UGM_2016/blob/master/Notebooks/Pahl_NotebookTools_Tutorial.ipynb
    [2] http://rdkit.blogspot.com/2016/09/avoiding-unnecessary-work-and.html



    On 15.02.19 22:21, Jose Manuel Gally wrote:

    Dear Peter,

    thank you for your reply.

    That might work for me, I'll look into it.

    As a side note, if I convert the Mol into RWMol, I don't get the
    warning anymore (but then I cannot read the molecules anymore...)

    Cheers,
    Jose Manuel

    On 15.02.19 17:14, Peter St. John wrote:
    You might be better off not storing the RDKit molecule objects
    themselves in the hdf file, but rather some other representation
    of the molecule. If you need 3D atom coordinates, you could call
    MolToMolBlock() on each of the RDKit mols, and then
    MolFromMolBlock() later to regenerate them. If you don't need the
    3D atom coordinates to be saved, SMILES strings would work well.
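
A small sketch of the two text representations mentioned above, assuming a toy DataFrame; the column names and example molecules are made up for illustration. These molecules carry no 3D conformer, so the MolBlock here only demonstrates the round trip, not coordinate storage.

```python
import pandas as pd
from rdkit import Chem

# Arbitrary example molecules
mols = [Chem.MolFromSmiles("C1CCCCC1"), Chem.MolFromSmiles("CCO")]

# Plain-string columns that PyTables can store without pickling
df = pd.DataFrame(
    {
        "molblock": [Chem.MolToMolBlock(m) for m in mols],  # keeps coordinates
        "smiles": [Chem.MolToSmiles(m) for m in mols],      # topology only
    }
)

# Regenerate molecules later from either column (both reparse the text)
from_molblock = [Chem.MolFromMolBlock(block) for block in df["molblock"]]
from_smiles = [Chem.MolFromSmiles(smi) for smi in df["smiles"]]

smiles_from_molblock = [Chem.MolToSmiles(m) for m in from_molblock]
smiles_from_smiles = [Chem.MolToSmiles(m) for m in from_smiles]
```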

    PyTables is expecting each entry to be something like an 'int',
    'string', 'float64', etc. So the RDKit mol object is a fairly
    odd data structure for that library; and it's just warning you
    that it will have to use Python's `pickle` module to serialize it.

    On Fri, Feb 15, 2019 at 6:35 AM Jose Manuel Gally
    <jose.manuel.ga...@gmail.com> wrote:

        Hi all,

        I am working on some molecules in a pandas DataFrame and have to
        export them to an HDF file.

        This works just fine, but I get a PerformanceWarning about mixed
        types. (1)

        Why are RDKit Mol objects causing this warning in the first
        place? Am I doing something wrong?

        Please find attached a small notebook with an example.

        For now I set the type of the HDF file to 'table', but I'm
        unsure this is the best work-around.

        Also, invoking pytest with the --disable-warnings flag removes
        the message, but the warning itself remains.

        Thanks in advance for any insight!

        Cheers,
        Jose Manuel

        (1) PerformanceWarning:
        your performance may suffer as PyTables will pickle object types that it cannot
        map directly to c-types [inferred_type->mixed,key->values] [items->None]

          return pytables.to_hdf(path_or_buf, key, self, **kwargs)

        _______________________________________________
        Rdkit-discuss mailing list
        Rdkit-discuss@lists.sourceforge.net
        https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

