Re: [Rdkit-discuss] Source of the solubility data?

2020-02-25 Thread Greg Landrum
Hi Zhenting,

That's the Huuskonen dataset. The reference is here:
https://pubs.acs.org/doi/10.1021/ci9901338
The origins of the SDF itself are unfortunately lost in antiquity. I
originally got them here:
http://cheminformatics.org/datasets/huuskonen/index.html
but cheminformatics.org no longer exists. archive.org isn't working at the
moment, but when it's back someone could check there to try and figure out
who curated the SDF.

-greg


On Mon, Feb 24, 2020 at 11:53 PM Gao Zhenting wrote:

> Hi Greg,
>
> I am trying to reproduce some machine learning scripts using
>
> https://github.com/rdkit/rdkit/blob/master/Docs/Book/data/solubility.test.sdf
> https://github.com/rdkit/rdkit/blob/master/Docs/Book/data/solubility.train.sdf
>
> What is the source of these data? How are they organized? Any citation?
>
> Best regards
> Zhenting
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

2020-02-25 Thread Tim Dudgeon
I think you need to explain what benchmarks you are running and what is 
really meant by "faster".
And what hardware (for Spark how many nodes, how big; for PostgreSQL 
what size server, what settings esp. the shared_buffers setting).


A very obvious critique of what you reported is that what you describe 
as "running in Python" includes generating the fingerprints for each 
molecule on the fly, whereas for "the cartridge" these are already 
calculated, so will obviously be much faster (as the fingerprint 
generation dominates the compute).
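
To make the distinction concrete, here is a minimal sketch of separating the two steps in Python. It assumes Morgan bit-vector fingerprints with radius 2 and 2048 bits (both values are assumptions, not necessarily what the cartridge's morganbv_fp uses); the three-molecule library and the helper name morgan_fp are purely illustrative and do not come from the original post. The fingerprints are generated once, and only the comparison is repeated per query:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def morgan_fp(smiles, radius=2, n_bits=2048):
    """Return a Morgan bit-vector fingerprint, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)


# Hypothetical stand-in for the 2-million-entry file.
library_smiles = ["CCO", "c1ccccc1O", "CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O"]

# Step 1 (expensive, done once): parse every molecule and keep its fingerprint.
library_fps = [morgan_fp(smi) for smi in library_smiles]

# Step 2 (cheap, repeated per query): compare against the stored fingerprints.
query_fp = morgan_fp("CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O")
similarities = DataStructs.BulkTanimotoSimilarity(query_fp, library_fps)

for smi, sim in zip(library_smiles, similarities):
    print(f"{smi}\t{sim:.2f}")
```

A fair benchmark would time step 2 on its own, since that is the part the database query with a precomputed fingerprint column is actually doing.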


Tim

On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:

Hi Gurus,

I'm absolutely new to the cheminformatics domain. I've been assigned a 
PoC where I have to compare RDKit in Python and RDKit on PostgreSQL. 
I've installed both and am trying some hands-on exercises to 
understand the differences. What I've understood is that structure 
searches are slower in Python (Spark cluster) than in a PostgreSQL 
database. Please correct me if I'm wrong, as I'm a newbie in this and 
may be talking nonsense.

The similarity search using the functions below (example) -

Python methods -

fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))

similarity = DataStructs.TanimotoSimilarity(fps1,fps2)

takes too long (45 minutes) for a file of 2 million entries, while the 
same thing is very quick (seconds) on PostgreSQL.

Database functions -

select count(*) from (
  select modality_id, m,
         tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
  from fingerprints join mols using (modality_id)
) as fps
where similarity between 0.45 and 0.50;

Does this mean that for production workloads one must always use a 
database cartridge, such as RDKit, BINGO, etc.?

Regards,
DA




Re: [Rdkit-discuss] Draw.MolsToGridImage error: got multiple values for keyword argument

2020-02-25 Thread Konrad Koehler via Rdkit-discuss
Hi Paolo,

Your solution works perfectly. Thank you!

- Konrad

From: Paolo Tosco
Date: Tuesday, February 25, 2020 at 6:34 PM
To: Konrad Koehler, RDKit Discuss
Subject: Re: [Rdkit-discuss] Draw.MolsToGridImage error: got multiple values for keyword argument

Hi Konrad,

you should use the highlightAtomLists parameter rather than highlightAtoms, 
then your example will work.

Cheers,
p.



Re: [Rdkit-discuss] Draw.MolsToGridImage error: got multiple values for keyword argument

2020-02-25 Thread Paolo Tosco

Hi Konrad,

you should use the highlightAtomLists parameter rather than 
highlightAtoms, then your example will work.
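
A minimal sketch of the corrected call follows. The two molecules and the atom indices are hypothetical stand-ins for act_mols and the stored property values. The likely cause of the original error, stated here as an assumption rather than a confirmed reading of the RDKit source, is that MolsToGridImage forwards unrecognised keyword arguments to the underlying drawing call, which already receives its own highlightAtoms value built from highlightAtomLists:

```python
from ast import literal_eval

from rdkit import Chem
from rdkit.Chem import Draw

# Hypothetical molecules standing in for act_mols.
act_mols = [Chem.MolFromSmiles(smi) for smi in ("Fc1ccccc1", "CCO")]
highlights = [(0, 1), (1, 2)]  # illustrative atom indices, one tuple per molecule

for mol, atoms in zip(act_mols, highlights):
    mol.SetProp("_Name", Chem.MolToSmiles(mol))
    # SetProp stores string values, so the tuple is serialised with str().
    mol.SetProp("highlight_atom_numbers", str(atoms))

# literal_eval turns each stored string back into a tuple; list() makes it a plain list.
highlight_lists = [
    list(literal_eval(m.GetProp("highlight_atom_numbers"))) for m in act_mols
]

img = Draw.MolsToGridImage(
    act_mols,
    molsPerRow=4,
    subImgSize=(200, 200),
    legends=[m.GetProp("_Name") for m in act_mols],
    highlightAtomLists=highlight_lists,  # one list of atom indices per molecule
)
```

Outside a notebook the returned object is expected to be a PIL image that can be written out with img.save(...).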


Cheers,
p.

On 25/02/2020 16:02, Konrad Koehler via Rdkit-discuss wrote:


Hi everyone,

I am having trouble using a mol property value to define highlighted 
atoms when generating an image.

Starting from the beginning, I have defined a variable 
highlight_atom_numbers as a tuple:

>>> type(highlight_atom_numbers)

>>>

And set a mol property to this value:

mol.SetProp("highlight_atom_numbers",str(highlight_atom_numbers))

I then tried to create a 2D image of the molecule with the 
“highlight_atom_numbers” highlighted:

img=Draw.MolsToGridImage(
    act_mols,
    molsPerRow=4,
    subImgSize=(200,200),
    legends=[x.GetProp("_Name") for x in act_mols],
    highlightAtoms=[literal_eval(x.GetProp("highlight_atom_numbers")) for x in act_mols]
)

Which generates the following error message:

TypeError: Boost.Python.function() got multiple values for keyword argument 'highlightAtoms'

The literal_eval function should return a single tuple and not multiple values.

Does anyone have any ideas for getting this to work? Thanks.

- Konrad

PS: I have tried googling for a solution, for example:

Stack Overflow: TypeError got multiple values for keyword argument

and tried the suggestions there, but that did not help.





[Rdkit-discuss] Draw.MolsToGridImage error: got multiple values for keyword argument

2020-02-25 Thread Konrad Koehler via Rdkit-discuss
Hi everyone,

I am having trouble using a mol property value to define highlighted atoms when 
generating an image.

Starting from the beginning, I have defined a variable highlight_atom_numbers 
as a tuple:

>>> type(highlight_atom_numbers)

>>>

And set a mol property to this value:

mol.SetProp("highlight_atom_numbers",str(highlight_atom_numbers))

I then tried to create a 2D image of the molecule with the 
“highlight_atom_numbers” highlighted:

img=Draw.MolsToGridImage(
    act_mols,
    molsPerRow=4,
    subImgSize=(200,200),
    legends=[x.GetProp("_Name") for x in act_mols],
    highlightAtoms=[literal_eval(x.GetProp("highlight_atom_numbers")) for x in act_mols]
)

Which generates the following error message:

TypeError: Boost.Python.function() got multiple values for keyword argument 'highlightAtoms'

The literal_eval function should return a single tuple and not multiple values.

Does anyone have any ideas for getting this to work? Thanks.

- Konrad

PS: I have tried googling for a solution, for example:

Stack Overflow: TypeError got multiple values for keyword argument

and tried the suggestions there, but that did not help.



Re: [Rdkit-discuss] Substructure match when using smarts containing more than one part

2020-02-25 Thread Alexis Parenty
Hi Greg,

Yes, that's what I was after: recursive SMARTS.
Thanks a lot,
Alexis

On Tue, 25 Feb 2020 at 14:53, Greg Landrum wrote:

> Hi Alexis,
>
> There's not really. The substructure matching algorithm looks for a match
> for each atom of the query. So if your query has 8 atoms in it (as yours
> does), then it needs to match 8 separate atoms in the molecule.
>
> What exactly are you trying to match here? Do you just want to see whether
> or not a molecule has an F connected to an aromatic atom and a 6-membered
> all-carbon aromatic ring? That SMARTS is Fa.[$(c1ccccc1)]
>
> -greg
>
> On Tue, Feb 25, 2020 at 8:46 AM Alexis Parenty <
> alexis.parenty.h...@gmail.com> wrote:
>
>> Dear RDkiter,
>>
>> Using HasSubstructMatch() I can match the SMARTS "F[a]" and "c1ccccc1"
>> against "Fc1ccccc1". However, when I put the two fragments together in
>> "F[a].c1ccccc1" it no longer matches. I suppose this is the desired
>> behaviour, since the aromatic atom [a] from F[a] is also part of
>> c1ccccc1 and would otherwise be counted twice.
>>
>> Is there a function parameter in HasSubstructMatch() that I am not
>> aware of and that could make "F[a].c1ccccc1" match "Fc1ccccc1" without
>> me having to separate the fragments and check the match for each part?
>>
>> Thanks,
>>
>> Alexis


Re: [Rdkit-discuss] Substructure match when using smarts containing more than one part

2020-02-25 Thread Greg Landrum
Hi Alexis,

There's not really. The substructure matching algorithm looks for a match
for each atom of the query. So if your query has 8 atoms in it (as yours
does), then it needs to match 8 separate atoms in the molecule.

What exactly are you trying to match here? Do you just want to see whether
or not a molecule has an F connected to an aromatic atom and a 6-membered
all-carbon aromatic ring? That SMARTS is Fa.[$(c1ccccc1)]
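
The atom-counting argument can be checked with a short sketch. It assumes the ring pattern under discussion is the six-membered aromatic carbon ring written out in full as c1ccccc1 and that the test molecule is fluorobenzene (Fc1ccccc1); the variable names are illustrative only:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("Fc1ccccc1")  # fluorobenzene: 7 heavy atoms

queries = ["F[a]", "c1ccccc1", "F[a].c1ccccc1", "Fa.[$(c1ccccc1)]"]
results = {}
for smarts in queries:
    patt = Chem.MolFromSmarts(smarts)
    # Every query atom must be mapped to a different atom of the molecule.
    results[smarts] = mol.HasSubstructMatch(patt)
    print(f"{smarts:20s} query atoms: {patt.GetNumAtoms()}  match: {results[smarts]}")
```

The two-fragment query asks for eight distinct atoms, which a seven-atom molecule cannot supply. In the recursive form the whole ring test is a property of a single query atom, so the query needs only three atoms and the ring may share atoms with the F-bearing carbon.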

-greg

On Tue, Feb 25, 2020 at 8:46 AM Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Dear RDkiter,
>
> Using HasSubstructMatch() I can match the SMARTS "F[a]" and "c1ccccc1"
> against "Fc1ccccc1". However, when I put the two fragments together in
> "F[a].c1ccccc1" it no longer matches. I suppose this is the desired
> behaviour, since the aromatic atom [a] from F[a] is also part of
> c1ccccc1 and would otherwise be counted twice.
>
> Is there a function parameter in HasSubstructMatch() that I am not
> aware of and that could make "F[a].c1ccccc1" match "Fc1ccccc1" without
> me having to separate the fragments and check the match for each part?
>
> Thanks,
>
> Alexis


[Rdkit-discuss] Substructure match when using smarts containing more than one part

2020-02-25 Thread Alexis Parenty
Dear RDkiter,

Using HasSubstructMatch() I can match the SMARTS "F[a]" and "c1ccccc1"
against "Fc1ccccc1". However, when I put the two fragments together in
"F[a].c1ccccc1" it no longer matches. I suppose this is the desired
behaviour, since the aromatic atom [a] from F[a] is also part of c1ccccc1
and would otherwise be counted twice.

Is there a function parameter in HasSubstructMatch() that I am not aware
of and that could make "F[a].c1ccccc1" match "Fc1ccccc1" without me having
to separate the fragments and check the match for each part?
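
For reference, the fallback mentioned above, checking each fragment on its own, might look like the following sketch. The helper name matches_all_fragments is hypothetical, and the ring pattern is assumed to be c1ccccc1; because each fragment is matched independently, the matched atoms are allowed to overlap:

```python
from rdkit import Chem


def matches_all_fragments(mol, fragment_smarts):
    """Return True if every SMARTS fragment matches mol; matched atoms may overlap."""
    patterns = [Chem.MolFromSmarts(s) for s in fragment_smarts]
    return all(mol.HasSubstructMatch(p) for p in patterns)


mol = Chem.MolFromSmiles("Fc1ccccc1")
both_present = matches_all_fragments(mol, ["F[a]", "c1ccccc1"])
pyridine_ring = matches_all_fragments(mol, ["F[a]", "c1ccncc1"])
print(both_present, pyridine_ring)
```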

Thanks,

Alexis


[Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

2020-02-25 Thread Deepti Gupta via Rdkit-discuss
Hi Gurus,

I'm absolutely new to the cheminformatics domain. I've been assigned a PoC where 
I have to compare RDKit in Python and RDKit on PostgreSQL. I've installed both 
and am trying some hands-on exercises to understand the differences. What I've 
understood is that structure searches are slower in Python (Spark cluster) 
than in a PostgreSQL database. Please correct me if I'm wrong, as I'm a newbie in 
this and may be talking nonsense.

The similarity search using the functions below (example) -

Python methods -

fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))

similarity = DataStructs.TanimotoSimilarity(fps1,fps2)

takes too long (45 minutes) for a file of 2 million entries, while the same thing 
is very quick (seconds) on PostgreSQL.

Database functions -

select count(*) from (
  select modality_id, m,
         tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
  from fingerprints join mols using (modality_id)
) as fps
where similarity between 0.45 and 0.50;

Does this mean that for production workloads one must always use a database 
cartridge, such as RDKit, BINGO, etc.?

Regards,
DA