date:20200226

Re: [Rdkit-discuss] RDKit in C++

2020-02-26 Thread topgunhaides .

Hey Paolo and David,

Thanks a lot!
This is probably the most helpful resource I can use. It is great that you
are planning to add new stuff in there and update things.

One reason I want to port my Python code to C++ is to improve
efficiency. (I need to do a series of RDKit tasks such as embedding
conformers, computing RMS between conformers, Shape Tanimoto distances,
etc., along with a lot of my own programming logic.)
In addition, profiling my Python code showed that the RMS (bestrms) step
is the bottleneck. Is a C++ version of the RMS code coming soon?

I will keep tracking the changes you make in the near future. Really
appreciate it!

Best,
Leon




On Wed, Feb 26, 2020 at 11:17 AM David Cosgrove 
wrote:

> Hi Leon,
> There is indeed such a thing.  It's not as complete as the Python one, as
> it was rather more work than I anticipated.  Also, I haven't been keeping
> the examples up to date, especially the newer ways of iterating over atoms
> and bonds, and the CMakeLists.txt. It should give you some useful pointers,
> however. You can find it here:
> https://github.com/rdkit/rdkit/blob/master/Docs/Book/GettingStartedInC%2B%2B.md,
> which should be in $RDBASE/Docs/Book if you have cloned the repo.  The
> examples are in C++Examples in that directory also.
> I will try and find time over the next few weeks to make the examples
> current.  Also, underneath $RDBASE/Code there are lots of files called
> test*cpp which are the unit tests for the various parts, and they have
> useful stuff in them as well.
> Cheers,
> Dave
>
>
> On Wed, Feb 26, 2020 at 3:53 PM topgunhaides . 
> wrote:
>
>> Hi guys,
>>
>> I noticed that someone asked this question some years ago.
>> Since it is now 2020, do we have anything like "Getting Started with
>> the RDKit in C++"?
>>
>> I am planning to port my RDKit Python code to C++.
>> Can anyone point me to some resources? I found some, but I may have
>> missed important ones. Any suggestions are very welcome. Thanks!
>>
>> Best,
>> Leon
>>
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] RDKit in C++

2020-02-26 Thread David Cosgrove

Hi Leon,
There is indeed such a thing.  It's not as complete as the Python one, as
it was rather more work than I anticipated.  Also, I haven't been keeping
the examples up to date, especially the newer ways of iterating over atoms
and bonds, and the CMakeLists.txt. It should give you some useful pointers,
however. You can find it here:
https://github.com/rdkit/rdkit/blob/master/Docs/Book/GettingStartedInC%2B%2B.md,
which should be in $RDBASE/Docs/Book if you have cloned the repo.  The
examples are in C++Examples in that directory also.
I will try and find time over the next few weeks to make the examples
current.  Also, underneath $RDBASE/Code there are lots of files called
test*cpp which are the unit tests for the various parts, and they have
useful stuff in them as well.
Cheers,
Dave

On Wed, Feb 26, 2020 at 3:53 PM topgunhaides .  wrote:

> Hi guys,
>
> I noticed that someone asked this question some years ago.
> Since it is now 2020, do we have anything like "Getting Started with
> the RDKit in C++"?
>
> I am planning to port my RDKit Python code to C++.
> Can anyone point me to some resources? I found some, but I may have
> missed important ones. Any suggestions are very welcome. Thanks!
>
> Best,
> Leon
>
>

-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk

Re: [Rdkit-discuss] RDKit in C++

2020-02-26 Thread Paolo Tosco


Hi Leon,

There is a nice document produced by David Cosgrove and Greg Landrum:

https://github.com/rdkit/rdkit/blob/master/Docs/Book/GettingStartedInC%2B%2B.md

The RDKit C++ unit tests, C++ API documentation, and headers are also
very helpful.


Cheers,
p.

On 26/02/2020 15:51, topgunhaides . wrote:

Hi guys,

I noticed that someone asked this question some years ago.
Since it is now 2020, do we have anything like "Getting Started
with the RDKit in C++"?


I am planning to port my RDKit Python code to C++.
Can anyone point me to some resources? I found some, but I may
have missed important ones. Any suggestions are very welcome. Thanks!


Best,
Leon





[Rdkit-discuss] RDKit in C++

2020-02-26 Thread topgunhaides .

Hi guys,

I noticed that someone asked this question some years ago.
Since it is now 2020, do we have anything like "Getting Started with
the RDKit in C++"?

I am planning to port my RDKit Python code to C++.
Can anyone point me to some resources? I found some, but I may have
missed important ones. Any suggestions are very welcome. Thanks!

Best,
Leon

Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

2020-02-26 Thread Tim Dudgeon

Well, as I mentioned previously, the big difference arises because from
Python you are iterating through the molecules, calculating the
fingerprints, and then comparing the fingerprints, whereas in the
PostgreSQL cartridge the fingerprints are already generated and indexed,
so the search is mostly a matter of querying the index, which is very
fast.

If you are repeatedly running queries against the same set of molecules
then the cartridge will be the way to go. Doing it procedurally from
Python only really makes sense if you have a relatively small dataset
and/or if the molecules you are searching are different every time.

In principle you should be able to cache the fingerprints in Python to
avoid needing to recalculate them, but effectively you would be
reimplementing logic that is already present in the cartridge, where it
will be much more effective.

Tim
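
The caching idea described above can be sketched in plain Python. This is
only an illustration of "compute the fingerprints once, then query many
times"; it is not how the cartridge works internally. The names
toy_fingerprint, build_cache and search are made up for this sketch, and
toy_fingerprint is a chemically meaningless stand-in so that the example
runs without RDKit. In real code one would presumably pass an RDKit
fingerprint function instead and compare with the RDKit similarity
functions.

```python
import zlib


def toy_fingerprint(smiles, n_bits=1024):
    """Hypothetical stand-in for a real fingerprint (NOT chemically meaningful).

    Hashes each 3-character window of the SMILES string to one bit of an int.
    """
    bits = 0
    for i in range(max(len(smiles) - 2, 1)):
        bits |= 1 << (zlib.crc32(smiles[i:i + 3].encode("utf-8")) % n_bits)
    return bits


def tanimoto(a, b):
    """Tanimoto similarity of two bit vectors stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def build_cache(smiles_list, fp_func=toy_fingerprint):
    """Compute every fingerprint once, up front."""
    return [(smi, fp_func(smi)) for smi in smiles_list]


def search(cache, query_fp, low, high):
    """Return (smiles, similarity) pairs whose similarity lies in [low, high]."""
    hits = []
    for smi, fp in cache:
        sim = tanimoto(query_fp, fp)
        if low <= sim <= high:
            hits.append((smi, sim))
    return hits


library = ["CCO", "CCN", "CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O"]
cache = build_cache(library)        # expensive step, done once
query = toy_fingerprint("CCO")      # only the query is fingerprinted per search
exact_hits = search(cache, query, 1.0, 1.0)
```

Each later search reuses cache, so only the query molecule is
fingerprinted per search. The low/high bounds play the same role as the
"similarity between 0.45 and 0.50" filter in the SQL query quoted below.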

On 26/02/2020 08:46, Deepti Gupta wrote:

Hi Tim,

Thank you!

I'll be more detailed in my post, sorry about that. As this was a PoC,
I had a Spark cluster with 2 worker nodes with 4 vCPUs, 500 GB of disk
and 15 GB of memory on Google Cloud. I timed the response against 2
million data points consisting of ChEMBL IDs and SMILES structures.

Substructure search - 2 mins
Similarity search - 43 mins

The PostgreSQL DB was installed on a VM with 4 vCPUs, 500 GB of disk
and 15 GB of memory. The value of shared_buffers was set to 2048MB
in the postgresql.conf file.

Substructure search - within 5 secs
Similarity search - within 3 secs

I tried to store the converted molecules and fingerprints in a file to
get better performance from the PySpark program but was not
able to do so.

Regards,
DA

On Wednesday, February 26, 2020, 12:57:43 AM GMT+5:30, Tim Dudgeon
wrote:

I think you need to explain what benchmarks you are running and what
is really meant by "faster".
And what hardware (for Spark how many nodes, how big; for PostgreSQL
what size server, what settings esp. the shared_buffers setting).

A very obvious critique of what you reported is that what you describe
as "running in Python" includes generating the fingerprints for each
molecule on the fly, whereas for "the cartridge" these are already
calculated, so will obviously be much faster (as the fingerprint
generation dominates the compute).

Tim

On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:
Hi Gurus,

I'm absolutely new to the cheminformatics domain. I've been assigned a
PoC where I have to compare RDKit in Python and RDKit on PostgreSQL.
I've installed both and am trying some hands-on exercises to
understand the differences. What I've understood is that structure
searches are slower in Python (Spark cluster) than in a PostgreSQL
database. Please correct me if I'm wrong, as I'm a newbie in this and
may be saying something silly.

The similarity search using the functions below (example) -
Python methods -

fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))

similarity = DataStructs.TanimotoSimilarity(fps1, fps2)

takes too long (45 minutes) for a file with 2 million entries, while the
same thing is very quick (seconds) on PostgreSQL.

Database functions -

select count(*) from (
  select modality_id, m,
         tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
  from fingerprints join mols using (modality_id)
) as fps
where similarity between 0.45 and 0.50;

Does this mean that for production workloads one must always use a
database cartridge, such as RDKit, BINGO, etc.?

Regards,
DA


Re: [Rdkit-discuss] Source of the solubility data?

2020-02-26 Thread Greg Landrum

That's a great idea. A pull request with that change would be very welcome.
:-)

On Wed, Feb 26, 2020 at 3:19 AM Gao Zhenting  wrote:

> Hi Greg,
>
> Thanks for the details.
> Would you like to add this note to the GitHub repository (
> https://github.com/rdkit/rdkit/tree/master/Docs/Book/data)? Other
> visitors will then see it.
>
> Best regards
> Zhenting
>
> On Wed, Feb 26, 2020 at 2:48 PM Greg Landrum wrote:
>
>> Hi Zhenting,
>>
>> That's the Huuskonen dataset. The reference is here:
>> https://pubs.acs.org/doi/10.1021/ci9901338
>> The origins of the SDF itself are unfortunately lost in antiquity. I
>> originally got them here:
>> http://cheminformatics.org/datasets/huuskonen/index.html
>> but cheminformatics.org no longer exists. archive.org isn't working at
>> the moment, but when it's back someone could check there to try to figure
>> out who curated the SDF.
>>
>> -greg
>>
>>
>> On Mon, Feb 24, 2020 at 11:53 PM Gao Zhenting 
>> wrote:
>>
>>> Hi Greg,
>>>
>>> I am trying to reproduce some machine learning scripts using
>>>
>>> https://github.com/rdkit/rdkit/blob/master/Docs/Book/data/solubility.test.sdf
>>>
>>>
>>> https://github.com/rdkit/rdkit/blob/master/Docs/Book/data/solubility.train.sdf
>>>
>>>
>>> What is the source of these data? How are they organized? Any citation?
>>>
>>> Best regards
>>> Zhenting
>>>
>>
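
On the question of how the files are organized: the thread does not
describe the fields, but since the files are SDFs, one way to find out is
to list the data fields attached to each record. The following is a
minimal sketch in plain Python that relies only on the general SDF layout
(records end with a "$$$$" line, and data items after "M  END" start with
a "> <NAME>" header line followed by value lines and a blank line). The
function names, the sample record and its field names are invented for
illustration and are not taken from the solubility files.

```python
def parse_record(record):
    """Return (title, {field_name: value}) for the lines of one SDF record."""
    title = record[0].strip() if record else ""
    fields, name, in_data = {}, None, False
    for line in record:
        if line.startswith("M  END"):
            in_data = True                      # data items follow the molblock
        elif in_data and line.startswith(">"):
            start, end = line.find("<"), line.rfind(">")
            name = line[start + 1:end] if 0 <= start < end else None
            if name is not None:
                fields[name] = []
        elif in_data and name is not None:
            if line.strip():
                fields[name].append(line)       # value line of the current field
            else:
                name = None                     # a blank line ends the data item
    return title, {key: "\n".join(value) for key, value in fields.items()}


def read_sdf_records(lines):
    """Yield parsed records from an iterable of SDF lines."""
    record = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("$$$$"):             # record delimiter
            if record:
                yield parse_record(record)
            record = []
        else:
            record.append(line)


# Made-up record with a placeholder (atom-free) molblock and invented fields.
SAMPLE = """demo-record
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <EXAMPLE_VALUE>
-1.23

>  <EXAMPLE_CLASS>
(A) low

$$$$
"""

records = list(read_sdf_records(SAMPLE.splitlines()))
```

Passing an open file handle for one of the .sdf files instead of
SAMPLE.splitlines() and printing the keys of the first few records should
show which fields the files carry.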

Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

2020-02-26 Thread Deepti Gupta via Rdkit-discuss

Hi Tim,

Thank you!

I'll be more detailed in my post, sorry about that. As this was a PoC,
I had a Spark cluster with 2 worker nodes with 4 vCPUs, 500 GB of disk
and 15 GB of memory on Google Cloud. I timed the response against 2
million data points consisting of ChEMBL IDs and SMILES structures.

Substructure search - 2 mins
Similarity search - 43 mins

The PostgreSQL DB was installed on a VM with 4 vCPUs, 500 GB of disk
and 15 GB of memory. The value of shared_buffers was set to 2048MB
in the postgresql.conf file.

Substructure search - within 5 secs
Similarity search - within 3 secs

I tried to store the converted molecules and fingerprints in a file to
get better performance from the PySpark program but was not
able to do so.

Regards,
DA

On Wednesday, February 26, 2020, 12:57:43 AM GMT+5:30, Tim Dudgeon
wrote:

I think you need to explain what benchmarks you are running and what
is really meant by "faster".
And what hardware (for Spark how many nodes, how big; for PostgreSQL
what size server, what settings esp. the shared_buffers setting).

A very obvious critique of what you reported is that what you describe
as "running in Python" includes generating the fingerprints for each
molecule on the fly, whereas for "the cartridge" these are already
calculated, so will obviously be much faster (as the fingerprint
generation dominates the compute).

Tim

On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:

Hi Gurus,

I'm absolutely new to the cheminformatics domain. I've been assigned a
PoC where I have to compare RDKit in Python and RDKit on PostgreSQL.
I've installed both and am trying some hands-on exercises to
understand the differences. What I've understood is that structure
searches are slower in Python (Spark cluster) than in a PostgreSQL
database. Please correct me if I'm wrong, as I'm a newbie in this and
may be saying something silly.

The similarity search using the functions below (example) -
Python methods -

fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))

similarity = DataStructs.TanimotoSimilarity(fps1, fps2)

takes too long (45 minutes) for a file with 2 million entries, while the
same thing is very quick (seconds) on PostgreSQL.

Database functions -

select count(*) from (
  select modality_id, m,
         tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
  from fingerprints join mols using (modality_id)
) as fps
where similarity between 0.45 and 0.50;

Does this mean that for production workloads one must always use a
database cartridge, such as RDKit, BINGO, etc.?

Regards,
DA


Re: [Rdkit-discuss] Source of the solubility data?

2020-02-26 Thread Gao Zhenting

Hi Greg,

Thanks for the details.
Would you like to add this note to the GitHub repository (
https://github.com/rdkit/rdkit/tree/master/Docs/Book/data)? Other visitors
will then see it.

Best regards
Zhenting

On Wed, Feb 26, 2020 at 2:48 PM Greg Landrum wrote:

> Hi Zhenting,
>
> That's the Huuskonen dataset. The reference is here:
> https://pubs.acs.org/doi/10.1021/ci9901338
> The origins of the SDF itself are unfortunately lost in antiquity. I
> originally got them here:
> http://cheminformatics.org/datasets/huuskonen/index.html
> but cheminformatics.org no longer exists. archive.org isn't working at
> the moment, but when it's back someone could check there to try to figure
> out who curated the SDF.
>
> -greg
>
>
> On Mon, Feb 24, 2020 at 11:53 PM Gao Zhenting 
> wrote:
>
>> Hi Greg,
>>
>> I am trying to reproduce some machine learning scripts using
>>
>> https://github.com/rdkit/rdkit/blob/master/Docs/Book/data/solubility.test.sdf
>>
>>
>> https://github.com/rdkit/rdkit/blob/master/Docs/Book/data/solubility.train.sdf
>>
>>
>> What is the source of these data? How are they organized? Any citation?
>>
>> Best regards
>> Zhenting
>>
>
