Ed,  As always, there are no 'one size fits all' solutions; it all depends
on what you need to do. I was processing tens of millions of screening
compounds into a database and used a desalter/desolvator written using the
RDKit C++ API. That was quite quick enough for my needs - I never tried it
with Python so I don't know what sort of performance difference there would
be.

Your pyrrolidinium tosylate example is a good one and again 'it all
depends'. In my case it would be removed completely and I'd be happy with
that. For other purposes it may not be so obvious - maybe you'd want to
keep both fragments in some cases...

Chris

On 29 June 2018 at 12:09, Ed Griffen <ed.grif...@medchemica.com> wrote:

> Chris, Absolutely agree with your points - processing the molecules into
> RDKit is much more robust, but it depends on how many you’ve got to
> process.  If you’re doing millions to billions, then the overhead can
> become a problem and doing it in two steps (lexical then graph) can be the
> pragmatic solution.
>
> Desalting by removing the smallest fragment performs as expected, but take
> pyrrolidinium tosylate: which part of the salt do you want to discard? If
> you don’t know, it’s hard to create a heuristic.
>
> C1CC[NH2+]C1.Cc1ccc(cc1)S([O-])(=O)=O
>
> Ed
>
>
> Dr Ed Griffen,
> Technical Director,
> mobile +44 7762 121593
> office +44 1625 238843
> ed.grif...@medchemica.com
> www.medchemica.com
> skype: ed.griffen
> Twitter: @MedChemica
> Medchemica Ltd is a company registered in England and Wales with company
> number 8162245.
>
> Confidentiality Notice: This message is private and may contain
> confidential, proprietary and legally privileged information. If you have
> received this message in error, please notify us and remove it from your
> system and note that you must not copy, distribute or take any action in
> reliance on it. Any unauthorised use or disclosure of the contents of
> this message is not permitted and may be unlawful.
> Disclaimer: Email messages may be subject to delays, interception,
> non-delivery and unauthorised alterations. Therefore, information expressed
> in this message is not given or endorsed by MedChemica Limited unless
> otherwise notified by an authorised representative independent of this
> message. No contractual relationship is created by this message by any
> person unless specifically indicated by agreement in writing other than
> email.
> Monitoring: MedChemica Limited retains and monitors all email traffic data
> and content for the purposes of the prevention and detection of
> crime, ensuring the security of our computer systems and checking
> compliance with our policies.
>
> On 29 Jun 2018, at 11:59, Chris Earnshaw <cgearns...@gmail.com> wrote:
>
> I'd say that using RDKit to calculate the number of heavy atoms is
> significantly more robust than a purely lexical approach - and it's easy to
> implement.
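
A minimal sketch of that heavy-atom count, assuming RDKit's Python wrappers are installed. The helper name fragment_heavy_atom_counts is made up for illustration, and the input is the pyrrolidinium tosylate SMILES quoted elsewhere in this thread:

```python
from rdkit import Chem


def fragment_heavy_atom_counts(smiles):
    """Return (fragment SMILES, heavy-atom count) for each disconnected fragment."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # asMols=True returns each disconnected fragment as its own Mol object
    frags = Chem.GetMolFrags(mol, asMols=True)
    return [(Chem.MolToSmiles(frag), frag.GetNumHeavyAtoms()) for frag in frags]


counts = fragment_heavy_atom_counts("C1CC[NH2+]C1.Cc1ccc(cc1)S([O-])(=O)=O")
for frag_smiles, n_heavy in counts:
    print(n_heavy, frag_smiles)
```

For this salt the two counts are 5 (pyrrolidinium) and 11 (tosylate), so a "keep the largest" rule would keep the tosylate.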
>
> It's also dangerous to just discard the smallest fragment. Years ago I
> worked on a project where the active molecule had only 11 heavy atoms and
> the counterion (dicyclohexylamine) had 13 - so relying on atom counts can
> sometimes throw the baby out with the bath water. It's much safer
> (but also a lot more work) to build a desalter/desolvator that explicitly
> removes just the fragments you really want to remove.
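
One possible shape for such an explicit desalter, as a rough sketch only. The removal list, the helper names and the 11-heavy-atom test acid (4-chlorophenylacetic acid) are all assumptions for illustration, not the compounds or code from the project described above. Fragments are compared by RDKit canonical SMILES, so a charged form of a counterion would need its own entry in the list:

```python
from rdkit import Chem

# Hypothetical removal list: only fragments named here are ever stripped.
REMOVE_SMILES = ["C1CCCCC1NC1CCCCC1", "O", "[Na+]", "[Cl-]"]


def _canon(smiles):
    """Canonical SMILES, so equivalent inputs compare equal."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))


REMOVE = {_canon(s) for s in REMOVE_SMILES}


def strip_listed_fragments(smiles):
    """Drop only the fragments on the removal list; keep everything else."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    kept = []
    for frag in Chem.GetMolFrags(mol, asMols=True):
        frag_smiles = Chem.MolToSmiles(frag)
        if frag_smiles not in REMOVE:
            kept.append(frag_smiles)
    return ".".join(kept)


# The acid (11 heavy atoms) is smaller than the amine (13) but is still kept.
result = strip_listed_fragments("OC(=O)Cc1ccc(Cl)cc1.C1CCCCC1NC1CCCCC1")
print(result)
```

Anything not on the list survives, so an unexpected counterion shows up in the output instead of silently replacing the parent.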
>
> Best regards,
> Chris
>
> On 29 June 2018 at 09:56, Ed Griffen <ed.grif...@medchemica.com> wrote:
>
>> Using the string length to find the number of atoms in a molecule is OK -
>> but you need to take account of the additional characters in SMILES that
>> are not just atoms, for example:
>>
>> two-letter elements, like silicon and chlorine
>> brackets, ring closures, charges, explicit hydrogens
>>
>> It’s simple to do:
>>
>> Here’s a worked example:
>>
>> >>> SMILES = 'C[S@@+]([O-])c1ccc(cc1)[Si](C)(C)C'
>> >>> print(len(SMILES))
>> 34
>> >>> # characters that are not the first letter of a heavy atom
>> >>> non_atoms = r'()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@'
>> >>> heavies = [char for char in SMILES if char not in non_atoms]
>> >>> print(len(heavies))
>> 13
>>
>> Obviously you do this after splitting on the '.'
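
Putting the split and the character filter together might look like the sketch below. The function names are invented for illustration. The character set is the one from the worked example above:

```python
# Characters that are never counted as a heavy atom in this lexical scheme
# (same set as in the worked example above).
NON_ATOM_CHARS = set(r"()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@")


def lexical_heavy_atoms(fragment):
    """Estimate the heavy-atom count of one SMILES fragment by character filtering."""
    return sum(1 for ch in fragment if ch not in NON_ATOM_CHARS)


def largest_fragment_lexical(smiles):
    """Split on '.' and keep the fragment with the highest estimated count."""
    return max(smiles.split("."), key=lexical_heavy_atoms)


salt = "C1CC[NH2+]C1.Cc1ccc(cc1)S([O-])(=O)=O"
print([lexical_heavy_atoms(frag) for frag in salt.split(".")])
print(largest_fragment_lexical("[Na+].CC"))
```

With this filter '[Na+]' counts as one atom and 'CC' as two, so 'CC' is kept, whereas comparing raw string lengths would keep '[Na+]'. The filter is still only an approximation: with this character set '[Sn]' counts as two atoms and '[Hg]' as none.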
>>
>> Best regards,
>>
>> Ed
>>
>> On 29 Jun 2018, at 06:37, Alfredo Quevedo <maquevedo....@gmail.com>
>> wrote:
>>
>> thank you Hideyoshi for your feedback.
>> regards
>> Alfredo
>>
>> Sent from BlueMail <http://www.bluemail.me/r?b=13187>
>> On 28 June 2018, at 21:43, "藤秀義" <hideyoshif...@gmail.com>
>> wrote:
>>>
>>> Dear Alfredo,
>>>
>>> Although it is based on the length of the SMILES string rather than
>>> strictly on the number of atoms, the simplest way is to use Python
>>> built-in functions as follows:
>>>
>>> smiles = 'CCC.CC'
>>> fragment = max(smiles.split('.'), key=len)
>>> print(fragment)
>>>
>>> Best regards,
>>>
>>> Hideyoshi
>>>
>>>
>>> thank you Paolo for this help, I will study the code and try it,
>>>>
>>>> best regards
>>>>
>>>> Alfredo
>>>>
>>>> Sent from BlueMail <http://www.bluemail.me/r?b=13187>
>>>>
>>>> On 28 June 2018, at 17:08, Paolo Tosco <
>>>> paolo.tosco.m...@gmail.com> wrote:
>>>>
>>>> Dear Alfredo,
>>>>
>>>> if you wish to keep only the largest disconnected fragment you may try
>>>> the following:
>>>>
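>>>> from rdkit.Chem import rdmolops
>>>>
>>>> # split into disconnected fragments, each as its own Mol
>>>> mols = list(rdmolops.GetMolFrags(mol, asMols=True))
>>>> if mols:
>>>>     # largest fragment (by atom count) first
>>>>     mols.sort(reverse=True, key=lambda m: m.GetNumAtoms())
>>>>     mol = mols[0]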
>>>>
>>>> Hope that helps, cheers
>>>> p.
>>>>
>>>> On 06/28/18 19:38, Alfredo Quevedo wrote:
>>>>
>>>>  Good afternoon,
>>>>
>>>>  I would like to filter out small fragments from a list of molecules
>>>>  using the below strategy:
>>>>
>>>>  from rdkit import Chem
>>>>  from rdkit.Chem import AllChem
>>>>  from rdkit.Chem import SaltRemover
>>>>
>>>>  remover=SaltRemover.SaltRemover()
>>>>  mol=Chem.MolFromSmiles('CCC.CC')
>>>>  res=remover.StripMol(mol)
>>>>  print(res.GetNumAtoms())
>>>>
>>>>
>>>>  I am getting 5 atoms as output, so the 'CC' is not being stripped (the
>>>>  script works OK for salts). Is there any way of filtering out small
>>>>  non-salt fragments?
>>>>
>>>>  thank you very much in advance,
>>>>
>>>>  regards,
>>>>
>>>>  Alfredo
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>>
>>>>
>>>>  Check out the vibrant tech community on one of the world's most
>>>>  engaging tech sites, Slashdot.org <http://slashdot.org/>! 
>>>> http://sdm.link/slashdot
>>>>
>>>> ------------------------------
>>>>
>>>>
>>>>  Rdkit-discuss mailing list
>>>>  Rdkit-discuss@lists.sourceforge.net
>>>>  https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>
>>>>
>>>>
>
>
