Thank you Nils. In fact I do want the sanitize + parse to happen, and I do some further checks on the molecules, too (ChEMBL pipeline etc). The issue is that whatever does not pass the initial steps just completely disappears and cannot be reported or inspected in any way.
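One way to keep the failed records from disappearing, as a minimal sketch only: read the SD file as text, split it on the `$$$$` record terminator, and hand each raw record to a parser separately, so that the text of any record that comes back as None is still available to write out. The names `iter_sdf_records`, `save_failed_records`, `parse` and `on_success` are hypothetical and are not part of RDKit; the parser is passed in as a callable so the sketch itself does not depend on any particular RDKit call.

```python
import io
from typing import Callable, Iterator, List, Optional, TextIO


def iter_sdf_records(handle: TextIO) -> Iterator[str]:
    """Yield each SD record as raw text, including its '$$$$' terminator line."""
    buffer: List[str] = []
    for line in handle:
        buffer.append(line)
        if line.startswith("$$$$"):
            yield "".join(buffer)
            buffer = []
    # A last record without a terminator is still reported, not dropped.
    if any(part.strip() for part in buffer):
        yield "".join(buffer)


def save_failed_records(
    handle: TextIO,
    parse: Callable[[str], Optional[object]],
    failed_out: TextIO,
    on_success: Optional[Callable[[int, object], None]] = None,
) -> List[int]:
    """Write the raw text of every record that `parse` rejects to `failed_out`.

    `parse` is a hypothetical callable that returns None (or raises) for a
    bad record. Returns the zero-based indices of the failed records.
    """
    failed_indices: List[int] = []
    for index, record in enumerate(iter_sdf_records(handle)):
        try:
            mol = parse(record)
        except Exception:
            mol = None
        if mol is None:
            failed_out.write(record if record.endswith("\n") else record + "\n")
            failed_indices.append(index)
        elif on_success is not None:
            on_success(index, mol)  # e.g. pickle the molecule here
    return failed_indices


if __name__ == "__main__":
    # Stand-in parser for the demo: rejects any record starting with "bad".
    source = io.StringIO("good\nM  END\n$$$$\nbad\nM  END\n$$$$\n")
    rejected = io.StringIO()
    print(save_failed_records(source, lambda text: None if text.startswith("bad") else text, rejected))
```

With RDKit installed, `parse` could be something like `Chem.MolFromMolBlock`, on the assumption that it reads only the molblock part of the record and does not attach the SD data items; that assumption should be checked against the RDKit version in use. Records are streamed one at a time, so memory use stays small even for large files.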
Indeed, writing a custom SDF parser would be one option, as an SDF is just text and is rigidly structured by its very definition; I was only hoping someone had already written such a parser :)
For now I will just output the indices of the failed records; the user will then have to open them in another application for inspection.

Thanks
Giovanni

-----Original Message-----
From: Nils Weskamp <nils.wesk...@gmail.com>
Sent: 13 April 2022 22:55
To: Giovanni Tricarico <giovanni.tricar...@glpg.com>; rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?

Hello Giovanni,

have you tried using the ForwardSDMolSupplier with sanitize = False and / or strictParsing = False? This should at least reduce the number of cases where molecules are not accepted. You would then have to sanitize the structures yourself afterwards and handle possible errors explicitly.

If that doesn't solve your problem, I would consider writing my own parser that just ignores everything looking like a CTAB.

Hope this helps,
Nils

On 13.04.2022 at 18:15, Giovanni Tricarico wrote:
> Hello,
>
> I am using rdkit to read data from SD files.
>
> My goal is to extract both the molecules and their associated properties (which for our purposes are separate entities) from the SDF.
>
> [For 100% clarity: by 'properties' I don't mean calculated properties or atom or bond properties, but the text properties that were saved in the SDF with each molecule, i.e. those that you get when you do mol.GetPropsAsDict() ].
>
> After several tests I found that Chem.ForwardSDMolSupplier does what I need.
>
> But there is an issue.
>
> When Chem.ForwardSDMolSupplier decides that a molecule is not OK, i.e.
> when it says it is None, the SDF record is lost:
>
> I cannot access its Props; I cannot save the failed SDF record for later inspection.
>
> [Or at least, I don't know how to do it, hence this question].
>
> At most I can collect the indices of the records that fail.
>
> Would anyone be able to suggest how to save to a text file (which an SDF essentially already is) the SDF records for which Chem.ForwardSDMolSupplier returns a None?
>
> Even better, could the properties associated to the failed molecules be read independently? In theory the properties are in a separate part of the CTAB, so even when the atoms, bonds, etc. have a problem, the properties might still be OK.
>
> (Note: PandasTools.LoadSDF gives the same issue, it does not even store in the DataFrame the records for which the molecule is None, and in any case it cannot be used with the kind of SDF's I am handling, as it uses an enormous amount of memory for the molecules - hence the decision to use Chem.ForwardSDMolSupplier and pickle the molecules as soon as they are read).
>
> Thanks
>
> This e-mail and its attachment(s) (if any) may contain confidential and/or proprietary information and is intended for its addressee(s) only. Any unauthorized use of the information contained herein (including, but not limited to, alteration, reproduction, communication, distribution or any other form of dissemination) is strictly prohibited.
> If you are not the intended addressee, please notify the originator promptly and delete this e-mail and its attachment(s) (if any) subsequently. Neither Galapagos nor any of its affiliates shall be liable for direct, special, indirect or consequential damages arising from alteration of the contents of this message (by a third party) or as a result of a virus being passed on.
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
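On the question raised in the thread of whether the properties of a failed record can be read independently of the molecule: in an SD record the data items follow the `M  END` line, each as a header line starting with `>` that carries the field name in angle brackets, followed by value lines and closed by a blank line. A minimal sketch of a text-only reader, assuming that layout, is below. The function name `parse_sdf_properties` is hypothetical, values are returned as plain strings (no type conversion as in `GetPropsAsDict()`), and headers without a `<name>` are skipped.

```python
import re
from typing import Dict, List, Optional

# Field name inside angle brackets on a data header line, e.g. "> <ID>".
_FIELD_NAME = re.compile(r"<([^>]*)>")


def parse_sdf_properties(record: str) -> Dict[str, str]:
    """Read the data items of one raw SD record without parsing the CTAB."""
    lines = record.splitlines()

    # Start after 'M  END' when present; otherwise scan the whole record,
    # since a broken CTAB may lack its terminator.
    start = 0
    for i, line in enumerate(lines):
        if line.startswith("M  END"):
            start = i + 1
            break

    props: Dict[str, str] = {}
    name: Optional[str] = None
    value_lines: List[str] = []
    for line in lines[start:]:
        if line.startswith("$$$$"):
            break
        if name is None:
            # Looking for the next data header line.
            if line.startswith(">"):
                match = _FIELD_NAME.search(line)
                if match:
                    name = match.group(1)
                    value_lines = []
        elif line.strip() == "":
            # A blank line closes the current data item.
            props[name] = "\n".join(value_lines)
            name = None
        else:
            value_lines.append(line)

    # Keep a final item that was not closed by a blank line.
    if name is not None:
        props[name] = "\n".join(value_lines)
    return props
```

Applied to the raw text of each failed record, this would give a dictionary of its text properties even when the atom and bond blocks cannot be parsed.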