Here are my thoughts on this:
The RDKit is usually strict when parsing molecules from SDF, SMILES, or
other formats. There is one simple reason for this: it tends to be
difficult or impossible to recover from syntax errors in the input without
a significant chance of producing a result that differs from what the
original writer intended. In this case, as Andrew pointed out elsewhere on
the thread, if Paolo's suggested patch is applied, the molecule will be
loaded with the TESTFIELD property present, but with a value different from
the one in the input. Since people ignore warning messages (again quoting
Andrew), this difference is not going to be noticed most of the time.

There are exceptions to this. For example, the RDKit ignores the limit on
line length while reading SDFs; there's no chance of confusion here, so I
believe it's safe to do so.

I'm planning to accept Paolo's patch once it has been modified so that the
extra blank lines are only accepted when the SDMolSupplier is not in strict
mode. This will allow these files to be parsed if the client/user indicates
that they are willing to take the risk of incorrect data.
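
To make the trade-off concrete, here is a rough Python sketch of a
data-field reader with a strict flag. This is not the RDKit's parser and
not Paolo's patch: parse_data_fields and its strict argument are names made
up for illustration, and the lenient branch simply assumes that text
following a blank line still belongs to the current field.

```python
import re

# A data header line starts with ">" and carries the field name in <...>.
_HEADER = re.compile(r"^>.*<([^>]+)>")


def parse_data_fields(lines, strict=True):
    """Parse the data fields of one SD record into a dict.

    Hypothetical helper for illustration only; not the RDKit's parser.
    """
    props = {}
    name = None          # field currently being read
    value = []           # value lines collected for that field
    pending_blanks = 0   # blank lines seen since the last value line

    for line in lines:
        line = line.rstrip("\r\n")
        if line == "$$$$":           # end of the record
            break
        header = _HEADER.match(line)
        if header:
            if name is not None:
                props[name] = "\n".join(value)
            name, value, pending_blanks = header.group(1), [], 0
        elif line == "":
            pending_blanks += 1      # normally this ends the field
        else:
            if name is None:
                raise ValueError("data line before any field header")
            if pending_blanks:
                if strict:
                    raise ValueError(
                        "text after the blank line ending field %r" % name)
                # Lenient guess: the blank lines were part of the value.
                value.extend([""] * pending_blanks)
                pending_blanks = 0
            value.append(line)

    if name is not None:
        props[name] = "\n".join(value)
    return props


record = [
    ">  <TESTFIELD>",
    "This should not work -> Let's see",
    "",
    "I guess this is not visible",
    "",
    ">  <TESTFIELD2>",
    "Beep",
    "",
    "$$$$",
]
lenient = parse_data_fields(record, strict=False)
```

In this sketch the lenient call returns both fields, but TESTFIELD comes
back without the trailing newline it was set with, which is one small way
the result can differ from what the writer intended. The strict call raises
a ValueError on the same record.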

I still need to put some thought into patching the SDWriter so that it can
recognize things like consecutive line endings in property values. The big
question is what it should do when it encounters such a case. Is that an
error? Should it just write the output up to the blank line?
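
The two options could look roughly like the Python sketch below. It is only
meant to frame the question: check_sd_value and its on_blank argument are
hypothetical names, nothing like this exists in the SDWriter today, and
treating whitespace-only lines as blank is an assumption.

```python
def check_sd_value(value, on_blank="error"):
    """Return a property value that is safe to write as an SD data field.

    Hypothetical helper for illustration only; not part of the RDKit.
    on_blank="error" raises if the value contains a blank line;
    on_blank="truncate" keeps only the text before the first blank line.
    """
    if on_blank not in ("error", "truncate"):
        raise ValueError("unknown on_blank policy: %r" % on_blank)

    lines = value.splitlines()
    for i, line in enumerate(lines):
        # Assumption: a whitespace-only line counts as blank.
        if line.strip() == "":
            if on_blank == "error":
                raise ValueError("blank line inside property value")
            return "\n".join(lines[:i])
    return value  # no blank line found, write the value unchanged


text = "This should not work -> Let's see\n\nI guess this is not visible\n"
truncated = check_sd_value(text, on_blank="truncate")
```

With the TESTFIELD value from this thread, the "truncate" policy keeps only
the first line, which silently drops data; the "error" policy refuses to
write the value at all.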

-greg


On Wed, Apr 29, 2015 at 10:47 AM, Tuomo Kalliokoski <tkall...@live.com>
wrote:

> Hello all,
>
> I have a bunch of SDF files with molecules whose SD tags contain long
> descriptions that include things like "->".
> These files were produced by ChemAxon's software and are handled fine
> by their software.
> Such files can also be written out from RDKit 2014_09_02, but they fail
> when you try to read them in.
>
> Here is some example code:
>
> 1. Generate t.sdf in Python:
>
>   from rdkit import Chem
>   mol = Chem.MolFromSmiles("CC")
>   # The "\n\n" puts a blank line inside the property value.
>   mol.SetProp("TESTFIELD", "This should not work -> Let's see\n\n"
>               "I guess this is not visible\n")
>   mol.SetProp("TESTFIELD2","Beep")
>   mol2 = Chem.MolFromSmiles("CCC")
>   mol2.SetProp("TESTFIELD", "Added another molecule -> Here the same thing\n\n"
>                "I guess this is not visible\n")
>   mol2.SetProp("TESTFIELD2","Beep")
>   w = Chem.SDWriter("t.sdf")
>   w.write(mol)
>   w.write(mol2)
>   w.close()
>
> 2. Trying to read the file in Python fails:
>
>   from rdkit import Chem
>   s = Chem.SDMolSupplier("t.sdf")
>   for mol in s:
>       print(mol.GetProp("TESTFIELD"))
>       # The TESTFIELD text is cropped and TESTFIELD2 is skipped completely,
>       # so the line below will fail:
>       # print(mol.GetProp("TESTFIELD2"))
>
> [10:29:43] ERROR: Problems encountered parsing data fields
> [10:29:43] ERROR: moving to the begining of the next molecule
>
> I guess in this case I will pre-process the files before reading them
> with SDMolSupplier, but I just wanted to point out this special case.
> Apologies if this is old news, but I was unable to find it after a
> quick look.
>
> Best regards,
> Tuomo
>
> ------------------------------------------------------------------------------
> One dashboard for servers and applications across Physical-Virtual-Cloud
> Widest out-of-the-box monitoring support with 50+ applications
> Performance metrics, stats and reports that give you Actionable Insights
> Deep dive visibility with transaction tracing using APM Insight.
> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>