On Apr 14, 2022, at 12:57, Ivan Tubert-Brohman 
<ivan.tubert-broh...@schrodinger.com> wrote:
> How about splitting the file on lines consisting of "$$$$", and then parsing 
> each record? If the parsing fails, you can write out the bad record for 
> future inspection. (This addresses the basic use case, but not the "even 
> better" one.)

Yes, if you know your data is "clean", then you can do that.

I wrote an essay at
  http://www.dalkescientific.com/writings/diary/archive/2020/09/18/handling_the_sdf_record_delimiter.html
about some of the ways that approach can cause problems.

These cases do occur in real-world data sets, and they do cause problems in some 
processing pipelines.

Public data sets like PubChem, ChEMBL, etc. don't have these problems; they show up 
mostly in in-house data sets, and even there they are not common.

> def read_record(fh):
>     lines = []
>     for line in fh:
>         lines.append(line)
>         if line.rstrip() == '$$$$':
>             return ''.join(lines)

See also 
https://baoilleach.blogspot.com/2020/05/python-patterns-for-processing-large.html
 .
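One concrete example of the kind of edge case I mean, using a contrived
two-record file: if the last record is missing its closing "$$$$" line, the
read_record() quoted above falls off the end of the loop and returns None, so
that record is silently lost.

import io

# Contrived data: the second record has no closing "$$$$" line.
sdf_text = (
    "first\n...molfile lines...\nM  END\n$$$$\n"
    "second\n...molfile lines...\nM  END\n"
)

fh = io.StringIO(sdf_text)
print(repr(read_record(fh)))  # the first record, ending with "$$$$\n"
print(repr(read_record(fh)))  # None -- the truncated last record disappears

That's the sort of thing which is easy to overlook until it bites.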

The reasons I think there should be a low-level library for this sort of work 
are:

1) the edge cases are tricky to handle,

2) the simple readers like this are slow, and

3) I believe good error reporting needs things like the starting line number 
and/or starting byte position for the record. Implementing that is a bit tricky 
(and boring), and tracking that information in a compiled extension has a much 
lower overhead than doing it in Python.
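
As a sketch of what I mean by (3) -- plain Python, not an existing library API,
and assuming the file is opened in binary mode so byte offsets are meaningful --
here is a generator that carries the starting line number and byte offset along
with each record, and which also yields a trailing record that has no closing
"$$$$" so the caller can complain about it:

def read_records_with_location(fh):
    # fh must be a binary-mode file object; offsets are in bytes.
    lines = []
    start_lineno = 1
    start_offset = 0
    lineno = 0
    offset = 0
    for line in fh:
        lineno += 1
        offset += len(line)
        lines.append(line)
        if line.rstrip() == b"$$$$":
            yield b"".join(lines), start_lineno, start_offset
            lines = []
            start_lineno = lineno + 1
            start_offset = offset
    if lines:
        # Incomplete final record: hand it back so it can be reported.
        yield b"".join(lines), start_lineno, start_offset

# with open("example.sdf", "rb") as fh:
#     for record, lineno, offset in read_records_with_location(fh):
#         ...  # on a parse failure, report lineno/offset for the bad record

Even this much bookkeeping adds per-line overhead in Python, which is part of
why I think it belongs in a compiled extension.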


Cheers,


                                Andrew
                                da...@dalkescientific.com



