On Apr 14, 2022, at 12:57, Ivan Tubert-Brohman <ivan.tubert-broh...@schrodinger.com> wrote:

> How about splitting the file on lines consisting of "$$$$", and then parsing
> each record? If the parsing fails, you can write out the bad record for
> future inspection. (This addresses the basic use case, but not the "even
> better" one.)
Yes, if you know your data is "clean", then you can do that. I wrote an essay at
http://www.dalkescientific.com/writings/diary/archive/2020/09/18/handling_the_sdf_record_delimiter.html
about some of the ways that approach can cause problems. They do occur in
real-world data sets, and they do cause problems in some processing pipelines.
Public data sets like PubChem, ChEMBL, etc. don't have these problems; they
show up mostly in in-house data sets, and even there it's not common to hit one.

> def read_record(fh):
>     lines = []
>     for line in fh:
>         lines.append(line)
>         if line.rstrip() == '$$$$':
>             return ''.join(lines)

See also https://baoilleach.blogspot.com/2020/05/python-patterns-for-processing-large.html .

The reasons I think there should be a low-level library for this sort of work are:

  1) the edge cases are tricky to handle,
  2) simple readers like this are slow, and
  3) I believe good error reporting needs things like the starting line number
     and/or starting byte position for the record. Implementing that is a bit
     tricky (and boring), and tracking that information in a compiled extension
     has a much lower overhead than doing it in Python.

Cheers,

Andrew
da...@dalkescientific.com
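P.S. To make #3 concrete, here's a rough, untested sketch of the kind of
pure-Python bookkeeping I mean. The function name is just a placeholder, and
the file object is assumed to be opened in binary mode ("rb") so the offsets
are real byte positions:

    def read_records_with_location(fh):
        # Yield (record, start_line_number, start_byte_offset) for each
        # SDF record in a file object opened in binary mode.
        lines = []
        start_lineno = 1
        start_offset = 0
        lineno = 0
        offset = 0
        for line in fh:
            if not lines:
                # This line starts a new record; remember where it begins.
                start_lineno = lineno + 1
                start_offset = offset
            lineno += 1
            offset += len(line)
            lines.append(line)
            if line.rstrip() == b"$$$$":
                yield b"".join(lines), start_lineno, start_offset
                lines = []
        if lines:
            # The final record is missing its "$$$$" terminator; report it
            # anyway and let the caller decide whether that's an error.
            yield b"".join(lines), start_lineno, start_offset

A caller can then report a failure as "record starting at line N / byte M" and,
as Ivan suggested, write the bad record out for later inspection. That per-line
bookkeeping is exactly the kind of overhead that is much cheaper to do in a
compiled extension.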