patrick.waldo wrote:

> my sample input file looks like this( not organized,as you see it):
> 200-720-7 69-93-2
> kyselina mocová C5H4N4O3
>
> 200-001-8 50-00-0
> formaldehyd CH2O
>
> 200-002-3
> 50-01-1
> guanidínium-chlorid CH5N3.ClH

Assuming that the records are always separated by blank lines and that only
the third field in a record may contain spaces, the following might work:

import codecs
from itertools import groupby

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"

def fields(s):
    # The first two and the last whitespace-separated tokens are fixed
    # fields; everything in between is the name, which may contain spaces.
    parts = s.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

def records(instream):
    # Group consecutive lines by whether they are blank; each run of
    # non-blank lines is one record.
    for key, group in groupby(instream, unicode.isspace):
        if not key:
            yield "".join(group)

if __name__ == "__main__":
    outstream = codecs.open(path2, 'w', 'utf8')
    for record in records(codecs.open(path, "r", "utf8")):
        outstream.write("|".join(fields(record)))
        outstream.write("\n")

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list
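
The code above relies on the Python 2 `unicode` type. On Python 3 the same idea might look roughly like the sketch below, under the same assumptions (records separated by blank lines, only the third field containing spaces). The `convert` helper and the in-memory `SAMPLE_LINES` list are names made up for this illustration and are not part of the original post; the sample data is the input quoted above.

```python
from itertools import groupby

# Hypothetical in-memory stand-in for the lines of the input file.
SAMPLE_LINES = [
    "200-720-7 69-93-2\n",
    "kyselina mocová C5H4N4O3\n",
    "\n",
    "200-001-8 50-00-0\n",
    "formaldehyd CH2O\n",
    "\n",
    "200-002-3\n",
    "50-01-1\n",
    "guanidínium-chlorid CH5N3.ClH\n",
]


def fields(record):
    """Split one record into (first id, second id, name, formula)."""
    parts = record.split()
    # Only the name (third field) may contain spaces, so it is whatever
    # lies between the first two tokens and the last one.
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]


def records(lines):
    """Yield each run of consecutive non-blank lines as one string."""
    for is_blank, group in groupby(lines, str.isspace):
        if not is_blank:
            yield "".join(group)


def convert(lines):
    """Return one pipe-delimited string per record (hypothetical helper)."""
    return ["|".join(fields(record)) for record in records(lines)]


if __name__ == "__main__":
    for line in convert(SAMPLE_LINES):
        print(line)
```

To read from a real file instead, the list could be replaced by a file object, for example `open(path, encoding="utf-8")`, since `records` only needs an iterable of lines. Note that a completely empty string is not treated as blank by `str.isspace`, so this assumes lines still carry their trailing newline, as they do when iterating over a file.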