Wow, thank you all. All three work. To output correctly I needed to add: output.write("\r\n")
This is really a great help!! Because of my limited Python knowledge, I will need to try to figure out exactly how they work for future text manipulation and for my own knowledge. Could you recommend some resources for this kind of text manipulation? Also, I conceptually get it, but would you mind walking me through

> for tok in tokens:
>     if NR_RE.match(tok) and len(chem) >= 4:
>         chem[2:-1] = [' '.join(chem[2:-1])]
>         yield chem
>         chem = []
>     chem.append(tok)

and

> for key, group in groupby(instream, unicode.isspace):
>     if not key:
>         yield "".join(group)

Thanks again,
Patrick

On Oct 15, 2:16 pm, Peter Otten <[EMAIL PROTECTED]> wrote:
> patrick.waldo wrote:
> > my sample input file looks like this( not organized,as you see it):
> > 200-720-7 69-93-2
> > kyselina mocová C5H4N4O3
> >
> > 200-001-8 50-00-0
> > formaldehyd CH2O
> >
> > 200-002-3
> > 50-01-1
> > guanidínium-chlorid CH5N3.ClH
>
> Assuming that the records are always separated by blank lines and only the
> third field in a record may contain spaces the following might work:
>
> import codecs
> from itertools import groupby
>
> path = "c:\\text_samples\\chem_1_utf8.txt"
> path2 = "c:\\text_samples\\chem_2.txt"
>
> def fields(s):
>     parts = s.split()
>     return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]
>
> def records(instream):
>     for key, group in groupby(instream, unicode.isspace):
>         if not key:
>             yield "".join(group)
>
> if __name__ == "__main__":
>     outstream = codecs.open(path2, 'w', 'utf8')
>     for record in records(codecs.open(path, "r", "utf8")):
>         outstream.write("|".join(fields(record)))
>         outstream.write("\n")
>
> Peter
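
For anyone following the thread, here is a rough, self-contained sketch of the two quoted loops run against the sample data from the original question. It is written for Python 3, so str.isspace stands in for unicode.isspace. The NR_RE pattern, the chemicals() wrapper name, and the final "flush the last record" step are assumptions, since the post that defined them is not quoted here; treat this as an illustration of the idea rather than the exact code from the thread.

```python
import re
from itertools import groupby

SAMPLE = """\
200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH
"""

def records(lines):
    # groupby collects *consecutive* lines that share the same key.
    # str.isspace is True for a blank line ("\n") and False for a data
    # line, so the groups alternate: record, separator, record, ...
    for is_blank, group in groupby(lines, str.isspace):
        if not is_blank:
            # Glue the lines of one record back into a single string.
            yield "".join(group)

def fields(record):
    # First two and last tokens are fixed fields; whatever sits in
    # between is the (possibly multi-word) name.
    parts = record.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

# ASSUMPTION: a hypothetical pattern for tokens like 200-720-7 or 69-93-2.
NR_RE = re.compile(r"^\d+-\d+-\d+$")

def chemicals(tokens):
    chem = []
    for tok in tokens:
        # A number-like token starts a new record, but only once the
        # current record is complete (>= 4 tokens). That guard keeps the
        # second number of a record from being mistaken for a new start.
        if NR_RE.match(tok) and len(chem) >= 4:
            # Slice assignment: replace the middle tokens by one joined name.
            chem[2:-1] = [" ".join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    # ASSUMPTION: emit the last record too (not shown in the quoted fragment).
    if chem:
        chem[2:-1] = [" ".join(chem[2:-1])]
        yield chem

lines = SAMPLE.splitlines(keepends=True)
by_blank_lines = [fields(r) for r in records(lines)]
by_tokens = [tuple(c) for c in chemicals(SAMPLE.split())]

for row in by_blank_lines:
    print("|".join(row))
```

On this sample both approaches should produce the same three four-field rows. The groupby version relies on blank lines between records, while the token version relies only on the shape of the first number in each record.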