Wow, thank you all.  All three work. To output correctly I needed to
add:

output.write("\r\n")

This is really a great help!!

Because of my limited Python knowledge, I will need to work out exactly
how they work, both for future text manipulation and for my own
knowledge.  Could you recommend some resources for this kind of text
manipulation?  Also, I get it conceptually, but would you mind walking
me through

> for tok in tokens:
>         if NR_RE.match(tok) and len(chem) >= 4:
>             chem[2:-1] = [' '.join(chem[2:-1])]
>             yield chem
>             chem = []
>         chem.append(tok)

and

> for key, group in groupby(instream, unicode.isspace):
>         if not key:
>             yield "".join(group)
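
One way to see what the groupby loop is doing is to run it on a few lines
by hand. This is only a rough sketch: it is written for Python 3, so
str.isspace stands in for unicode.isspace, and the list "lines" is a
made-up stand-in for the open file, using the sample records quoted
further down.

```python
from itertools import groupby

def records(instream):
    # groupby bundles *consecutive* lines that share the same key.
    # The key is True for whitespace-only lines and False for lines with text,
    # so each run of non-blank lines comes out as one group.
    for key, group in groupby(instream, str.isspace):
        if not key:
            # Glue the lines of one record back into a single string.
            yield "".join(group)

# Hypothetical stand-in for the file: two records separated by a blank line.
lines = [
    "200-720-7        69-93-2\n",
    "kyselina mocová      C5H4N4O3\n",
    "\n",
    "200-002-3\n",
    "50-01-1\n",
    "guanidínium-chlorid      CH5N3.ClH\n",
]

# The raw grouping: (key, number of lines in the group).
grouping = [(key, len(list(group))) for key, group in groupby(lines, str.isspace)]
print(grouping)

recs = list(records(lines))
for rec in recs:
    print(rec.split())
```

So the blank lines only act as separators (key is True and the group is
skipped), and each record arrives as one string however many lines it
was spread over.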

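The first loop (the one with NR_RE) can be traced the same way. Several
things here are assumptions and not taken from the posted code: the
pattern for NR_RE is a guess at "three groups of digits joined by
hyphens", the wrapper function name "chemicals" is made up, and the
flush after the loop is my reading of what has to happen to the last
record.

```python
import re

# Assumption: NR_RE is not shown in the snippet; this pattern is a guess that
# matches numbers such as "200-720-7" or "69-93-2".
NR_RE = re.compile(r"\d+-\d+-\d+$")

def chemicals(tokens):
    chem = []
    for tok in tokens:
        # A number token only starts a new record once the current one
        # already holds at least 4 items (two numbers, a name, a formula).
        # The second number of a record also matches NR_RE, but at that
        # point chem has just 1 item, so nothing is yielded.
        if NR_RE.match(tok) and len(chem) >= 4:
            # Slice assignment: replace all the name words between the two
            # numbers and the formula with one space-joined string.
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    # Assumption: the last record is never followed by another number,
    # so it has to be emitted after the loop ends.
    if len(chem) >= 4:
        chem[2:-1] = [' '.join(chem[2:-1])]
        yield chem

text = """200-720-7        69-93-2
kyselina mocová      C5H4N4O3

200-001-8       50-00-0
formaldehyd      CH2O
"""

result = list(chemicals(text.split()))
for chem in result:
    print("|".join(chem))
```

The key step is chem[2:-1] = [...]: it shrinks a list like
['200-720-7', '69-93-2', 'kyselina', 'mocová', 'C5H4N4O3'] to exactly
four fields, whatever the number of words in the name.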

Thanks again,
Patrick



On Oct 15, 2:16 pm, Peter Otten <[EMAIL PROTECTED]> wrote:
> patrick.waldo wrote:
> > my sample input file looks like this (not organized, as you see it):
> > 200-720-7        69-93-2
> > kyselina mocová      C5H4N4O3
>
> > 200-001-8       50-00-0
> > formaldehyd      CH2O
>
> > 200-002-3
> > 50-01-1
> > guanidínium-chlorid      CH5N3.ClH
>
> Assuming that the records are always separated by blank lines and only the
> third field in a record may contain spaces the following might work:
>
> import codecs
> from itertools import groupby
>
> path = "c:\\text_samples\\chem_1_utf8.txt"
> path2 = "c:\\text_samples\\chem_2.txt"
>
> def fields(s):
>     parts = s.split()
>     return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]
>
> def records(instream):
>     for key, group in groupby(instream, unicode.isspace):
>         if not key:
>             yield "".join(group)
>
> if __name__ == "__main__":
>     outstream = codecs.open(path2, 'w', 'utf8')
>     for record in records(codecs.open(path, "r", "utf8")):
>         outstream.write("|".join(fields(record)))
>         outstream.write("\n")
>
> Peter

