> lines = open('your_file.txt').readlines()[:4]
> print lines
> print map(len, lines)

gave me:

    ['\xef\xbb\xbf200-720-7 69-93-2\n', 'kyselina mo\xc4\x8dov \xc3\xa1 C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
    [28, 32, 1, 18]

I think it means that I'm still at option 3. I got the line-by-line part. My code is a lot cleaner now:

    import codecs

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r','utf8')
    output = codecs.open(path2, 'w', 'utf8')

    for line in input:
        tokens = line.strip().split()
        tokens[2:-1] = [u' '.join(tokens[2:-1])]   #this doesn't seem to combine the files correctly
        file = u'|'.join(tokens)                   #this does put '|' in between
        print file + u'\n'
        output.write(file + u'\r\n')

    input.close()
    output.close()

My sample input file looks like this (not organized, as you see it):

    200-720-7 69-93-2 kyselina mocová C5H4N4O3 200-001-8 50-00-0 formaldehyd CH2O 200-002-3 50-01-1 guanidínium-chlorid CH5N3.ClH etc...

and after the program I get:

    200-720-7|69-93-2| kyselina|mocová||C5H4N4O3 200-001-8|50-00-0| formaldehyd|CH2O| 200-002-3| 50-01-1| guanidínium-chlorid|CH5N3.ClH| etc...

So, I am sort of back at the start again. If I add:

    tokens = line.strip().split()
    for token in tokens:
        print token

I get all the single tokens, which I thought I could then put together, except when I did:

    for token in tokens:
        s = u'|'.join(token)
        print s

I got ?|2|0|0|-|7|2|0|-|7, etc...

How can I join these together into nice neat little lines? When I try to store the tokens in a list, the tokens double and I don't know why. I can work on getting the chemical names together after...baby steps, or maybe I am just missing something obvious.

The first two numbers will always follow the same pattern: three digits-three digits-one digit, and then two digits-two digits-one digit. This seems to be the only pattern.

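As a rough sketch of how that pattern could drive the grouping (not tested against the real file): treat the whole file as one stream of whitespace-separated tokens, start a new record whenever a token looks like the first number, and join whatever sits between the second number and the last token as the chemical name. The names `EINECS_RE`, `format_record` and `records` are made up for illustration. The sketch assumes every record has both numbers, a name and a formula, and that the formula is always a single token.

```python
import re

# First-number pattern as described above: 3 digits-3 digits-1 digit.
EINECS_RE = re.compile(r'^\d{3}-\d{3}-\d$')


def format_record(tokens):
    """Join one record's tokens as number|number|name|formula.

    Assumption: tokens[0] and tokens[1] are the two numbers, the last
    token is the formula, and everything in between is the name.
    """
    if len(tokens) < 4:
        raise ValueError('incomplete record: %r' % (tokens,))
    name = u' '.join(tokens[2:-1])
    return u'|'.join([tokens[0], tokens[1], name, tokens[-1]])


def records(text):
    """Yield one '|'-separated line per record found in text."""
    current = []
    # Drop a leading byte-order mark, then split on any run of whitespace.
    for token in text.lstrip(u'\ufeff').split():
        if EINECS_RE.match(token) and current:
            # A new first number closes the record collected so far.
            yield format_record(current)
            current = []
        current.append(token)
    if current:
        yield format_record(current)


sample = (u'\ufeff200-720-7 69-93-2\n'
          u'kyselina mo\u010dov\xe1 C5H4N4O3\n'
          u'\n'
          u'200-001-8\t50-00-0\n'
          u'formaldehyd CH2O\n')

for line in records(sample):
    print(line)
```

Because the split ignores line breaks, it should not matter whether the numbers and the name arrive on one line or several. A name that itself matches the number pattern would break the grouping.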
My intuition tells me that I need to add an if statement that says: if the first two numbers follow the pattern, then continue; if they don't (i.e. a chemical name was accidentally split apart), then the third entry needs to be put together. Something like

    if tokens[1] and tokens[2] startswith('pattern') == true
        tokens[2] = join(tokens[2]:tokens[3])
        token[3] = token[4]
        del token[4]

but the code isn't right...any ideas?

Again, thanks so much. I've gone to http://gnosis.cx/TPiP/ and I have a couple of O'Reilly books, but they don't seem to have a straightforward example for this kind of text manipulation.

Patrick

On Oct 14, 11:17 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Oct 14, 11:48 pm, [EMAIL PROTECTED] wrote:
> > Hi all,
> >
> > I started Python just a little while ago and I am stuck on something
> > that is really simple, but I just can't figure out.
> >
> > Essentially I need to take a text document with some chemical
> > information in Czech and organize it into another text file. The
> > information is always EINECS number, CAS, chemical name, and formula
> > in tables. I need to organize them into lines with | in between. So
> > it goes from:
> >
> > 200-763-1 71-73-8
> > nátrium-tiopentál C11H18N2O2S.Na
> >
> > to:
> >
> > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
> >
> > but if I have a chemical like: kyselina močová
> >
> > I get:
> > 200-720-7|69-93-2|kyselina|močová
> > |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
> >
> > and then it is all off.
> >
> > How can I get Python to realize that a chemical name may have a space
> > in it?
>
> Your input file could be in one of THREE formats:
> (1) fields are separated by TAB characters (represented in Python by
> the escape sequence '\t', and equivalent to '\x09')
> (2) fields are fixed width and padded with spaces
> (3) fields are separated by a random number of whitespace characters
> (and can contain spaces).
>
> What makes you sure that you have format 3?
> You might like to try something like
> lines = open('your_file.txt').readlines()[:4]
> print lines
> print map(len, lines)
> This will print a *precise* representation of what is in the first
> four lines, plus their lengths. Please show us the output.

--
http://mail.python.org/mailman/listinfo/python-list