The mailbox file should probably be opened in binary mode. Binary files do have a .readline() method (returning a bytes object), and bytes objects have a .startswith() method. The tell positions computed this way are even compatible with those used by the text file. So you could do it this way:

- open binary stream
- compute TOC by reading through it using .readline() and .tell()
- rewind (don't close)
- wrap the binary stream in a text stream
- use that for the rest of the code

--Guido

On Tue, Jun 29, 2010 at 10:54 AM, Steve Holden <st...@holdenweb.com> wrote:
> A.M. Kuchling wrote:
>> On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
>>> I will leave the profiler output to speak for itself, since I can find
>>> nothing much to say about it except that there's a hell of a lot of
>>> decoding going on inside mailbox.iterkeys().
>>
>> The problem is actually in _generate_toc(), which is reading through
>> the entire file to figure out where all the 'From' lines that start
>> messages are located.  TextIOWrapper()'s tell() method seems to be
>> very slow, so one help is to only call tell() when necessary; patch:
>>
>> -> svn diff Lib/
>> Index: Lib/mailbox.py
>> ===================================================================
>> --- Lib/mailbox.py      (revision 82346)
>> +++ Lib/mailbox.py      (working copy)
>> @@ -775,13 +775,14 @@
>>          starts, stops = [], []
>>          self._file.seek(0)
>>          while True:
>> -            line_pos = self._file.tell()
>>              line = self._file.readline()
>>              if line.startswith('From '):
>> +                line_pos = self._file.tell()
>>                  if len(stops) < len(starts):
>>                      stops.append(line_pos - len(os.linesep))
>>                  starts.append(line_pos)
>>              elif not line:
>> +                line_pos = self._file.tell()
>>                  stops.append(line_pos)
>>                  break
>>          self._toc = dict(enumerate(zip(starts, stops)))
>>
>> But should mailboxes really be opened in a UTF-8 encoding, or should
>> they be treated as 7-bit text?  I'll have to think about this.
>
> Neither! You can't open them as 7-bit text, because real-world email
> does contain bytes whose ordinal value exceeds 127. You can't open them
> using a text encoding because theoretically there might be ASCII headers
> that indicate that parts of the content are in specific character sets
> or encodings.
>
> If only we had a data structure that easily allowed us to manipulate
> 8-bit characters ...
>
> regards
>  Steve
> --
> Steve Holden           +1 571 484 6266   +1 800 494 3119
> See Python Video!       http://python.mirocommunity.org/
> Holden Web LLC                 http://www.holdenweb.com/
> UPCOMING EVENTS:        http://holdenweb.eventbrite.com/
> "All I want for my birthday is another birthday" -
>      Ian Dury, 1942-2000
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> http://mail.python.org/mailman/options/python-dev/guido%40python.org
>

--
--Guido van Rossum (python.org/~guido)
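
For illustration, the five steps listed at the top of this message (binary stream, TOC via .readline() and .tell(), rewind, wrap in a text stream) could be sketched roughly as below. This is not the actual mailbox.py code: the names build_toc and open_mailbox are hypothetical, the latin-1 default is an assumption chosen only because it can decode every byte value, and the sketch ends each message where the next 'From ' line begins instead of subtracting len(os.linesep) as the quoted patch does.

```python
import io


def build_toc(binary_file):
    """Return {index: (start, stop)} byte offsets for each 'From ' message.

    Hypothetical helper: binary_file is any seekable binary stream.
    """
    starts, stops = [], []
    binary_file.seek(0)
    while True:
        line_pos = binary_file.tell()   # offset of the line about to be read
        line = binary_file.readline()   # returns bytes, so no decoding happens
        if line.startswith(b"From "):
            if len(stops) < len(starts):
                # Simplification: the previous message ends where this one starts.
                stops.append(line_pos)
            starts.append(line_pos)
        elif not line:                  # empty bytes object means end of file
            if len(stops) < len(starts):
                stops.append(line_pos)
            break
    return dict(enumerate(zip(starts, stops)))


def open_mailbox(path, encoding="latin-1"):
    """Open path in binary mode, build the TOC, then wrap it as text.

    Hypothetical helper; the encoding default is an assumption.
    """
    raw = open(path, "rb")              # step 1: open binary stream
    toc = build_toc(raw)                # step 2: compute TOC from bytes
    raw.seek(0)                         # step 3: rewind (don't close)
    # step 4: wrap the same binary stream; newline="" keeps line endings as-is
    text = io.TextIOWrapper(raw, encoding=encoding, newline="")
    return toc, text                    # step 5: use `text` for the rest
```

Calling tell() on the binary stream avoids the TextIOWrapper.tell() cost that the profile pointed at. Whether the recorded offsets can be passed to seek() on the wrapped text stream depends on the compatibility claimed above, and is worth checking for the encoding actually used.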