Guido van Rossum wrote:
> It should probably be opened in binary mode. Binary files do have a
> .readline() method (returning a bytes object), and bytes objects have
> a .startswith() method. The tell positions computed this way are even
> compatible with those used by the text file. So you could do it this
> way:
>
> - open binary stream
> - compute TOC by reading through it using .readline() and .tell()
> - rewind (don't close)

Because closing is inefficient, or because it breaks the algorithm?

> - wrap the binary stream in a text stream

"wrap" how? The ultimate destiny of the text is twofold:

1) To be stored as some kind of LOB in a database, and

2) Therefrom to be reconstituted and parsed into email.Message objects.

Is the wrapping a one-off operation or a software layer? Sorry, being a
bit dense here, I know.

regards
 Steve

> - use that for the rest of the code
>
> --Guido
>
> On Tue, Jun 29, 2010 at 10:54 AM, Steve Holden <st...@holdenweb.com> wrote:
>> A.M. Kuchling wrote:
>>> On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
>>>> I will leave the profiler output to speak for itself, since I can find
>>>> nothing much to say about it except that there's a hell of a lot of
>>>> decoding going on inside mailbox.iterkeys().
>>> The problem is actually in _generate_toc(), which is reading through
>>> the entire file to figure out where all the 'From' lines that start
>>> messages are located. TextIOWrapper()'s tell() method seems to be
>>> very slow, so one help is to only call tell() when necessary; patch:
>>>
>>> -> svn diff Lib/
>>> Index: Lib/mailbox.py
>>> ===================================================================
>>> --- Lib/mailbox.py (revision 82346)
>>> +++ Lib/mailbox.py (working copy)
>>> @@ -775,13 +775,14 @@
>>>          starts, stops = [], []
>>>          self._file.seek(0)
>>>          while True:
>>> -            line_pos = self._file.tell()
>>>              line = self._file.readline()
>>>              if line.startswith('From '):
>>> +                line_pos = self._file.tell()
>>>                  if len(stops) < len(starts):
>>>                      stops.append(line_pos - len(os.linesep))
>>>                  starts.append(line_pos)
>>>              elif not line:
>>> +                line_pos = self._file.tell()
>>>                  stops.append(line_pos)
>>>                  break
>>>          self._toc = dict(enumerate(zip(starts, stops)))
>>>
>>> But should mailboxes really be opened in a UTF-8 encoding, or should
>>> they be treated as 7-bit text? I'll have to think about this.
>> Neither!
>> You can't open them as 7-bit text, because real-world email
>> does contain bytes whose ordinal value exceeds 127. You can't open them
>> using a text encoding because theoretically there might be ASCII headers
>> that indicate that parts of the content are in specific character sets
>> or encodings.
>>
>> If only we had a data structure that easily allowed us to manipulate
>> 8-bit characters ...
>>
>> regards
>>  Steve

--
Steve Holden
Holden Web LLC http://www.holdenweb.com/
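
A minimal sketch of the steps listed in the quoted message (open a binary stream, compute the TOC with .readline() and .tell(), rewind, wrap), assuming that "wrap" refers to io.TextIOWrapper. The names build_toc, wrap_as_text and SAMPLE are hypothetical and not part of mailbox.py; the latin-1 encoding is an assumption chosen only because every byte value decodes under it; and io.BytesIO stands in for a real file opened with open(path, "rb").

```python
import io


def build_toc(binary_stream):
    """Map message index -> (start, stop) byte offsets of 'From '-delimited messages."""
    starts, stops = [], []
    binary_stream.seek(0)
    while True:
        line_pos = binary_stream.tell()   # offset of the line about to be read
        line = binary_stream.readline()   # bytes, so no decoding happens here
        if line.startswith(b"From "):
            if len(stops) < len(starts):
                stops.append(line_pos)    # previous message ends where this one starts
            starts.append(line_pos)
        elif not line:                    # empty bytes object means end of file
            stops.append(line_pos)
            break
    return dict(enumerate(zip(starts, stops)))


def wrap_as_text(binary_stream, encoding="latin-1"):
    """Rewind the still-open binary stream and layer a text stream on top of it."""
    binary_stream.seek(0)
    # newline="" disables newline translation, so lines come back unchanged.
    return io.TextIOWrapper(binary_stream, encoding=encoding, newline="")


# Hypothetical two-message mailbox containing one byte above 127.
SAMPLE = (
    b"From alice@example.com Tue Jun 29 10:54:00 2010\n"
    b"Subject: one\n\nbody \xe9\n\n"
    b"From bob@example.com Tue Jun 29 11:40:50 2010\n"
    b"Subject: two\n\nbody\n"
)

stream = io.BytesIO(SAMPLE)
toc = build_toc(stream)
text = wrap_as_text(stream)

# Reuse a byte offset from the TOC on the text stream, as the message suggests.
# This relies on CPython accepting a plain byte offset in TextIOWrapper.seek().
start, stop = toc[1]
text.seek(start)
first_line = text.readline()
```

In this sketch the wrapper is a layer over the same open stream, not a one-off copy of its contents, which would explain the advice to rewind and not close: closing the binary stream would leave the text stream with nothing to read from.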