On Mon, 25 Apr 2005 16:01:45 +0100, Robin Becker <[EMAIL PROTECTED]> wrote:
>Is there any way to get regexes to work on non-string/unicode objects? I would
>like to split large files by regex and it seems relatively hard to do so
>without having the whole file in memory. Even with buffers it seems hard to get
>regexes to indicate that they failed because of buffer termination, and getting
>a partial match to be resumable seems out of the question.
>
>What interface does re actually need for its src objects?

ISTM splitting is a special situation where you can easily chunk through a file
and split as you go, since if splitting the current buffer succeeds, you can be
sure that all but the tail piece is valid[1]. So you can make an iterator that
yields all but the last piece, then sets the buffer to last+newchunk, and goes on
until there are no more chunks, at which point the tail part will be a valid
split piece. E.g., (not tested beyond what you see ;-)

 >>> def frxsplit(path, rxo, chunksize=8192):
 ...     buffer = ''
 ...     for chunk in iter((lambda f=open(path): f.read(chunksize)),''):
 ...         buffer += chunk
 ...         pieces = rxo.split(buffer)
 ...         for piece in pieces[:-1]: yield piece
 ...         buffer = pieces[-1]
 ...     yield buffer
 ...
 >>> import re
 >>> rxo = re.compile('XXXXX')

The test file:

 >>> print '----\n%s----'%open('tsplit.txt').read()
 ----
 This is going to be split on five X's
 like XXXXX but we will use a buffer of
 XXXXX length 2 to force buffer appending.
 We'll try a splitter at the end: XXXXX
 ----
 >>> for piece in frxsplit('tsplit.txt', rxo, 2): print repr(piece)
 ...
 "This is going to be split on five X's\nlike "
 ' but we will use a buffer of\n'
 " length 2 to force buffer appending.\nWe'll try a splitter at the end: "
 '\n'
 >>> rxo = re.compile('(XXXXX)')
 >>> for piece in frxsplit('tsplit.txt', rxo, 2): print repr(piece)
 ...
 "This is going to be split on five X's\nlike "
 'XXXXX'
 ' but we will use a buffer of\n'
 'XXXXX'
 " length 2 to force buffer appending.\nWe'll try a splitter at the end: "
 'XXXXX'
 '\n'

[1] In some cases of regexes with lookahead context, you might have to check
that the last piece not only exists but exceeds max lookahead length, in case
there is a <withlookahead>|<plain> kind of thing in the regex where
<withlookahead> would have succeeded with another chunk appended to buffer, but
<plain> did the split.

Regards,
Bengt Richter
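
For what it's worth, the precaution in [1] can be sketched as a variant that refuses to trust any match ending too close to the end of the buffer until more data has arrived. This is only a rough sketch in Python 3 syntax, not the frxsplit above: the name iter_split and the holdback parameter are made up for illustration, it reads from an already-open text stream rather than a path, it assumes the pattern never matches the empty string, and unlike split() it does not yield captured groups. Choosing holdback is left to the caller; for a pattern whose matches have a bounded length, that length plus any lookahead should be a safe value.

```python
import io
import re

def iter_split(stream, pattern, chunksize=8192, holdback=0):
    """Yield the text between matches of `pattern`, reading `stream` in chunks.

    `holdback` (hypothetical parameter) is how many trailing characters of the
    buffer are treated as undecided until more data or end of input arrives.
    """
    buf = ''
    eof = False
    while not eof:
        chunk = stream.read(chunksize)
        eof = not chunk          # an empty read means end of input
        buf += chunk
        # Before end of input, ignore matches that end inside the held-back tail.
        limit = len(buf) if eof else len(buf) - holdback
        start = 0
        for m in pattern.finditer(buf):
            if m.end() > limit:
                break            # might still change once the next chunk is appended
            yield buf[start:m.start()]
            start = m.end()
        buf = buf[start:]        # keep the undecided tail for the next round
    yield buf                    # whatever follows the last separator

# With a greedy separator, a run of X's can straddle a chunk boundary:
pieces = list(iter_split(io.StringIO('aXXXb'), re.compile('X+'),
                         chunksize=2, holdback=1))
print(pieces)  # ['a', 'b']
```

With holdback=0 this behaves like the simple buffer-and-split loop, which on the same input and chunk size would report an extra empty piece because the run of X's gets split across two reads.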