"This is a regular expression problem, rather than a Python problem."

Do you have evidence for this assertion, other than the fact that other
regex implementations have this limitation? Is there a regex
specification somewhere that states that streams aren't supported? Is
there a fundamental reason that streams aren't supported?

"Can the lexing be done on a line-by-line basis?"

For my use case, it unfortunately can't.

On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine <jfine2...@gmail.com> wrote:

> Hi Ram
>
> You wrote:
>
> > I'd like to use the re module to parse a long text file, 1GB in size. I
> > wish that the re module could parse a stream, so I wouldn't have to load
> > the whole thing into memory. I'd like to iterate over matches from the
> > stream without keeping the old matches and input in RAM.
>
> This is a regular expression problem, rather than a Python problem. A
> search for
>     regular expression large file
> brings up some URLs that might help you, starting with
>
> https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very-large-files-is-way-too-slow
>
> This might also be helpful
> https://svn.boost.org/trac10/ticket/11776
>
> What will work for your problem depends on the nature of the problem
> you have. The simplest thing that might work is to iterate over the
> file line-by-line, and use a regular expression to extract matches
> from each line.
>
> In other words, something like (not tested)
>
>     def helper(lines):
>         for line in lines:
>             yield from re.finditer(pattern, line)
>
>     lines = open('my-big-file.txt')
>     for match in helper(lines):
>         # Do your stuff here
>
> Parsing is not the same as lexing, see
> https://en.wikipedia.org/wiki/Lexical_analysis
>
> I suggest you use regular expressions ONLY for the lexing phase. If
> you'd like further help, perhaps first ask yourself this. Can the
> lexing be done on a line-by-line basis? And if not, why not?
>
> If line-by-line is not possible, then you'll have to modify the helper.
> At the end of each line, there'll be a residue / remainder, which
> you'll have to bring into the next line. In other words, the helper
> will have to record (and change) the state that exists at the end of
> each line. A bit like the 'carry' that is used when doing long
> addition.
>
> I hope this helps.
>
> --
> Jonathan
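
For reference, the 'carry' approach described in the quoted message could look roughly like the sketch below. It is only an illustration, not something the re module provides: the name iter_stream_matches and the parameters chunk_size and max_match_len are hypothetical. It assumes that no match is longer than max_match_len characters, that the pattern never matches the empty string, and that the pattern does not depend on text before the match (no lookbehind, ^, or \b at the start). Without an upper bound on match length, the buffer could grow to the size of the whole input.

```python
import io
import re


def iter_stream_matches(stream, pattern, chunk_size=64 * 1024, max_match_len=1024):
    """Yield matched strings from a text stream, reading it chunk by chunk.

    Hypothetical helper. Assumes every match is at most max_match_len
    characters long and that the pattern never matches the empty string.
    """
    regex = re.compile(pattern)
    buffer = ""  # the 'carry': text that may still belong to a later match
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buffer += chunk
        # A match ending after safe_end might still change once more text
        # arrives, so it is deferred until the next chunk has been read.
        safe_end = len(buffer) if at_eof else max(len(buffer) - max_match_len, 0)
        keep_from = safe_end
        for match in regex.finditer(buffer):
            if match.end() > safe_end:
                # Keep the possibly incomplete match in the carry.
                keep_from = min(match.start(), safe_end)
                break
            yield match.group(0)
        if at_eof:
            return
        # Drop everything that has been fully processed.
        buffer = buffer[keep_from:]


if __name__ == "__main__":
    demo = io.StringIO("foo 123 bar 4567 baz 89")
    for text in iter_stream_matches(demo, r"\d+", chunk_size=4, max_match_len=8):
        print(text)
```

With a real file, one would pass the object returned by open('my-big-file.txt') as the stream. Memory use then stays near chunk_size + max_match_len characters for as long as the length assumption holds.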
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/