Re: [R] Scanning grep through huge files

Duncan Murdoch Tue, 03 Nov 2009 06:52:50 -0800

On 11/3/2009 9:29 AM, Johannes Graumann wrote:

Hi,
I'm dealing with huge files that I would like to index. On a Linux system, "grep -buo <PATTERN> <FILENAME>" hands me the byte offsets for "PATTERN" very quickly, and I am looking to emulate that speed and ease with native R tools, for portability and elegance. "gregexpr" should be able to do that, but I fail to combine it with "scan" or an equivalent to parse the whole file without having to read it all into memory.

I think you are going to have to write this yourself. R doesn't have very many stream-oriented functions: almost everything is aimed at having the whole thing in memory.

You will also have trouble with the byte offsets. The semantics of the -u option to grep are quite strange (at least according to the man page on Cygwin).

What I'd do given your problem is use readLines to read the file, then post-process the result of gregexpr to give line and byte offset pairs for each match; those are more useful in R than the rather bizarre "byte offsets" that grep -buo will give. But for a huge file you'll probably have to do this in blocks, as the whole file may be too big.
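
Something along these lines might serve as a starting point. It is only a minimal sketch of the block-wise approach, not a tested solution: the function name grep_offsets and its block_size argument are made up for illustration, and it assumes that matches never span a line break. Offsets are 1-based byte positions within each line (hence useBytes = TRUE), not offsets from the start of the file.

```r
# Hypothetical helper (not part of R): read a file in blocks of lines and
# return, for each match, its line number, its 1-based byte offset within
# that line, and the match length in bytes.
grep_offsets <- function(path, pattern, block_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  lines_read <- 0L
  repeat {
    # Only block_size lines are held in memory at any one time.
    block <- readLines(con, n = block_size, warn = FALSE)
    if (length(block) == 0L) break
    matches <- gregexpr(pattern, block, useBytes = TRUE)
    for (i in seq_along(matches)) {
      m <- matches[[i]]
      if (m[1L] != -1L) {  # gregexpr returns -1 for lines without a match
        results[[length(results) + 1L]] <- data.frame(
          line   = lines_read + i,
          offset = as.integer(m),
          length = attr(m, "match.length")
        )
      }
    }
    lines_read <- lines_read + length(block)
  }
  if (length(results) == 0L) {
    return(data.frame(line = integer(0), offset = integer(0),
                      length = integer(0)))
  }
  do.call(rbind, results)
}

# Small usage example on a temporary file, read two lines at a time.
tmp <- tempfile()
writeLines(c("foo bar foo", "nothing here", "a foo"), tmp)
res <- grep_offsets(tmp, "foo", block_size = 2L)
print(res)
```

Growing a list and calling rbind at the end is simple but not especially fast; for files with very many matches, preallocating or writing the results out per block would probably be better.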


Duncan Murdoch

I'd be grateful for any hints on how to do this without a "pipe("grep -buo <PATTERN> <FILENAME>")".


Thanks, Joh

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

