So you raise a couple of questions: 1) How about handling runes? The nice thing about utf8 is you don't have to care. If you are searching for the word ascii byte 'test', you can simply compare byte by byte - the letter t is represented by 0x74, and this byte in the search buffer can only represent t - it can never be part of a multi-byte-rune - not in utf8.
2) Can we handle lines of hundreds of MB? Not really. The default will limit MaxTokenSize to 64K. If you set your own buffer (recommended) then the size of this buffer will be your maximum line length. Unfortunately you can't really solve this problem without slurping the entire file into memory. Because the whole file could consist of a single line, with the word "test" at the very end - and at this point, you will still need access to the beginning of the file in order to output it. 3) The Mike Haertel message (from 2010) is interesting, as is the reference to the paper "Fast String Searching", by Andrew Hume and Daniel Sunday, in the November 1991 issue of Software Practice & Experience All this is a quite old. Computer architecture has changed significantly since then, CPUs have got a lot faster and added vector instructions which speed up dumb algorithms a lot. The relative cost of accessing memory has massively increased, as memory speeds have not increased at anything like the rate of CPU speeds (though the Apple M1 system on chips family has changed this) so memory -> CPU is often the bottleneck, and optimising CPU performance through clever algorithms often merely results in the CPU being stuck in stall cycles waiting for cache misses to be serviced. That said, some of the points Mike makes are still valid. One is the massive cost of breaking up lines - which slows down performance by an order of magnitude, by forcing you to examine every byte. So searching for the search-text, and then when you find it, searching backwards from that point to find the enclosing line, is a big win. The other point he makes which is still valid is to consider the cost of copying data. This will probably be the dominant cost. Running `cat > /dev/null` will tell you how long it takes to merely read the data, and this will give you a baseline figure. A reasonably carefully written grep clone should achieve roughly the same runtime. As you specified the problem as reading from stdin, using memory mapped files, using faster filesystems and fast SSDs will not help you. And, as always in Go, allocations trump almost everything else. If you end up converting each line into a string, forcing allocations and GC work, that will knock orders of magnitude off your performance. Disclaimer: I have not actually written any code or run any benchmarks here. But there are some interesting questions which could make an interesting paper. To what extent has Hume and Sunday's 1991 paper been invalidated by hardware changes? Are smart algorithm's which avoid looking at every byte still a win, or do they actually slow down performance by adding branch mispredictions, and preventing vectorisation optimisations? To what extent does the Apple/Arm system on chip family, with their much lower memory latency, change these considerations? How much slower will a simple Go clone of grep be than the C version which has had decades of optimisation? And how much effort would it take to optimise the Go clone to reach the performance of the C version? If anyone wants to write the code, do the benchmarks, and publish the conclusions, then send me a link. It will be an interesting read. On Sunday, 8 May 2022 at 00:55:28 UTC+1 Const V wrote: > I cannot use fgrep in this environment. > > Looks like Scan buffer is resizable. > > https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;l=190 > // Is the buffer full? If so, resize. > if s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .end > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36> > > == len(s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .buf > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>) > > { > // Guarantee no overflow in the multiplication below. > const maxInt > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=maxInt&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23const%2520%255Bscan.go%2523192%255D.maxInt> > > = int > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=int&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23int%2523builtin> > (^uint > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=uint&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23uint%2523builtin>(0) > > >> 1) > if len(s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .buf > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>) > > >= s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .maxTokenSize > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32> > > || len(s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .buf > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>) > > > maxInt > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=192>/2 > > { > s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .setErr > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=252> > (ErrTooLong > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=69>) > > > return false > } > newSize > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=198?gsn=newSize&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23var%2520%255Bscan.go%2523197%255D.newSize> > > := len(s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .buf > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>) > > * 2 > if newSize > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197> > > == 0 { > newSize > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197> > > = startBufSize > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=82> > > } > if newSize > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197> > > > s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .maxTokenSize > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32> > > { > newSize > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197> > > = s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .maxTokenSize > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32> > > } > newBuf > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=205?gsn=newBuf&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23var%2520%255Bscan.go%2523204%255D.newBuf> > > := make([]byte > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=205?gsn=byte&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23byte%2523builtin>, > > newSize > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>) > > > copy(newBuf > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=204>, > > s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .buf > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34> > [s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .start > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35> > :s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .end > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>]) > > > s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .buf > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34> > > = newBuf > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=204> > > s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .end > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36> > > -= s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .start > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35> > > s > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> > .start > <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35> > > = 0 > } > On Saturday, May 7, 2022 at 4:43:59 PM UTC-7 kortschak wrote: > >> On Sat, 2022-05-07 at 16:16 -0700, Const V wrote: >> > The question is will scanner.Scan handle a line of 100s MB? >> >> No, at least not by default (https://pkg.go.dev/bufio#Scanner.Buffer). >> But that that point you want to start questioning why you're doing what >> you're doing. >> >> Your invocation of grep can be a lot simpler >> >> ``` >> func grep(match string, r io.Reader, stdout, stderr io.Writer) error { >> cmd := exec.Command("fgrep", match) >> cmd.Stdin = r >> cmd.Stdout = stdout >> cmd.Stderr = stderr >> return cmd.Run() >> } >> ``` >> >> You can then do whatever you want with the buffered output and stderr. >> You can also make this use a pipe with minor changes if you expect the >> output to be long and that stream processing of that would be sensible >> (use cmd.Start instead of Run and look into StdoutPipe). >> >> >> -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/b77a013d-b513-4a7c-a5e4-91f1c0d8e6d1n%40googlegroups.com.