Using r.ReadLine() I can successfully read 100 MB line in a string, using the following conditional statement which is increasing the buffer until '\n' is encountered. for isPrefix && err == nil { line, isPrefix, err = r.ReadLine() ln = append(ln, line...) }
Now the last problem is how to search a substring in 100MB string On Saturday, May 7, 2022 at 10:25:29 PM UTC-7 Amnon wrote: > So you raise a couple of questions: > > 1) How about handling runes? > The nice thing about utf8 is you don't have to care. If you are searching > for the word ascii byte 'test', you can > simply compare byte by byte - the letter t is represented by 0x74, and > this byte in the search buffer can > only represent t - it can never be part of a multi-byte-rune - not in utf8. > > 2) Can we handle lines of hundreds of MB? > Not really. The default will limit MaxTokenSize to 64K. > If you set your own buffer (recommended) then the size of this buffer will > be your maximum line length. > Unfortunately you can't really solve this problem without slurping the > entire file into memory. Because > the whole file could consist of a single line, with the word "test" at the > very end - and at this point, you will > still need access to the beginning of the file in order to output it. > > 3) The Mike Haertel message (from 2010) is interesting, as is the > reference to the paper > "Fast String Searching", by Andrew Hume and Daniel Sunday, in the November > 1991 issue of Software Practice & Experience > All this is a quite old. Computer architecture has changed significantly > since then, CPUs have got a lot faster and added vector > instructions which speed up dumb algorithms a lot. The relative cost of > accessing memory has massively increased, as memory > speeds have not increased at anything like the rate of CPU speeds (though > the Apple M1 system on chips family has changed this) > so memory -> CPU is often the bottleneck, and optimising CPU performance > through clever algorithms often merely results in > the CPU being stuck in stall cycles waiting for cache misses to be > serviced. > That said, some of the points Mike makes are still valid. One is the > massive cost of breaking up lines - which slows down performance > by an order of magnitude, by forcing you to examine every byte. So > searching for the search-text, and then when you find it, searching > backwards > from that point to find the enclosing line, is a big win. > The other point he makes which is still valid is to consider the cost of > copying data. > This will probably be the dominant cost. Running `cat > /dev/null` will > tell you how long it takes to merely read the data, and this will give > you a baseline figure. A reasonably carefully written grep clone should > achieve roughly the same runtime. > > As you specified the problem as reading from stdin, using memory mapped > files, using faster filesystems and fast SSDs will not help you. > > And, as always in Go, allocations trump almost everything else. If you end > up converting each line into a string, forcing allocations and GC work, > that will knock orders of magnitude off your performance. > > Disclaimer: I have not actually written any code or run any benchmarks > here. But there are some interesting questions which could make an > interesting paper. To what extent has Hume and Sunday's 1991 paper been > invalidated by hardware changes? Are smart algorithm's which > avoid looking at every byte still a win, or do they actually slow down > performance by adding branch mispredictions, and preventing vectorisation > optimisations? To what extent does the Apple/Arm system on chip family, > with their much lower memory latency, change these considerations? > How much slower will a simple Go clone of grep be than the C version which > has had decades of optimisation? > And how much effort would it take to optimise the Go clone to reach the > performance of the C version? > If anyone wants to write the code, do the benchmarks, and publish the > conclusions, then send me a link. It will be an interesting read. > > > On Sunday, 8 May 2022 at 00:55:28 UTC+1 Const V wrote: > >> I cannot use fgrep in this environment. >> >> Looks like Scan buffer is resizable. >> >> https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;l=190 >> // Is the buffer full? If so, resize. >> if s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .end >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36> >> >> == len(s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .buf >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>) >> >> { >> // Guarantee no overflow in the multiplication below. >> const maxInt >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=maxInt&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23const%2520%255Bscan.go%2523192%255D.maxInt> >> >> = int >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=int&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23int%2523builtin> >> (^uint >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=uint&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23uint%2523builtin>(0) >> >> >> 1) >> if len(s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .buf >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>) >> >> >= s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .maxTokenSize >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32> >> >> || len(s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .buf >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>) >> >> > maxInt >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=192>/2 >> >> { >> s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .setErr >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=252> >> (ErrTooLong >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=69>) >> >> >> return false >> } >> newSize >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=198?gsn=newSize&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23var%2520%255Bscan.go%2523197%255D.newSize> >> >> := len(s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .buf >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>) >> >> * 2 >> if newSize >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197> >> >> == 0 { >> newSize >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197> >> >> = startBufSize >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=82> >> >> } >> if newSize >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197> >> >> > s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .maxTokenSize >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32> >> >> { >> newSize >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197> >> >> = s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .maxTokenSize >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32> >> >> } >> newBuf >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=205?gsn=newBuf&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23var%2520%255Bscan.go%2523204%255D.newBuf> >> >> := make([]byte >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=205?gsn=byte&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23byte%2523builtin>, >> >> newSize >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>) >> >> >> copy(newBuf >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=204>, >> >> s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .buf >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34> >> [s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .start >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35> >> :s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .end >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>]) >> >> >> s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .buf >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34> >> >> = newBuf >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=204> >> >> s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .end >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36> >> >> -= s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .start >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35> >> >> s >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135> >> .start >> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35> >> >> = 0 >> } >> On Saturday, May 7, 2022 at 4:43:59 PM UTC-7 kortschak wrote: >> >>> On Sat, 2022-05-07 at 16:16 -0700, Const V wrote: >>> > The question is will scanner.Scan handle a line of 100s MB? >>> >>> No, at least not by default (https://pkg.go.dev/bufio#Scanner.Buffer). >>> But that that point you want to start questioning why you're doing what >>> you're doing. >>> >>> Your invocation of grep can be a lot simpler >>> >>> ``` >>> func grep(match string, r io.Reader, stdout, stderr io.Writer) error { >>> cmd := exec.Command("fgrep", match) >>> cmd.Stdin = r >>> cmd.Stdout = stdout >>> cmd.Stderr = stderr >>> return cmd.Run() >>> } >>> ``` >>> >>> You can then do whatever you want with the buffered output and stderr. >>> You can also make this use a pipe with minor changes if you expect the >>> output to be long and that stream processing of that would be sensible >>> (use cmd.Start instead of Run and look into StdoutPipe). >>> >>> >>> -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/e0ae7c9e-211b-424b-9cff-81cd65666fa7n%40googlegroups.com.