So you raise a couple of questions:

1) How about handling runes?
The nice thing about utf8 is you don't have to care. If you are searching 
for the word ascii byte 'test', you can 
simply compare byte by byte - the letter t is represented by 0x74, and this 
byte in the search buffer can 
only represent t - it can never be part of a multi-byte-rune - not in utf8.

2) Can we handle lines of hundreds of MB? 
Not really. The default will limit MaxTokenSize to 64K. 
If you set your own buffer (recommended) then the size of this buffer will 
be your maximum line length.
Unfortunately you can't really solve this problem without slurping the 
entire file into memory. Because 
the whole file could consist of a single line, with the word "test" at the 
very end - and at this point, you will 
still need access to the beginning of the file in order to output it.

3) The Mike Haertel message (from 2010) is interesting, as is the reference 
to the paper
"Fast String Searching", by Andrew Hume and Daniel Sunday, in the November 
1991 issue of Software Practice & Experience
All this is a quite old. Computer architecture has changed significantly 
since then, CPUs have got a lot faster and added vector
instructions which speed up dumb algorithms a lot. The relative cost of 
accessing memory has massively increased, as memory
speeds have not increased at anything like the rate of CPU speeds (though 
the Apple M1 system on chips family has changed this)
so memory -> CPU is often the bottleneck,  and optimising CPU performance 
through clever algorithms often merely results in 
the CPU being stuck in stall cycles waiting for cache misses to be serviced.
That said, some of the points Mike makes are still valid. One is the 
massive cost of breaking up lines - which slows down performance
by an order of magnitude, by forcing you to examine every byte. So 
searching for the search-text, and then when you find it, searching 
backwards
from that point to find the enclosing line, is a big win.
The other point he makes which is still valid is to consider the cost of 
copying data. 
This will probably be the dominant cost. Running `cat > /dev/null` will 
tell you how long it takes to merely read the data, and this will give
you a baseline figure. A reasonably carefully written grep clone should 
achieve roughly the same runtime.

As you specified the problem as reading from stdin, using memory mapped 
files, using faster filesystems and fast SSDs will not help you.

And, as always in Go, allocations trump almost everything else. If you end 
up converting each line into a string, forcing allocations and GC work,
that will knock orders of magnitude off your performance. 

Disclaimer: I have not actually written any code or run any benchmarks 
here. But there are some interesting questions which could make an 
interesting paper. To what extent has Hume and Sunday's 1991 paper been 
invalidated by hardware changes? Are smart algorithm's which 
avoid looking at every byte still a win, or do they actually slow down 
performance by adding branch mispredictions, and preventing vectorisation
optimisations? To what extent does the Apple/Arm system on chip family, 
with their much lower memory latency, change these considerations?
How much slower will a simple Go clone of grep be than the C version which 
has had decades of optimisation?
And how much effort would it take to optimise the Go clone to reach the 
performance of the C version?
If anyone wants to write the code, do the benchmarks, and publish the 
conclusions, then send me a link. It will be an interesting read.


On Sunday, 8 May 2022 at 00:55:28 UTC+1 Const V wrote:

> I cannot use fgrep in this environment.
>
> Looks like Scan buffer is resizable.
>
> https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;l=190
> // Is the buffer full? If so, resize. 
>        if s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .end 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>
>  
> == len(s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .buf 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>  
> { 
>            // Guarantee no overflow in the multiplication below. 
>            const maxInt 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=maxInt&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23const%2520%255Bscan.go%2523192%255D.maxInt>
>  
> = int 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=int&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23int%2523builtin>
> (^uint 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=uint&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23uint%2523builtin>(0)
>  
> >> 1) 
>            if len(s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .buf 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>  
> >= s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .maxTokenSize 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32>
>  
> || len(s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .buf 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>  
> > maxInt 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=192>/2
>  
> { 
>                s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .setErr 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=252>
> (ErrTooLong 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=69>)
>  
>
>                return false 
>            } 
>            newSize 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=198?gsn=newSize&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23var%2520%255Bscan.go%2523197%255D.newSize>
>  
> := len(s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .buf 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>  
> * 2 
>            if newSize 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>  
> == 0 { 
>                newSize 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>  
> = startBufSize 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=82>
>  
>            } 
>            if newSize 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>  
> > s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .maxTokenSize 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32>
>  
> { 
>                newSize 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>  
> = s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .maxTokenSize 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32>
>  
>            } 
>            newBuf 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=205?gsn=newBuf&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23var%2520%255Bscan.go%2523204%255D.newBuf>
>  
> := make([]byte 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=205?gsn=byte&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23byte%2523builtin>,
>  
> newSize 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>)
>  
>
>            copy(newBuf 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=204>,
>  
> s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .buf 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>
> [s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .start 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35>
> :s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .end 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>])
>  
>
>            s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .buf 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>
>  
> = newBuf 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=204>
>  
>            s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .end 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>
>  
> -= s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .start 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35>
>  
>            s 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
> .start 
> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35>
>  
> = 0 
>        }
> On Saturday, May 7, 2022 at 4:43:59 PM UTC-7 kortschak wrote:
>
>> On Sat, 2022-05-07 at 16:16 -0700, Const V wrote: 
>> > The question is will scanner.Scan handle a line of 100s MB? 
>>
>> No, at least not by default (https://pkg.go.dev/bufio#Scanner.Buffer). 
>> But that that point you want to start questioning why you're doing what 
>> you're doing. 
>>
>> Your invocation of grep can be a lot simpler 
>>
>> ``` 
>> func grep(match string, r io.Reader, stdout, stderr io.Writer) error { 
>> cmd := exec.Command("fgrep", match) 
>> cmd.Stdin = r 
>> cmd.Stdout = stdout 
>> cmd.Stderr = stderr 
>> return cmd.Run() 
>> } 
>> ``` 
>>
>> You can then do whatever you want with the buffered output and stderr. 
>> You can also make this use a pipe with minor changes if you expect the 
>> output to be long and that stream processing of that would be sensible 
>> (use cmd.Start instead of Run and look into StdoutPipe). 
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/b77a013d-b513-4a7c-a5e4-91f1c0d8e6d1n%40googlegroups.com.

Reply via email to