Using r.ReadLine() I can successfully read 100 MB line in a string, using 
the following conditional statement which is 
increasing the buffer until '\n' is encountered.
for isPrefix && err == nil {
line, isPrefix, err = r.ReadLine()
ln = append(ln, line...)
}

Now the last problem is how to search a substring in 100MB string

On Saturday, May 7, 2022 at 10:25:29 PM UTC-7 Amnon wrote:

> So you raise a couple of questions:
>
> 1) How about handling runes?
> The nice thing about utf8 is you don't have to care. If you are searching 
> for the word ascii byte 'test', you can 
> simply compare byte by byte - the letter t is represented by 0x74, and 
> this byte in the search buffer can 
> only represent t - it can never be part of a multi-byte-rune - not in utf8.
>
> 2) Can we handle lines of hundreds of MB? 
> Not really. The default will limit MaxTokenSize to 64K. 
> If you set your own buffer (recommended) then the size of this buffer will 
> be your maximum line length.
> Unfortunately you can't really solve this problem without slurping the 
> entire file into memory. Because 
> the whole file could consist of a single line, with the word "test" at the 
> very end - and at this point, you will 
> still need access to the beginning of the file in order to output it.
>
> 3) The Mike Haertel message (from 2010) is interesting, as is the 
> reference to the paper
> "Fast String Searching", by Andrew Hume and Daniel Sunday, in the November 
> 1991 issue of Software Practice & Experience
> All this is a quite old. Computer architecture has changed significantly 
> since then, CPUs have got a lot faster and added vector
> instructions which speed up dumb algorithms a lot. The relative cost of 
> accessing memory has massively increased, as memory
> speeds have not increased at anything like the rate of CPU speeds (though 
> the Apple M1 system on chips family has changed this)
> so memory -> CPU is often the bottleneck,  and optimising CPU performance 
> through clever algorithms often merely results in 
> the CPU being stuck in stall cycles waiting for cache misses to be 
> serviced.
> That said, some of the points Mike makes are still valid. One is the 
> massive cost of breaking up lines - which slows down performance
> by an order of magnitude, by forcing you to examine every byte. So 
> searching for the search-text, and then when you find it, searching 
> backwards
> from that point to find the enclosing line, is a big win.
> The other point he makes which is still valid is to consider the cost of 
> copying data. 
> This will probably be the dominant cost. Running `cat > /dev/null` will 
> tell you how long it takes to merely read the data, and this will give
> you a baseline figure. A reasonably carefully written grep clone should 
> achieve roughly the same runtime.
>
> As you specified the problem as reading from stdin, using memory mapped 
> files, using faster filesystems and fast SSDs will not help you.
>
> And, as always in Go, allocations trump almost everything else. If you end 
> up converting each line into a string, forcing allocations and GC work,
> that will knock orders of magnitude off your performance. 
>
> Disclaimer: I have not actually written any code or run any benchmarks 
> here. But there are some interesting questions which could make an 
> interesting paper. To what extent has Hume and Sunday's 1991 paper been 
> invalidated by hardware changes? Are smart algorithm's which 
> avoid looking at every byte still a win, or do they actually slow down 
> performance by adding branch mispredictions, and preventing vectorisation
> optimisations? To what extent does the Apple/Arm system on chip family, 
> with their much lower memory latency, change these considerations?
> How much slower will a simple Go clone of grep be than the C version which 
> has had decades of optimisation?
> And how much effort would it take to optimise the Go clone to reach the 
> performance of the C version?
> If anyone wants to write the code, do the benchmarks, and publish the 
> conclusions, then send me a link. It will be an interesting read.
>
>
> On Sunday, 8 May 2022 at 00:55:28 UTC+1 Const V wrote:
>
>> I cannot use fgrep in this environment.
>>
>> Looks like Scan buffer is resizable.
>>
>> https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;l=190
>> // Is the buffer full? If so, resize. 
>>        if s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .end 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>
>>  
>> == len(s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .buf 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>>  
>> { 
>>            // Guarantee no overflow in the multiplication below. 
>>            const maxInt 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=maxInt&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23const%2520%255Bscan.go%2523192%255D.maxInt>
>>  
>> = int 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=int&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23int%2523builtin>
>> (^uint 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=193?gsn=uint&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23uint%2523builtin>(0)
>>  
>> >> 1) 
>>            if len(s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .buf 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>>  
>> >= s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .maxTokenSize 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32>
>>  
>> || len(s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .buf 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>>  
>> > maxInt 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=192>/2
>>  
>> { 
>>                s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .setErr 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=252>
>> (ErrTooLong 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=69>)
>>  
>>
>>                return false 
>>            } 
>>            newSize 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=198?gsn=newSize&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23var%2520%255Bscan.go%2523197%255D.newSize>
>>  
>> := len(s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .buf 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>)
>>  
>> * 2 
>>            if newSize 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>>  
>> == 0 { 
>>                newSize 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>>  
>> = startBufSize 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=82>
>>  
>>            } 
>>            if newSize 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>>  
>> > s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .maxTokenSize 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32>
>>  
>> { 
>>                newSize 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>
>>  
>> = s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .maxTokenSize 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=32>
>>  
>>            } 
>>            newBuf 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=205?gsn=newBuf&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%3Fpath%3Dbufio%23var%2520%255Bscan.go%2523204%255D.newBuf>
>>  
>> := make([]byte 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;bpv=1;bpt=1;l=205?gsn=byte&gs=kythe%3A%2F%2Fgo.googlesource.com%2Fgo%3Flang%3Dgo%23byte%2523builtin>,
>>  
>> newSize 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=197>)
>>  
>>
>>            copy(newBuf 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=204>,
>>  
>> s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .buf 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>
>> [s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .start 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35>
>> :s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .end 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>])
>>  
>>
>>            s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .buf 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=34>
>>  
>> = newBuf 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=204>
>>  
>>            s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .end 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=36>
>>  
>> -= s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .start 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35>
>>  
>>            s 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=135>
>> .start 
>> <https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;drc=a131fd1313e0056ad094d234c67648409d081b8c;l=35>
>>  
>> = 0 
>>        }
>> On Saturday, May 7, 2022 at 4:43:59 PM UTC-7 kortschak wrote:
>>
>>> On Sat, 2022-05-07 at 16:16 -0700, Const V wrote: 
>>> > The question is will scanner.Scan handle a line of 100s MB? 
>>>
>>> No, at least not by default (https://pkg.go.dev/bufio#Scanner.Buffer). 
>>> But that that point you want to start questioning why you're doing what 
>>> you're doing. 
>>>
>>> Your invocation of grep can be a lot simpler 
>>>
>>> ``` 
>>> func grep(match string, r io.Reader, stdout, stderr io.Writer) error { 
>>> cmd := exec.Command("fgrep", match) 
>>> cmd.Stdin = r 
>>> cmd.Stdout = stdout 
>>> cmd.Stderr = stderr 
>>> return cmd.Run() 
>>> } 
>>> ``` 
>>>
>>> You can then do whatever you want with the buffered output and stderr. 
>>> You can also make this use a pipe with minor changes if you expect the 
>>> output to be long and that stream processing of that would be sensible 
>>> (use cmd.Start instead of Run and look into StdoutPipe). 
>>>
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/e0ae7c9e-211b-424b-9cff-81cd65666fa7n%40googlegroups.com.

Reply via email to