You're overcomplicating this. Use a buffered reader and keep track of the
position where the last newline was seen. It's a trivial state machine:
find 'test', then continue to the next newline; seek to the stored
last-newline position and buffered-read/write to stdout until the next
newline.
> On May 8, 2022, at 3:28 PM, Const V <thst...@gmail.com> wrote:
>
> pre-allocating a buffer is not an option, it should be dynamic
>
>> On Sunday, May 8, 2022 at 1:24:40 PM UTC-7 Barnim Dzwillo wrote:
>> I had a similar use case in the past and got the best performance when
>> using ReadSlice() instead of scanner.Scan().
>> See sample code here: https://go.dev/play/p/EfvadCURcXt
>>
>>> On Sunday, May 8, 2022 at 7:25:29 AM UTC+2 Amnon wrote:
>>> So you raise a couple of questions:
>>>
>>> 1) How about handling runes?
>>> The nice thing about UTF-8 is that you don't have to care. If you are
>>> searching for the ASCII word 'test', you can simply compare byte by
>>> byte: the letter t is represented by 0x74, and in UTF-8 that byte in the
>>> search buffer can only represent t; it can never be part of a
>>> multi-byte rune.
>>>
>>> 2) Can we handle lines of hundreds of MB?
>>> Not really. The default limits MaxTokenSize to 64K. If you set your own
>>> buffer (recommended), then the size of that buffer will be your maximum
>>> line length. Unfortunately, you can't really solve this problem without
>>> slurping the entire file into memory, because the whole file could
>>> consist of a single line with the word "test" at the very end, and at
>>> that point you still need access to the beginning of the file in order
>>> to output it.
>>>
>>> 3) The Mike Haertel message (from 2010) is interesting, as is the
>>> reference to the paper "Fast String Searching" by Andrew Hume and
>>> Daniel Sunday, in the November 1991 issue of Software Practice &
>>> Experience. All of this is quite old. Computer architecture has changed
>>> significantly since then: CPUs have become a lot faster and have added
>>> vector instructions which speed up dumb algorithms a lot.
>>> The relative cost of accessing memory has massively increased, as
>>> memory speeds have not increased at anything like the rate of CPU
>>> speeds (though the Apple M1 family of systems-on-chip has changed
>>> this), so memory-to-CPU transfer is often the bottleneck, and
>>> optimising CPU performance through clever algorithms often merely
>>> leaves the CPU stuck in stall cycles waiting for cache misses to be
>>> serviced.
>>>
>>> That said, some of the points Mike makes are still valid. One is the
>>> massive cost of breaking up lines, which slows performance down by an
>>> order of magnitude by forcing you to examine every byte. So searching
>>> for the search text, and then, when you find it, searching backwards
>>> from that point to find the enclosing line, is a big win.
>>>
>>> The other point he makes which is still valid is to consider the cost
>>> of copying data. This will probably be the dominant cost. Running
>>> `cat > /dev/null` will tell you how long it takes merely to read the
>>> data, and this will give you a baseline figure. A reasonably carefully
>>> written grep clone should achieve roughly the same runtime.
>>>
>>> As you specified the problem as reading from stdin, memory-mapped
>>> files, faster filesystems, and fast SSDs will not help you.
>>>
>>> And, as always in Go, allocations trump almost everything else. If you
>>> end up converting each line into a string, forcing allocations and GC
>>> work, that will knock orders of magnitude off your performance.
>>>
>>> Disclaimer: I have not actually written any code or run any benchmarks
>>> here. But there are some interesting questions which could make an
>>> interesting paper. To what extent has Hume and Sunday's 1991 paper been
>>> invalidated by hardware changes? Are smart algorithms which avoid
>>> looking at every byte still a win, or do they actually slow performance
>>> down by adding branch mispredictions and preventing vectorisation
>>> optimisations?
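The search-first, find-the-line-boundaries-afterwards idea can be sketched with bytes.Index and bytes.LastIndexByte (my sketch; it assumes a whole buffer, or at least a chunk, is in memory):

```go
package main

import (
	"bytes"
	"fmt"
)

// matchingLines locates each occurrence of needle with bytes.Index (which
// uses optimised assembly on most platforms), then expands backwards and
// forwards to the enclosing newlines. Line boundaries are only computed
// around matches, never for non-matching data. needle must be non-empty.
func matchingLines(buf, needle []byte) [][]byte {
	var lines [][]byte
	off := 0
	for {
		i := bytes.Index(buf[off:], needle)
		if i < 0 {
			return lines
		}
		i += off
		start := bytes.LastIndexByte(buf[:i], '\n') + 1 // 0 if no prior newline
		end := bytes.IndexByte(buf[i:], '\n')
		if end < 0 {
			end = len(buf) // last line has no trailing newline
		} else {
			end += i + 1 // include the newline
		}
		lines = append(lines, buf[start:end])
		off = end // resume after this line so it is reported once
	}
}

func main() {
	buf := []byte("aaa\nxx test yy\nbbb\ntest\n")
	for _, line := range matchingLines(buf, []byte("test")) {
		fmt.Print(string(line))
	}
	// prints:
	// xx test yy
	// test
}
```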
>>> To what extent does the Apple/ARM system-on-chip family, with its much
>>> lower memory latency, change these considerations? How much slower will
>>> a simple Go clone of grep be than the C version, which has had decades
>>> of optimisation? And how much effort would it take to optimise the Go
>>> clone to reach the performance of the C version? If anyone wants to
>>> write the code, do the benchmarks, and publish the conclusions, send me
>>> a link. It will be an interesting read.
>>>
>>>> On Sunday, 8 May 2022 at 00:55:28 UTC+1 Const V wrote:
>>>> I cannot use fgrep in this environment.
>>>>
>>>> It looks like the Scan buffer is resizable:
>>>> https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;l=190
>>>>
>>>> // Is the buffer full? If so, resize.
>>>> if s.end == len(s.buf) {
>>>> 	// Guarantee no overflow in the multiplication below.
>>>> 	const maxInt = int(^uint(0) >> 1)
>>>> 	if len(s.buf) >= s.maxTokenSize || len(s.buf) > maxInt/2 {
>>>> 		s.setErr(ErrTooLong)
>>>> 		return false
>>>> 	}
>>>> 	newSize := len(s.buf) * 2
>>>> 	if newSize == 0 {
>>>> 		newSize = startBufSize
>>>> 	}
>>>> 	if newSize > s.maxTokenSize {
>>>> 		newSize = s.maxTokenSize
>>>> 	}
>>>> 	newBuf := make([]byte, newSize)
>>>> 	copy(newBuf, s.buf[s.start:s.end])
>>>> 	s.buf = newBuf
>>>> 	s.end -= s.start
>>>> 	s.start = 0
>>>> }
>>>>
>>>>> On Saturday, May 7, 2022 at 4:43:59 PM UTC-7 kortschak wrote:
>>>>> On Sat, 2022-05-07 at 16:16 -0700, Const V wrote:
>>>>> > The question is will scanner.Scan handle a line of 100s of MB?
>>>>>
>>>>> No, at least not by default (https://pkg.go.dev/bufio#Scanner.Buffer).
>>>>> But at that point you want to start questioning why you're doing what
>>>>> you're doing.
>>>>> Your invocation of grep can be a lot simpler:
>>>>>
>>>>> ```
>>>>> func grep(match string, r io.Reader, stdout, stderr io.Writer) error {
>>>>> 	cmd := exec.Command("fgrep", match)
>>>>> 	cmd.Stdin = r
>>>>> 	cmd.Stdout = stdout
>>>>> 	cmd.Stderr = stderr
>>>>> 	return cmd.Run()
>>>>> }
>>>>> ```
>>>>>
>>>>> You can then do whatever you want with the buffered output and
>>>>> stderr. You can also make this use a pipe with minor changes if you
>>>>> expect the output to be long and stream processing of it would be
>>>>> sensible (use cmd.Start instead of Run and look into StdoutPipe).