You're overcomplicating this. Use a buffered reader and keep track of the
position where the last newline was seen. It's a trivial state machine:
find 'test', then continue to the next newline; seek to the stored
last-newline position and buffered-read/write to stdout until the next
newline.
> On May 8, 2022, at 3:28 PM, Const V <thst...@gmail.com> wrote:
>
> pre-allocating a buffer is not an option, it should be dynamic
>
>> On Sunday, May 8, 2022 at 1:24:40 PM UTC-7 Barnim Dzwillo wrote:
>> I had a similar use case in the past and got the best performance when
>> using ReadSlice() instead of scanner.Scan().
>> See sample code here: https://go.dev/play/p/EfvadCURcXt
>>
>>> On Sunday, May 8, 2022 at 7:25:29 AM UTC+2 Amnon wrote:
>>> So you raise a couple of questions:
>>>
>>> 1) How about handling runes?
>>> The nice thing about UTF-8 is that you don't have to care. If you are
>>> searching for the ASCII word 'test', you can simply compare byte by
>>> byte: the letter t is represented by 0x74, and in UTF-8 that byte in the
>>> search buffer can only represent t; it can never be part of a
>>> multi-byte rune.
>>>
>>> 2) Can we handle lines of hundreds of MB?
>>> Not really. The default limits MaxTokenSize to 64K. If you set your own
>>> buffer (recommended), then the size of that buffer will be your maximum
>>> line length. Unfortunately, you can't really solve this problem without
>>> slurping the entire file into memory, because the whole file could
>>> consist of a single line with the word "test" at the very end, and at
>>> that point you still need access to the beginning of the file in order
>>> to output it.
>>>
>>> 3) The Mike Haertel message (from 2010) is interesting, as is the
>>> reference to the paper "Fast String Searching" by Andrew Hume and
>>> Daniel Sunday, in the November 1991 issue of Software Practice &
>>> Experience. All of this is quite old. Computer architecture has changed
>>> significantly since then: CPUs have become a lot faster and have added
>>> vector instructions which speed up dumb algorithms a lot.
>>> The relative cost of accessing memory has massively increased, as
>>> memory speeds have not increased at anything like the rate of CPU
>>> speeds (though the Apple M1 family of systems-on-chip has changed
>>> this), so memory-to-CPU transfer is often the bottleneck, and
>>> optimising CPU performance through clever algorithms often merely
>>> leaves the CPU stuck in stall cycles waiting for cache misses to be
>>> serviced.
>>>
>>> That said, some of the points Mike makes are still valid. One is the
>>> massive cost of breaking up lines, which slows performance down by an
>>> order of magnitude by forcing you to examine every byte. So searching
>>> for the search text, and then, when you find it, searching backwards
>>> from that point to find the enclosing line, is a big win.
>>>
>>> The other point he makes which is still valid is to consider the cost
>>> of copying data. This will probably be the dominant cost. Running
>>> `cat > /dev/null` will tell you how long it takes merely to read the
>>> data, and this will give you a baseline figure. A reasonably carefully
>>> written grep clone should achieve roughly the same runtime.
>>>
>>> As you specified the problem as reading from stdin, memory-mapped
>>> files, faster filesystems, and fast SSDs will not help you.
>>>
>>> And, as always in Go, allocations trump almost everything else. If you
>>> end up converting each line into a string, forcing allocations and GC
>>> work, that will knock orders of magnitude off your performance.
>>>
>>> Disclaimer: I have not actually written any code or run any benchmarks
>>> here. But there are some interesting questions which could make an
>>> interesting paper. To what extent has Hume and Sunday's 1991 paper been
>>> invalidated by hardware changes? Are smart algorithms which avoid
>>> looking at every byte still a win, or do they actually slow performance
>>> down by adding branch mispredictions and preventing vectorisation
>>> optimisations?
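The search-first, find-the-line-boundaries-afterwards idea can be sketched with bytes.Index and bytes.LastIndexByte (my sketch; it assumes a whole buffer, or at least a chunk, is in memory):

```go
package main

import (
	"bytes"
	"fmt"
)

// matchingLines locates each occurrence of needle with bytes.Index (which
// uses optimised assembly on most platforms), then expands backwards and
// forwards to the enclosing newlines. Line boundaries are only computed
// around matches, never for non-matching data. needle must be non-empty.
func matchingLines(buf, needle []byte) [][]byte {
	var lines [][]byte
	off := 0
	for {
		i := bytes.Index(buf[off:], needle)
		if i < 0 {
			return lines
		}
		i += off
		start := bytes.LastIndexByte(buf[:i], '\n') + 1 // 0 if no prior newline
		end := bytes.IndexByte(buf[i:], '\n')
		if end < 0 {
			end = len(buf) // last line has no trailing newline
		} else {
			end += i + 1 // include the newline
		}
		lines = append(lines, buf[start:end])
		off = end // resume after this line so it is reported once
	}
}

func main() {
	buf := []byte("aaa\nxx test yy\nbbb\ntest\n")
	for _, line := range matchingLines(buf, []byte("test")) {
		fmt.Print(string(line))
	}
	// prints:
	// xx test yy
	// test
}
```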
>>> To what extent does the Apple/ARM system-on-chip family, with its much
>>> lower memory latency, change these considerations? How much slower will
>>> a simple Go clone of grep be than the C version, which has had decades
>>> of optimisation? And how much effort would it take to optimise the Go
>>> clone to reach the performance of the C version? If anyone wants to
>>> write the code, do the benchmarks, and publish the conclusions, send me
>>> a link. It will be an interesting read.
>>>
>>>> On Sunday, 8 May 2022 at 00:55:28 UTC+1 Const V wrote:
>>>> I cannot use fgrep in this environment.
>>>>
>>>> It looks like the Scan buffer is resizable:
>>>> https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/bufio/scan.go;l=190
>>>>
>>>> // Is the buffer full? If so, resize.
>>>> if s.end == len(s.buf) {
>>>> 	// Guarantee no overflow in the multiplication below.
>>>> 	const maxInt = int(^uint(0) >> 1)
>>>> 	if len(s.buf) >= s.maxTokenSize || len(s.buf) > maxInt/2 {
>>>> 		s.setErr(ErrTooLong)
>>>> 		return false
>>>> 	}
>>>> 	newSize := len(s.buf) * 2
>>>> 	if newSize == 0 {
>>>> 		newSize = startBufSize
>>>> 	}
>>>> 	if newSize > s.maxTokenSize {
>>>> 		newSize = s.maxTokenSize
>>>> 	}
>>>> 	newBuf := make([]byte, newSize)
>>>> 	copy(newBuf, s.buf[s.start:s.end])
>>>> 	s.buf = newBuf
>>>> 	s.end -= s.start
>>>> 	s.start = 0
>>>> }
>>>>
>>>>> On Saturday, May 7, 2022 at 4:43:59 PM UTC-7 kortschak wrote:
>>>>> On Sat, 2022-05-07 at 16:16 -0700, Const V wrote:
>>>>> > The question is will scanner.Scan handle a line of 100s of MB?
>>>>>
>>>>> No, at least not by default (https://pkg.go.dev/bufio#Scanner.Buffer).
>>>>> But at that point you want to start questioning why you're doing what
>>>>> you're doing.
>>>>> Your invocation of grep can be a lot simpler:
>>>>>
>>>>> ```
>>>>> func grep(match string, r io.Reader, stdout, stderr io.Writer) error {
>>>>> 	cmd := exec.Command("fgrep", match)
>>>>> 	cmd.Stdin = r
>>>>> 	cmd.Stdout = stdout
>>>>> 	cmd.Stderr = stderr
>>>>> 	return cmd.Run()
>>>>> }
>>>>> ```
>>>>>
>>>>> You can then do whatever you want with the buffered output and
>>>>> stderr. You can also make this use a pipe with minor changes if you
>>>>> expect the output to be long and stream processing of it would be
>>>>> sensible (use cmd.Start instead of Run and look into StdoutPipe).