Re: [go-nuts] Efficiently switch io.Reader to another decoder on error

Rory Campbell-Lange Tue, 14 Jan 2025 06:53:58 -0800

Thanks for finding that foolish error, Brian.

To wrap the thread up, the implementation below seems to work ok for reading 
both base64.RawStdEncoding and base64.StdEncoding encoded data using the 
base64.RawStdEncoding decoder.


Example usage:

    b64 := NewB64Translator(bytes.NewReader(encodedBytes))
    b, err := io.ReadAll(base64.NewDecoder(base64.RawStdEncoding, b64))

The implementation: 

    type B64Translator struct {
        br *bufio.Reader
    }

    func NewB64Translator(r io.Reader) *B64Translator {
        return &B64Translator{
            br: bufio.NewReader(r),
        }
    }

    // Read reads off the buffered reader expecting base64.StdEncoding bytes
    // with (potentially) 1-2 '=' padding characters at the end.
    // RawStdEncoding can be used for both StdEncoded and RawStdEncoded data
    // if the padding is removed.
    func (b *B64Translator) Read(p []byte) (n int, err error) {
        h := make([]byte, len(p))
        n, err = b.br.Read(h)
        if n == 0 {
            return 0, err
        }
        // check if there is any padding in the last three bytes
        tail := h[:n]
        if n > 3 {
            tail = h[n-3 : n]
        }
        c := bytes.Count(tail, []byte("="))
        // hand back everything before the padding, along with any read error
        copy(p, h[:n-c])
        return n - c, err
    }

For larger data, the "tail" approach seems to be very slightly faster than 
a naive bytes.Count(h, []byte("=")) over the whole buffer.

Thanks to everyone for their help.

Rory

On 14/01/25, 'Brian Candler' via golang-nuts (golang-nuts@googlegroups.com) 
wrote:
> I was more or less right. The input string, which you encoded to 
> "Qm9uam91ciwgam95ZXV4IGxpb24K", contains an encoded newline at the end. 
> It's not spurious.
> 
> Confirmed by the "echo" pipeline I gave above, or in Go itself:
> https://go.dev/play/p/6kSxiCfCTo4
> 
> You can also confirm it by multiplying the length of the input by 3/4:
> 
> % echo -n "Qm9uam91ciwgam95ZXV4IGxpb24K" | wc -c
>       28
> 
> 28*3/4 = 21
> B o n j o u r
> , _ j o y e u
> x _ l i o n \n
> 
> 
> On Tuesday, 14 January 2025 at 10:10:22 UTC Brian Candler wrote:
> 
> > Sorry ignore that, I hadn't checked your playground link.
> >
> > On Tuesday, 14 January 2025 at 10:07:53 UTC Brian Candler wrote:
> >
> >> > As I wrote earlier, I'm trying to avoid reading the entire email part 
> >> into memory to discover if I should use base64.StdEncoding or 
> >> base64.RawStdEncoding.
> >>
> >> As I asked before, why would you ever need to use RawStdEncoding? It just 
> >> means the MIME part was invalid, most likely corrupted/truncated.
> >>
> >> > One odd thing is that I'm getting extraneous newlines (shown by stars 
> >> in the output), eg:
> >>
> >> You are feeding two different inputs which do not differ by truncation 
> >> alone.
> >>
> >> % echo -n "Qm9uam91ciwgam95ZXV4IGxpb24K" | base64 -D | hexdump -c
> >> 0000000   B   o   n   j   o   u   r   ,       j   o   y   e   u   x
> >> 0000010   l   i   o   n  \n
> >> 0000015
> >>
> >> % echo -n "IkJvbmpvdXIsIGpveWV1eCBsaW9uIg==" | base64 -D | hexdump -c
> >> 0000000   "   B   o   n   j   o   u   r   ,       j   o   y   e   u   x
> >> 0000010       l   i   o   n   "
> >> 0000016
> >>
> >> The second one has encoded double-quotes before and after the content.
> >>
> >> On Monday, 13 January 2025 at 22:43:51 UTC Rory Campbell-Lange wrote:
> >>
> >>> As I wrote earlier, I'm trying to avoid reading the entire email part 
> >>> into memory to discover if I should use base64.StdEncoding or 
> >>> base64.RawStdEncoding. 
> >>>
> >>> The following seems to work reasonably well: 
> >>>
> >>> type B64Translator struct { 
> >>> br *bufio.Reader 
> >>> } 
> >>>
> >>> func NewB64Translator(r io.Reader) *B64Translator { 
> >>> return &B64Translator{ 
> >>> br: bufio.NewReader(r), 
> >>> } 
> >>> } 
> >>>
> >>> // Read reads off the buffered reader expecting base64.StdEncoding bytes 
> >>> // with (potentially) 1-3 '=' padding characters at the end. 
> >>> // RawStdEncoding can be used for both StdEncoded and RawStdEncoded data 
> >>> // if the padding is removed. 
> >>> func (b *B64Translator) Read(p []byte) (n int, err error) { 
> >>> h := make([]byte, len(p)) 
> >>> n, err = b.br.Read(h) 
> >>> if err != nil { 
> >>> return n, err 
> >>> } 
> >>> // to be optimised 
> >>> c := bytes.Count(h, []byte("=")) 
> >>> copy(p, h[:n-c]) 
> >>> // fmt.Println(string(h), n, string(p), n-c) 
> >>> return n - c, nil 
> >>> } 
> >>>
> >>> https://go.dev/play/p/H6ii7Vy-8as 
> >>>
> >>> One odd thing is that I'm getting extraneous newlines (shown by stars in 
> >>> the output), eg: 
> >>>
> >>> -- 
> >>> raw: Bonjour joyeux lion 
> >>> Qm9uam91ciwgam95ZXV4IGxpb24K 
> >>> ok: false 
> >>> decoded: Bonjour, joyeux lion* <-------------------- e.g. here 
> >>> -- 
> >>> std: "Bonjour, joyeux lion" 
> >>> IkJvbmpvdXIsIGpveWV1eCBsaW9uIg== 
> >>> ok: true 
> >>> decoded: "Bonjour, joyeux lion" 
> >>> -- 
> >>>
> >>> Any thoughts on that would be gratefully received. 
> >>>
> >>> Rory 
> >>>
> >>>
> >>> On 13/01/25, Rory Campbell-Lange (ro...@campbell-lange.net) wrote: 
> >>> > Thanks very much for the playground link and thoughts. 
> >>> > 
> >>> > The use case is reading base64 email parts, which could be of a very 
> >>> large size. It is unclear when processing these parts if they are base64 
> >>> padded or not. 
> >>> > 
> >>> > I'm trying to avoid reading the entire email part into memory. 
> >>> Consequently I think your earlier idea of adding padding (or removing it) 
> >>> in a wrapper could work. Perhaps wrapping the reader with another using a 
> >>> bufio.Reader to track bytes read and detect EOF. At EOF the wrapper could 
> >>> add padding if needed. 
> >>> > 
> >>> > Rory 
> >>> > 
> >>> > On 13/01/25, Axel Wagner (axel.wa...@googlemail.com) wrote: 
> >>> > > Just realized: If you twist the idea around, you get something easy 
> >>> to 
> >>> > > implement and more correct. 
> >>> > > Instead of stripping padding if it exist, you can ensure that the 
> >>> body *is* 
> >>> > > padded to a multiple of 4 bytes: https://go.dev/play/p/SsPRXV9ZfoS 
> >>> > > You can then feed that to base64.StdEncoding. If the wrapped Reader 
> >>> returns 
> >>> > > padded Base64, this does nothing. If it returns unpadded Base64, it 
> >>> adds 
> >>> > > padding. If it returns incorrect Base64, it will create a padded 
> >>> stream, 
> >>> > > that will then get rejected by the Base64 decoder. 
> >>> > > 
> >>> > > On Mon, 13 Jan 2025 at 10:31, Axel Wagner <axel.wa...@googlemail.com> 
> >>>
> >>> > > wrote: 
> >>> > > 
> >>> > > > Hi, 
> >>> > > > 
> >>> > > > one way to solve your problem is to wrap the body into an 
> >>> io.Reader that 
> >>> > > > strips off everything after the first `=` it finds. That can then 
> >>> be fed to 
> >>> > > > base64.RawStdEncoding. This approach requires no extra buffering 
> >>> or copying 
> >>> > > > and is easy to implement: https://go.dev/play/p/CwcVz7oietI 
> >>> > > > 
> >>> > > > The downside is, that this will not verify that the body is 
> >>> *either* 
> >>> > > > correctly padded Base64 *or* unpadded Base64. So, it will not 
> >>> report an 
> >>> > > > error if fed something like "AAA=garbage". 
> >>> > > > That can be remedied by buffering up to four bytes and, when 
> >>> encountering 
> >>> > > > an EOF, check that there are at most three trailing `=` and that 
> >>> the total 
> >>> > > > length of the stream is divisible by four. It's more finicky to 
> >>> implement, 
> >>> > > > but it should also be possible without any extra copies and only 
> >>> requires a 
> >>> > > > very small extra buffer. 
> >>> > > > 
> >>> > > > On Sun, 12 Jan 2025 at 22:29, Rory Campbell-Lange <
> >>> ro...@campbell-lange.net> 
> >>> > > > wrote: 
> >>> > > > 
> >>> > > >> Thanks very much for the links, pointers and possible solution. 
> >>> > > >> 
> >>> > > >> Trying to read base64 standard (padded) encoded data with 
> >>> > > >> base64.RawStdEncoding can produce an error such as 
> >>> > > >> 
> >>> > > >> illegal base64 data at input byte <n> 
> >>> > > >> 
> >>> > > >> Reading base64 raw (unpadded) encoded data produces the EOF 
> >>> error. 
> >>> > > >> 
> >>> > > >> I'll go with trying to read the standard encoded data up to maybe 
> >>> 1MB and 
> >>> > > >> then switch to base64.RawStdEncoding if I hit the "illegal base64 
> >>> data" 
> >>> > > >> problem, maybe with reference to bufio.Reader which has most of 
> >>> the methods 
> >>> > > >> suggested below. 
> >>> > > >> 
> >>> > > >> Yes, the use of a "Rewind" method would be crucial. I guess this 
> >>> would 
> >>> > > >> need to: 
> >>> > > >> 1. error if more than one buffer of data has been read 
> >>> > > >> 2. else re-read from byte 0 
> >>> > > >> 
> >>> > > >> Thanks again very much for these suggestions. 
> >>> > > >> 
> >>> > > >> Rory 
> >>> > > >> 
> >>> > > >> On 12/01/25, robert engels (ren...@ix.netcom.com) wrote: 
> >>> > > >> > Also, see this 
> >>> > > >> 
> >>> https://stackoverflow.com/questions/69753478/use-base64-stdencoding-or-base64-rawstdencoding-to-decode-base64-string-in-go
> >>>  
> >>> > > >> as I expected the error should be reported earlier than the end 
> >>> of stream 
> >>> > > >> if the chosen format is wrong. 
> >>> > > >> > 
> >>> > > >> > > On Jan 12, 2025, at 2:57 PM, robert engels <
> >>> ren...@ix.netcom.com> 
> >>> > > >> wrote: 
> >>> > > >> > > 
> >>> > > >> > > Also, this is what Gemini provided which looks basically 
> >>> correct - 
> >>> > > >> but I think encapsulating it with a Rewind() method would be 
> >>> easier to 
> >>> > > >> understand. 
> >>> > > >> > > 
> >>> > > >> > > 
> >>> > > >> > > 
> >>> > > >> > > While Go doesn't have a built-in PushbackReader like some 
> >>> other 
> >>> > > >> languages (e.g., Java), you can implement similar functionality 
> >>> using a 
> >>> > > >> custom struct and a buffer. 
> >>> > > >> > > 
> >>> > > >> > > Here's an example implementation: 
> >>> > > >> > > 
> >>> > > >> > > package main 
> >>> > > >> > > 
> >>> > > >> > > import ( 
> >>> > > >> > > "bytes" 
> >>> > > >> > > "io" 
> >>> > > >> > > ) 
> >>> > > >> > > 
> >>> > > >> > > type PushbackReader struct { 
> >>> > > >> > > reader io.Reader 
> >>> > > >> > > buffer *bytes.Buffer 
> >>> > > >> > > } 
> >>> > > >> > > 
> >>> > > >> > > func NewPushbackReader(r io.Reader) *PushbackReader { 
> >>> > > >> > > return &PushbackReader{ 
> >>> > > >> > > reader: r, 
> >>> > > >> > > buffer: new(bytes.Buffer), 
> >>> > > >> > > } 
> >>> > > >> > > } 
> >>> > > >> > > 
> >>> > > >> > > func (p *PushbackReader) Read(b []byte) (n int, err error) { 
> >>> > > >> > > if p.buffer.Len() > 0 { 
> >>> > > >> > > return p.buffer.Read(b) 
> >>> > > >> > > } 
> >>> > > >> > > return p.reader.Read(b) 
> >>> > > >> > > } 
> >>> > > >> > > 
> >>> > > >> > > func (p *PushbackReader) UnreadByte() error { 
> >>> > > >> > > if p.buffer.Len() == 0 { 
> >>> > > >> > > return io.EOF 
> >>> > > >> > > } 
> >>> > > >> > > lastByte := p.buffer.Bytes()[p.buffer.Len()-1] 
> >>> > > >> > > p.buffer.Truncate(p.buffer.Len() - 1) 
> >>> > > >> > > p.buffer.WriteByte(lastByte) 
> >>> > > >> > > return nil 
> >>> > > >> > > } 
> >>> > > >> > > 
> >>> > > >> > > func (p *PushbackReader) Unread(buf []byte) error { 
> >>> > > >> > > if p.buffer.Len() == 0 { 
> >>> > > >> > > return io.EOF 
> >>> > > >> > > } 
> >>> > > >> > > p.buffer.Write(buf) 
> >>> > > >> > > return nil 
> >>> > > >> > > } 
> >>> > > >> > > 
> >>> > > >> > > func main() { 
> >>> > > >> > > // Example usage 
> >>> > > >> > > r := NewPushbackReader(bytes.NewBufferString("Hello, 
> >>> World!")) 
> >>> > > >> > > buf := make([]byte, 5) 
> >>> > > >> > > r.Read(buf) 
> >>> > > >> > > r.UnreadByte() 
> >>> > > >> > > r.Read(buf) 
> >>> > > >> > > } 
> >>> > > >> > > 
> >>> > > >> > > Explanation: 
> >>> > > >> > > PushbackReader struct: This struct holds the underlying 
> >>> io.Reader and 
> >>> > > >> a buffer to store the pushed-back bytes. 
> >>> > > >> > > NewPushbackReader: This function creates a new PushbackReader 
> >>> from an 
> >>> > > >> existing io.Reader. 
> >>> > > >> > > Read method: This method reads bytes from either the buffer 
> >>> (if it 
> >>> > > >> contains data) or the underlying reader. 
> >>> > > >> > > UnreadByte method: This method pushes back a single byte into 
> >>> the 
> >>> > > >> buffer. 
> >>> > > >> > > Unread method: This method pushes back a slice of bytes into 
> >>> the 
> >>> > > >> buffer. 
> >>> > > >> > > Important Considerations: 
> >>> > > >> > > The buffer size is not managed automatically. You may need to 
> >>> adjust 
> >>> > > >> the buffer size based on your use case. 
> >>> > > >> > > This implementation does not handle pushing back beyond the 
> >>> initially 
> >>> > > >> read data. If you need to support arbitrary pushback, you'll need 
> >>> a more 
> >>> > > >> complex solution. 
> >>> > > >> > > 
> >>> > > >> > > Generative AI is experimental. 
> >>> > > >> > > 
> >>> > > >> > >> On Jan 12, 2025, at 2:53 PM, Robert Engels <
> >>> ren...@ix.netcom.com> 
> >>> > > >> wrote: 
> >>> > > >> > >> 
> >>> > > >> > >> You can see the two pass reader here 
> >>> > > >> 
> >>> https://stackoverflow.com/questions/20666594/how-can-i-push-bytes-into-a-reader-in-go
> >>>  
> >>> > > >> > >> 
> >>> > > >> > >> But yea, the basic premise is that you buffer the data so 
> >>> you can 
> >>> > > >> rewind if needed 
> >>> > > >> > >> 
> >>> > > >> > >> Are you certain it is reading to the end to return EOF? It 
> >>> may be 
> >>> > > >> returning eof once the parsing fails. 
> >>> > > >> > >> 
> >>> > > >> > >> Otherwise I would expect this is being decoded wrong - eg 
> >>> the mime 
> >>> > > >> type or encoding type should tell you the correct format before 
> >>> you start 
> >>> > > >> decoding. 
> >>> > > >> > >> 
> >>> > > >> > >>> On Jan 12, 2025, at 2:46 PM, Rory Campbell-Lange < 
> >>> > > >> ro...@campbell-lange.net> wrote: 
> >>> > > >> > >>> 
> >>> > > >> > >>> Thanks for the suggestion of a ReadSeeker to wrap an 
> >>> io.Reader. 
> >>> > > >> > >>> 
> >>> > > >> > >>> My google fu must be deserting me. I can find 
> >>> PushbackReader 
> >>> > > >> implementations in Java, but the only similar thing for Go I 
> >>> could find was 
> >>> > > >> https://gitlab.com/osaki-lab/iowrapper. If you have a specific 
> >>> > > >> recommendation for a ReadSeeker wrapper to an io.Reader that 
> >>> would be great 
> >>> > > >> to know. 
> >>> > > >> > >>> 
> >>> > > >> > >>> Since the base64 decoding error I'm looking for is an EOF, 
> >>> I guess 
> >>> > > >> the wrapper approach will not work when the EOF byte position is 
> >>> greater than the 
> >>> > > >> io.ReadSeeker buffer size. 
> >>> > > >> > >>> 
> >>> > > >> > >>> Rory 
> >>> > > >> > >>> 
> >>> > > >> > >>> On 12/01/25, robert engels (ren...@ix.netcom.com) wrote: 
> >>> > > >> > >>>> create a ReadSeeker that wraps the Reader providing the 
> >>> buffering 
> >>> > > >> (mark & reset) - normally the buffer only needs to be large 
> >>> enough to 
> >>> > > >> detect the format contained in the Reader. 
> >>> > > >> > >>>> 
> >>> > > >> > >>>> You can search Google for PushbackReader in Go and you’ll 
> >>> get a 
> >>> > > >> basic implementation. 
> >>> > > >> > >>>> 
> >>> > > >> > >>>>> On Jan 12, 2025, at 12:52 PM, Rory Campbell-Lange < 
> >>> > > >> ro...@campbell-lange.net> wrote: 
> >>> > > >> > >>> ... 
> >>> > > >> > >>>>> I'm attempting to rationalise the process [of avoiding 
> >>> reading 
> >>> > > >> email parts into byte slices] by simply wrapping the provided 
> >>> io.Reader 
> >>> > > >> with the necessary decoders to reduce memory usage and 
> >>> unnecessary 
> >>> > > >> processing. 
> >>> > > >> > >>>>> 
> >>> > > >> > >>>>> The wrapping strategy seems to work ok. However there is 
> >>> a 
> >>> > > >> particular issue in detecting base64.StdEncoding versus 
> >>> > > >> base64.RawStdEncoding, which requires draining the io.Reader 
> >>> using 
> >>> > > >> base64.StdEncoding and (based on the current implementation) 
> >>> switching to 
> >>> > > >> base64.RawStdEncoding if an io.ErrUnexpectedEOF is found. 
> >>> > > >> > >>>>> 
> >>> > > >> > >> 
> >>> > > >> > >> 
> >>> > > >> > >> -- 
> >>> > > >> > >> You received this message because you are subscribed to the 
> >>> Google 
> >>> > > >> Groups "golang-nuts" group. 
> >>> > > >> > >> To unsubscribe from this group and stop receiving emails 
> >>> from it, 
> >>> > > >> send an email to golang-nuts...@googlegroups.com <mailto: 
> >>> > > >> golang-nuts...@googlegroups.com>. 
> >>> > > >> > >> To view this discussion visit 
> >>> > > >> 
> >>> https://groups.google.com/d/msgid/golang-nuts/DD0C1480-D237-447A-B978-78FC8951FE05%40ix.netcom.com
> >>>  
> >>> > > >> < 
> >>> > > >> 
> >>> https://groups.google.com/d/msgid/golang-nuts/DD0C1480-D237-447A-B978-78FC8951FE05%40ix.netcom.com?utm_medium=email&utm_source=footer
> >>>  
> >>> > > >> >. 
> >>> > > >> > > 
> >>> > > >> > 
> >>> > > >> 
> >>> > > >> 
> >>> > > > 
> >>> > 
> >>>
> >>>
> >>
> 

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion visit 
https://groups.google.com/d/msgid/golang-nuts/Z4Z6VkUeV3w3EOQS%40campbell-lange.net.
